Lesson 1 · Module 3

Setting Up Your Environment Without Losing Your Mind

Colab, Python, and the stack every builder actually uses — no PhD required.

What does it actually take to run your first neural network, right now, today?

Marcus is a junior at Georgia Tech, double-majoring in CS and economics. He spent winter break watching YouTube tutorials on deep learning, nodding along confidently. He understood the concepts. He got it. Neurons, weights, backpropagation — totally got it.

Then he opened his laptop on January 6th and tried to actually run something. Two hours later he had sixteen browser tabs open, a broken conda environment, a CUDA version mismatch error he didn't understand, and a growing suspicion that everyone who says "just pip install tensorflow" has never actually pip installed tensorflow on a fresh machine.

This lesson is the one Marcus needed. We're skipping the local setup rabbit hole entirely — at least to start — and going straight to a tool where you can run real deep learning code in under three minutes, with free GPU access and zero configuration pain.

Why the Environment Problem Is Real (and Annoying)

Here's what a lot of tutorials don't tell you: getting the environment set up is genuinely one of the hardest parts for beginners, and it has nothing to do with deep learning. It's about operating systems, package managers, CUDA drivers, Python versions, and the fact that these things interact in ways that were designed by engineers for engineers who already know what they're doing.

The good news is that Google Colab exists. Colab is a browser-based Python notebook environment (think Google Docs, but for code) that runs on Google's servers. No installation. No CUDA drivers. No version conflicts. You open a browser, go to colab.research.google.com, and you're in. The free tier typically gives you access to a T4 GPU, subject to availability, which is more than enough to train the models we're building in this module.

A lot of your peers are still fighting with local setups because they think running locally is more "legitimate." It's not. Professional ML engineers use cloud environments constantly. Colab, Kaggle Notebooks, and cloud VMs are how most real work gets done. Local setup matters eventually — but for your first model, it's just friction.

Practical Takeaway

Go to colab.research.google.com right now, sign in with your Google account, and create a new notebook. In the first cell, type print("hello world") and press Shift+Enter. That's it: you have a working Python environment. To attach a GPU, go to Runtime → Change runtime type and select T4 GPU. Everything else in this module runs from exactly this starting point.

The Stack: What You're Actually Using

Deep learning in 2025 runs on a small set of tools that have become the industry default. You don't need to master all of them before building — but you need to know what they are and why they exist:

Python The language. Not because it's the fastest (it's not) but because the ecosystem around it — NumPy, PyTorch, scikit-learn — is unmatched for ML. You don't need to be an expert. You need to be comfortable enough to read and modify code.

PyTorch The deep learning framework we're using. Originally developed at Meta and now governed by the PyTorch Foundation, it is dominant in research and increasingly common in production. It's more intuitive than TensorFlow for beginners because you write code that looks like normal Python, with no symbolic computation graphs to understand upfront.

NumPy The foundational numerical computing library. PyTorch tensors behave much like NumPy arrays and convert to and from them easily (torch.from_numpy and tensor.numpy()). Data preparation code that runs before the model sees anything usually relies on NumPy.

Matplotlib Visualization. You'll use it to plot training loss curves, display sample images, and understand what your model is doing. A model you can't visualize is a model you can't debug.

All four of these are pre-installed in Colab. You don't install anything. You just import them.

# This runs in Colab with zero setup. Copy it exactly.
import torch
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")
    

Run that cell and you'll see which GPU you've been allocated. Don't panic if you get CPU: Colab stays on a CPU runtime until you pick a GPU under Runtime → Change runtime type, and the first model we build is small enough to train in under a minute on CPU anyway. GPU matters more in Lesson 4, when model size goes up.
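
Whichever runtime you end up on, the same code can serve both cases if the device is chosen once and reused. The following is a minimal sketch of that common pattern, not code from this module's notebook; the variable name device is only a convention.

```python
import torch

# Pick the GPU when one is attached to the runtime, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and later, models) are moved explicitly with .to(device).
x = torch.ones(2, 3).to(device)
print(f"Tensor lives on: {x.device.type}")
```

Because the choice is made in one place, switching the Colab runtime later requires no code changes.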

Notebooks vs. Scripts: What You're Looking At

Colab uses Jupyter notebooks — files where code and text live in the same document, in separate "cells." This is different from running a Python script from the command line, and the difference matters for how you work.

In a notebook, cells run independently but share memory. If you define a variable in cell 3, cell 7 can use it, but only if cell 3 has been executed first. This trips up almost everyone at least once. The execution counter in square brackets next to each cell shows which cells have run and in what order.

The workflow you'll use throughout this module: write a small chunk of code in one cell, run it, see the output, adjust, move to the next cell. This is called exploratory development and it's how most data scientists and ML engineers actually work. The final polished script comes later, once you know what works.

What Your Peers Are Getting Wrong

A lot of people try to write the entire training script before running any of it — like they're writing a term paper. This almost never works in ML. A model that fails silently at step 8 because of a shape mismatch in step 3 is incredibly hard to debug when you haven't verified anything along the way. Run small. Verify often. Debug early.

Your First Import Block and Why It Matters

The first cell of every ML notebook you write should be your imports — all the libraries your code needs. This isn't just style; it's practical. If an import fails, you want to know immediately, not twenty cells later when you finally call a function that doesn't exist.

Here's the import block you'll use throughout this module. Study what each line brings in:

import torch
import torch.nn as nn           # Neural network building blocks
import torch.optim as optim     # Optimizers (SGD, Adam, etc.)
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt
# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
    

That last part, setting random seeds, deserves a mention. Neural networks initialize with random weights, and training involves randomness in how data is shuffled and sampled. If you don't fix the seed, your model will produce slightly different results every time it runs, which makes experiments hard to debug or compare. (Some GPU operations are nondeterministic, so seeded GPU runs can still differ slightly.) Seed 42 is the community convention, a nod to The Hitchhiker's Guide to the Galaxy. Don't fight it.
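
The effect of a seed is easy to see on a small scale. This sketch, which is an illustration rather than part of the module's setup cell, draws random numbers twice from the same seed and once more without reseeding.

```python
import torch

torch.manual_seed(42)
first = torch.randn(3)

torch.manual_seed(42)       # reset the generator to the same state
second = torch.randn(3)

third = torch.randn(3)      # drawn without reseeding, so it continues the stream

print(torch.equal(first, second))  # same seed, same numbers
print(torch.equal(first, third))   # later position in the random stream
```

The same mechanism is what makes weight initialization repeatable when the seed is set before the model is created.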

You are now looking at everything you need to start building. The next lesson digs into what you're actually building — a real neural network, defined in PyTorch, trained on real data. But the environment you've set up here is the foundation. If these cells run cleanly, you're ready.

Quiz — Lesson 1

Setting up your environment without losing your mind

1. You open Colab and run torch.cuda.is_available() and it returns False. What is the most likely explanation?

Right. Colab defaults to a CPU runtime. You need to explicitly switch to GPU under Runtime → Change runtime type → T4 GPU. PyTorch itself is always available regardless.

Not quite. PyTorch is pre-installed in all Colab environments. The issue is almost always that a GPU runtime hasn't been selected — go to Runtime → Change runtime type to fix it.

2. In a Jupyter notebook, you define a variable model = MyNet() in Cell 5. You then restart the kernel and run only Cell 7, which uses model. What happens?

Exactly. Restarting the kernel wipes memory. You have to re-run cells in order. This is one of the most common sources of confusion in notebooks — "but I defined it earlier!" Yes, earlier in a session that no longer exists.

Restarting the kernel clears everything from memory. Variables don't persist across restarts. You need to re-run Cell 5 (and everything before it that Cell 5 depends on) before Cell 7 will work.

3. Why does the standard setup block include torch.manual_seed(42)?

Correct. Neural networks involve randomness at initialization and training time. Without a fixed seed, two identical-looking runs can produce different results, making debugging and comparison nearly impossible. The number 42 is just convention.

The seed is about reproducibility. Deep learning involves randomness — random weight initialization, random data shuffling — and fixing the seed means every run of your notebook produces the same result, which is essential for debugging.

4. A classmate argues that training models locally is more "legitimate" than using Colab because local training is closer to production. How would you respond?

Right. The majority of professional ML work happens on cloud compute — AWS, GCP, Azure, internal clusters. Colab is a simplified version of that workflow, not a departure from it. Local setup has a place, but it's not more legitimate by default.

Professional ML engineers train on cloud compute constantly. Colab, Kaggle Notebooks, and cloud VMs are industry-standard. The belief that "local = serious" is mostly status anxiety — it doesn't reflect how real work gets done.

5. Which of the following best describes the role of torch.nn in the PyTorch ecosystem?

Correct. torch.nn is where the architectural components live: Linear layers, Conv2d, ReLU, CrossEntropyLoss, and so on. Data loading is torch.utils.data, GPU management is lower-level, and visualization is matplotlib.

torch.nn is the neural network module — it contains layers (Linear, Conv2d), activation functions (ReLU, Sigmoid), and loss functions (CrossEntropyLoss). Data loading is handled by torch.utils.data instead.

Lab 1: Environment Troubleshooter

You're a new ML intern. Your setup has issues. Debug it with your AI peer.

The scenario

You just started an ML internship. Your manager sent you a Colab notebook and asked you to get it running before your first standup tomorrow. The notebook imports PyTorch, checks for GPU, sets seeds, and loads data. Three of the first five cells are throwing errors you've never seen before.

Your AI peer has done this before. They're going to help you work through what's actually wrong — but they're not going to just hand you answers. You need to explain what you're seeing and make calls about what to fix.

Start by describing one of the errors you're encountering. Paste it in as if you copied it from the Colab output. Your peer will help you figure out what it means and what to do about it.

AI Peer — Setup Debugger

Hey — so you've got some broken cells. Walk me through what you're seeing. Copy one of the error messages in here and tell me which cell it came from. We'll figure it out together, but I need you to actually describe what's happening rather than just asking me to "fix it."

Lesson 2 · Module 3

Defining Your First Neural Network in PyTorch

Writing the architecture — layers, activations, forward pass — in code that actually runs.

What does a neural network look like as Python code, and what does each line actually do?

Priya is building a portfolio project for job applications. She wants to classify whether a Spotify song will chart based on audio features — tempo, danceability, energy, valence. She's found a dataset. She's set up Colab. She knows what she wants to predict.

Now she has to actually write a neural network. She opens the PyTorch docs and immediately runs into something called nn.Module. She reads the explanation three times. It talks about subclassing and __init__ and forward methods. It's technically accurate and completely opaque.

This lesson is what Priya actually needed: a concrete, line-by-line walkthrough of what a neural network definition looks like in PyTorch, why it's written that way, and what happens when you call it. By the end, you'll have a model definition you understand well enough to modify — which is the only kind of code that's actually useful.

The nn.Module Pattern: Why PyTorch Works This Way

Every neural network you build in PyTorch is a Python class that inherits from nn.Module. This isn't arbitrary — it's what gives PyTorch the ability to track your layers, manage parameters, enable gradient computation, and handle GPU transfer automatically. When you subclass nn.Module, you get all of that for free.

The pattern has two required parts: an __init__ method where you define your layers, and a forward method where you define what happens when data flows through the network. That's it. Everything else is optional.

import torch.nn as nn

class SongClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the layers
        self.fc1 = nn.Linear(8, 64)   # 8 input features → 64 neurons
        self.fc2 = nn.Linear(64, 32)  # 64 → 32
        self.fc3 = nn.Linear(32, 1)   # 32 → 1 output (chart/no-chart)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    

Let's be honest about what's confusing here. The super().__init__() line looks like boilerplate magic, and it kind of is: it calls the parent class's constructor, which sets up the internal bookkeeping nn.Module uses to register your layers and their parameters. If you forget it, PyTorch raises an AttributeError as soon as you assign the first layer. You'll just copy this every time.

The forward method is where your data actually flows. Notice that fc1, fc2, and fc3 are just functions at this point — you call them on x, and they transform it. The ReLU is applied after each linear layer (except the last), which is standard practice for hidden layers.

What Each Layer Actually Does

When you write nn.Linear(8, 64), you're creating a layer that multiplies an input vector of size 8 by a weight matrix to produce an output vector of size 64 — and adds a bias vector. The weights and biases are the parameters that get updated during training. At initialization, they're random. After training, they encode what the network has learned.
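
To make the multiply-and-add description concrete, the sketch below reproduces a Linear layer's output by hand. The batch size of 4 and the seed are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 64)
x = torch.randn(4, 8)                 # a batch of 4 examples, 8 features each

# PyTorch stores the weight as (out_features, in_features), hence the transpose.
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(manual, builtin, atol=1e-5))
```

The weight matrix and bias vector printed here are exactly the tensors the optimizer will update during training.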

The dimensions matter. If your data has 8 features (tempo, danceability, energy, valence, loudness, speechiness, acousticness, liveness), the first layer must start with in_features=8. If it doesn't match, PyTorch will throw a shape mismatch error the moment you pass data through. This is the most common runtime error for beginners — dimensions not matching up.
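
It can help to trigger the mismatch on purpose once, so the error is familiar when it shows up for real. In this sketch the 10-feature batch is a deliberately wrong, made-up input.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 64)
good_batch = torch.randn(4, 8)    # 8 features: matches in_features
bad_batch = torch.randn(4, 10)    # 10 features: does not match

print(layer(good_batch).shape)

try:
    layer(bad_batch)
    mismatch_message = None
except RuntimeError as err:
    mismatch_message = str(err)
    print(f"RuntimeError: {mismatch_message}")
```

The message names both shapes involved, which is usually enough to tell whether the data or the layer definition is the side that needs fixing.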

nn.Linear(in, out) A fully connected layer. Every input neuron connects to every output neuron. This is the basic building block of most non-image models.

nn.ReLU() Rectified Linear Unit — outputs max(0, x). Used after linear layers to introduce non-linearity. Without activation functions, stacking linear layers is mathematically equivalent to just one linear layer.

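nn.Dropout(p) During training, zeroes each activation independently with probability p (p=0.3 drops about 30% of them) and scales the survivors by 1/(1-p) so the expected total stays the same. Reduces overfitting by preventing the network from relying too heavily on any single neuron.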
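
Two of the claims in these definitions can be checked directly. The sketch below uses small made-up layer sizes: first it shows that two Linear layers with no activation between them behave like a single Linear layer, then it shows what Dropout does in training mode versus evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1) Two stacked linear layers with no activation collapse into one.
a, b = nn.Linear(8, 16), nn.Linear(16, 4)
x = torch.randn(5, 8)
stacked = b(a(x))
merged_weight = b.weight @ a.weight                 # shape (4, 8)
merged_bias = b.weight @ a.bias + b.bias            # shape (4,)
collapsed = x @ merged_weight.T + merged_bias
print(torch.allclose(stacked, collapsed, atol=1e-5))

# 2) Dropout is active in training mode and does nothing in eval mode.
drop = nn.Dropout(p=0.3)
ones = torch.ones(10_000)

drop.train()
train_out = drop(ones)
zero_fraction = (train_out == 0).float().mean().item()
print(f"Zeroed in train mode: {zero_fraction:.2f}")       # close to 0.30
print(f"Surviving value: {train_out.max().item():.4f}")   # 1 / (1 - 0.3)

drop.eval()
eval_out = drop(ones)
print(torch.equal(eval_out, ones))
```

The first check is the reason ReLU sits between the layers; the second is the reason evaluation code has to switch modes before making predictions.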

Practical Takeaway

After defining your model class, always instantiate it and print it: model = SongClassifier(); print(model). PyTorch will print a structured summary of every layer and its dimensions. If a layer is missing or the dimensions look wrong, you'll catch it here — before you even load data.

Instantiating and Inspecting Your Model

Writing the class doesn't create a model — it defines a blueprint. You need to instantiate it:

model = SongClassifier()
print(model)
# Output:
# SongClassifier(
#   (fc1): Linear(in_features=8, out_features=64, bias=True)
#   (fc2): Linear(in_features=64, out_features=32, bias=True)
#   (fc3): Linear(in_features=32, out_features=1, bias=True)
#   (relu): ReLU()
#   (dropout): Dropout(p=0.3, inplace=False)
# )

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# → Total parameters: 2,689
    

2,689 parameters. That's a tiny model by any measure; frontier language models are reported to have hundreds of billions of parameters or more. But those 2,689 numbers, when trained well, can learn to classify songs with meaningful accuracy. This is the point: you don't need massive models to learn something real.

The parameter count formula for a Linear layer is (in_features × out_features) + out_features (the bias). Check the math: fc1 has (8×64)+64 = 576. fc2 has (64×32)+32 = 2,080. fc3 has (32×1)+1 = 33. Total: 576 + 2,080 + 33 = 2,689. The activation and dropout layers add none, because they have no learnable parameters.
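
The per-layer arithmetic can be confirmed in code. This sketch rebuilds the three Linear layers by themselves, with the same sizes as SongClassifier, so it runs without the class definition.

```python
import torch.nn as nn

# Same layer sizes as SongClassifier, rebuilt here so the snippet runs on its own.
layers = {
    "fc1": nn.Linear(8, 64),
    "fc2": nn.Linear(64, 32),
    "fc3": nn.Linear(32, 1),
}

total = 0
for name, layer in layers.items():
    by_formula = layer.in_features * layer.out_features + layer.out_features
    by_counting = sum(p.numel() for p in layer.parameters())
    assert by_formula == by_counting
    total += by_counting
    print(f"{name}: {by_counting:,}")

print(f"Total: {total:,}")   # 576 + 2,080 + 33 = 2,689
```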

The Forward Pass: Running Data Through Your Model

Once your model exists, you can run data through it with a single call. In PyTorch, calling model(x) runs model.forward(x) for you, wrapped in the __call__ mechanism, which also runs any registered hooks and other internal machinery. Always use model(x), not model.forward(x) directly.

# Create a fake batch of 4 songs with 8 features each
fake_data = torch.randn(4, 8)  # shape: (batch_size, num_features)

# Run forward pass
output = model(fake_data)
print(f"Input shape:  {fake_data.shape}")   # → torch.Size([4, 8])
print(f"Output shape: {output.shape}")      # → torch.Size([4, 1])
print(output)
    

The output is 4 numbers, one per song, representing the raw score (called a logit) for whether that song charts. The numbers aren't probabilities yet — they're unbounded. We'll apply a sigmoid or use a loss function that handles this in Lesson 3.
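
As a preview of what turning logits into probabilities looks like, here is a small sketch. The three logit values are made up, and the 0.5 cutoff is a common default rather than something this module prescribes.

```python
import torch

logits = torch.tensor([-2.0, 0.0, 3.0])   # made-up raw scores
probs = torch.sigmoid(logits)              # squashes each value into (0, 1)
labels = (probs >= 0.5).long()             # 0.5 is a common default threshold

print(probs)
print(labels)
```

A logit of 0 maps to a probability of exactly 0.5, so the sign of the logit already tells you which side of that threshold a prediction falls on.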

What just happened, step by step: your random 4×8 input matrix was multiplied through fc1 to become 4×64, then ReLU zeroed out negatives, then dropout randomly zeroed about 30% of the values (a freshly created model starts in training mode, so dropout is active), then fc2 compressed to 4×32, ReLU ran again, and fc3 compressed to 4×1. Each step transformed the data, and the final number encodes the model's current (untrained, therefore meaningless) prediction. Training will make those numbers meaningful.
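
The same walk-through can be printed rather than imagined. This sketch applies the layers one at a time, in the order SongClassifier.forward uses them, and records the shape after each step. The layers are recreated standalone so the snippet runs without the class.

```python
import torch
import torch.nn as nn

fc1, fc2, fc3 = nn.Linear(8, 64), nn.Linear(64, 32), nn.Linear(32, 1)
relu, dropout = nn.ReLU(), nn.Dropout(0.3)

x = torch.randn(4, 8)
shapes = [tuple(x.shape)]
for step in (fc1, relu, dropout, fc2, relu, fc3):
    x = step(x)
    shapes.append(tuple(x.shape))
    print(f"{step.__class__.__name__:<8} -> {shapes[-1]}")
```

Only the Linear layers change the shape; ReLU and Dropout change values but leave dimensions alone.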

What Peers Miss About Architecture Design

Most beginners spend too much time on architecture before training anything. The model above — three layers, ReLU, dropout — is a reasonable starting point for most tabular data problems. You will need to adjust it, but you can only know how after you've actually trained it and seen where it fails. Design → train → evaluate → redesign. Not design → perfect → train.

Quiz — Lesson 2

Defining your first neural network in PyTorch

1. You define a model with nn.Linear(10, 50) as the first layer. Your input data has shape (32, 12). What happens when you run the forward pass?

Correct. PyTorch enforces dimension matching strictly. nn.Linear(10, 50) expects exactly 10 input features. A 12-feature input will throw a RuntimeError immediately. This is actually helpful — it catches mistakes early.

PyTorch won't silently accept wrong dimensions. The layer was defined to expect 10 features; passing 12 will throw a RuntimeError: mat1 and mat2 shapes cannot be multiplied. Dimension mismatch errors are the most common beginner error in PyTorch.

2. Why is an activation function like ReLU necessary between linear layers?

Exactly right. A composition of linear functions is still linear. The whole point of depth — stacking layers — only works if non-linearity is introduced between them. ReLU does this with a simple max(0,x) that breaks the linearity chain.

This is a fundamental concept worth revisiting. Two linear transformations composed together are still linear (L2(L1(x)) = (L2·L1)(x)). Without an activation function, depth adds nothing. Non-linearity is what allows deep networks to approximate complex functions.

3. Priya's model has nn.Dropout(0.3) in the architecture. During inference on new songs, she wants predictions — not training. What should she do?

Right. model.eval() switches the model to evaluation mode, which disables dropout and makes batch normalization use its stored running statistics instead of per-batch statistics. Use model.train() to switch back for training. Forgetting this is a subtle bug: with dropout still active, your model will give different results each time you evaluate it.

The correct answer is model.eval(). PyTorch models have two modes: training (dropout active) and evaluation (dropout disabled). Inference should always happen after calling model.eval(), otherwise dropout will randomly change your predictions every run.

4. You call model(fake_data) instead of model.forward(fake_data). What is the difference in practice?

Correct. model(x) invokes __call__, which runs any registered hooks around the call to forward(). Bypassing this with model.forward(x) silently skips those hooks, which can break tools that depend on them, such as profilers and debugging utilities.

Always use model(x). It invokes __call__, which wraps your forward method with PyTorch's hook machinery. Calling model.forward(x) directly skips all of that.

5. You want to count the trainable parameters in your model. Which approach is correct?

Correct. model.parameters() returns an iterator of all parameter tensors. p.numel() returns the number of elements in each tensor. Summing them gives the total parameter count. This is the standard pattern used everywhere in the PyTorch ecosystem.

The correct approach is sum(p.numel() for p in model.parameters()). model.parameters() yields all learnable parameter tensors; numel() counts elements in each. There is no built-in parameter_count() method or torch.count_params() function.

Lab 2: Architecture Design Consultant

A startup founder needs a neural network. You're the ML person in the room.

The scenario

You're at a startup that's building a tool to predict whether a freelance job posting will get enough quality applicants within 48 hours — based on 12 features including pay rate, description length, required skills count, client rating, and category.

The founder wants a neural network. Your AI peer is a senior engineer who will help you think through the architecture, but they'll push back on choices that don't make sense. You need to propose a model design and defend it.

Start by proposing a neural network architecture for this problem. Specify: input size, number of hidden layers, neurons per layer, activation functions, and output format. Your peer will challenge your choices — be ready to justify them or revise.

AI Peer — Architecture Review

Alright, you're the ML lead on this one. We've got 12 input features, binary classification, and a founder who wants "a neural network." Before I say anything, I want to hear your proposed architecture first. What are you thinking — layers, neurons, activations, output? Walk me through it.

Lesson 3 · Module 3

Training Your Model: The Loop That Makes It Learn

Loss functions, optimizers, and the training loop — the exact code you run to make a network learn.

What actually happens when a neural network "trains," and how do you write the code that does it?

Jordan is applying for a data science co-op at a healthcare analytics company. The technical screen includes a take-home: "Build and train a binary classifier on the provided dataset. Submit your notebook with training loss curves." Twenty-four hours. Jordan has defined the model. The data is loaded. And now they're staring at the blank cell where the training loop should go.

They've watched the loss function get explained on YouTube. They understand gradient descent in the abstract — find the direction of steepest descent, take a step in that direction, repeat. But the actual code? Every tutorial either skips the details or buries them in a framework that hides all the interesting parts.

This lesson writes the training loop explicitly — no wrappers, no magic, just the raw PyTorch that runs every iteration. Once you understand what these 15 lines do, you can debug any training problem you encounter, because you'll know where to look.

The Three Things Training Needs

Before you write a single line of the training loop, you need three things configured:

A loss function: the mathematical measure of how wrong the model's predictions are. Different tasks use different loss functions. Binary classification uses nn.BCEWithLogitsLoss. Multi-class classification uses nn.CrossEntropyLoss. Regression uses nn.MSELoss. The choice is not arbitrary: it determines the gradients the model receives.
An optimizer: the algorithm that updates weights based on gradients. Adam is the default choice for most problems because it adapts the step size for each parameter individually and usually needs less tuning than vanilla SGD. Learning rate is the most important hyperparameter here, and 1e-3 is a standard starting point.
A data loader: an iterator that feeds your training data in batches. Training on one example at a time is slow and produces noisy gradients; training on the full dataset at once often doesn't fit in memory and gives you only one weight update per pass through the data. Batches of 32 or 64 are a reasonable default.

# Set up training components
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Wrap data in a DataLoader (assuming X_train and y_train are tensors)
dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
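
The setup above assumes X_train and y_train already exist. For experimenting before real data is loaded, a synthetic stand-in can be sketched as follows. The 200 rows, the 8 features, and the labeling rule are all invented for illustration and have nothing to do with any real dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Hypothetical stand-in data: 200 examples, 8 features, binary labels.
X_train = torch.randn(200, 8)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).long()   # arbitrary rule, illustration only

dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)
print(f"Batches per epoch: {len(train_loader)}")   # 200 / 32 rounds up to 7
```

Note that the last batch of each epoch holds only 8 examples here, since 200 is not a multiple of 32; training code has to cope with a smaller final batch.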
    

The Training Loop, Line by Line

Here is the complete training loop. Every line matters. Every line is doing something specific.

num_epochs = 50
train_losses = []

model.train()  # Set to training mode (enables dropout)

for epoch in range(num_epochs):
    epoch_loss = 0.0

    for X_batch, y_batch in train_loader:
        # Step 1: Zero out gradients from last step
        optimizer.zero_grad()

        # Step 2: Forward pass — get predictions
        predictions = model(X_batch)

        # Step 3: Compute loss (squeeze(1) turns (batch, 1) into (batch,), even for a batch of 1)
        loss = criterion(predictions.squeeze(1), y_batch.float())

        # Step 4: Backward pass — compute gradients
        loss.backward()

        # Step 5: Update weights
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
    

Let's be specific about what's happening and why the order is non-negotiable:

zero_grad(): PyTorch accumulates gradients by default — each backward pass adds to existing gradients rather than replacing them. If you don't zero them at the start of each step, gradients from previous batches contaminate the current update. This is an intentional design choice (useful for gradient accumulation) but it means you have to be explicit.

loss.backward(): This is where the magic actually happens. PyTorch has been tracking every computation in your forward pass, building a computational graph. backward() traverses this graph in reverse, computing the gradient of the loss with respect to every parameter using the chain rule. You don't write this math — PyTorch does it automatically.

optimizer.step(): Uses the freshly computed gradients to nudge every weight slightly in the direction that reduces loss. The size of the nudge is controlled by the learning rate. Too large and the model oscillates or diverges. Too small and it learns agonizingly slowly.
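
All three calls can be watched on a problem small enough to check by hand. The sketch below uses a single parameter and a toy loss, with plain SGD chosen instead of Adam so that the update is simple arithmetic; none of it comes from the song classifier.

```python
import torch

# One trainable parameter and a toy loss, loss = (w - 3)^2, minimized at w = 3.
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# backward(): d(loss)/dw = 2 * (w - 3) = -4 at w = 1.
loss = ((w - 3) ** 2).sum()
loss.backward()
grad_once = w.grad.item()

# A second backward() without zero_grad() adds to the stored gradient.
loss = ((w - 3) ** 2).sum()
loss.backward()
grad_accumulated = w.grad.item()

# zero_grad() clears it, so the next backward() starts fresh.
optimizer.zero_grad()
loss = ((w - 3) ** 2).sum()
loss.backward()
grad_fresh = w.grad.item()

# step(): w <- w - lr * grad = 1 - 0.1 * (-4) = 1.4
optimizer.step()

print(grad_once, grad_accumulated, grad_fresh)
print(w.item())
```

The doubled gradient in the middle is what a forgotten zero_grad() looks like: nothing crashes, but the update that follows would be twice as large as intended.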

Practical Takeaway

After your training loop finishes, plot the loss curve: plt.plot(train_losses); plt.xlabel('Epoch'); plt.ylabel('Loss'); plt.show(). A healthy training curve falls steeply at first and gradually flattens. If it's still falling sharply at epoch 50, train longer. If it's oscillating wildly, your learning rate is too high. If it barely moves, your learning rate is too low or your data has problems.
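
The three learning-rate symptoms described above show up even on the simplest possible problem. This sketch runs plain gradient descent on loss(w) = w², a stand-in chosen because the behavior can be verified by hand; the specific learning rates are illustrative and do not transfer directly to Adam or to a real network.

```python
def descend(lr, steps=20, w=1.0):
    """Plain gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    losses = []
    for _ in range(steps):
        w = w - lr * 2 * w
        losses.append(w ** 2)
    return losses

too_low = descend(lr=0.001)
reasonable = descend(lr=0.1)
too_high = descend(lr=1.1)

print(f"lr=0.001 final loss: {too_low[-1]:.4f}")     # barely moved from 1.0
print(f"lr=0.1   final loss: {reasonable[-1]:.6f}")  # close to 0
print(f"lr=1.1   final loss: {too_high[-1]:.1f}")    # grew instead of shrinking
```

With the largest rate, each step overshoots the minimum and lands farther away on the other side, which is the mechanism behind an oscillating or exploding loss curve.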

What "Loss" Means in Practice

The loss number the training loop prints is not accuracy. It's a scalar measuring prediction error according to the loss function. For BCEWithLogitsLoss, lower is better, and a loss around 0.693 (which is -log(0.5)) means the model is essentially guessing randomly — it hasn't learned anything yet. Getting loss below 0.4 typically corresponds to meaningfully better-than-random predictions on a binary task.
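
The 0.693 figure can be reproduced directly. In this sketch the logits are all zero, which corresponds to predicting a probability of 0.5 for every example; the six labels are made up.

```python
import math
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.zeros(6)                              # sigmoid(0) = 0.5 for every example
labels = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])

chance_loss = criterion(logits, labels).item()
print(f"{chance_loss:.4f}  vs  -log(0.5) = {-math.log(0.5):.4f}")
```

The result does not depend on which labels are used: a prediction of 0.5 costs the same whether the true label is 0 or 1.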

Your peers who are new to this will sometimes look at a loss of 0.35 and ask "is that good?" The only honest answer is: compared to what? Loss numbers are meaningful relative to where you started, relative to a baseline (like always predicting the majority class), and relative to your validation loss. A loss of 0.35 on training data with 0.8 on validation means you're overfitting badly. A loss of 0.35 on both means you have a real model.
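
One way to put a number on the baseline is to compute the loss of a model that ignores its inputs and predicts the same probability for every example. The sketch below does this in plain Python; the 80/20 class split is a hypothetical example, and this constant-probability predictor is the loss-based analogue of always guessing the majority class.

```python
import math

def constant_predictor_loss(positive_rate):
    """Best BCE achievable by predicting the same probability for every example."""
    p = positive_rate
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

balanced = constant_predictor_loss(0.5)     # 50/50 classes
imbalanced = constant_predictor_loss(0.8)   # hypothetical 80/20 split

print(f"Balanced baseline:   {balanced:.3f}")
print(f"Imbalanced baseline: {imbalanced:.3f}")
```

On the imbalanced split the do-nothing baseline already sits near 0.5, so a trained model at 0.45 has learned far less than the same number would suggest on balanced data.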

Speaking of validation: every training loop should have a counterpart validation loop that runs on held-out data after each epoch. Here's the minimal version:

model.eval()
val_loss = 0.0
with torch.no_grad():
    for X_val, y_val in val_loader:
        preds = model(X_val)
        loss = criterion(preds.squeeze(1), y_val.float())
        val_loss += loss.item()

print(f"Val Loss: {val_loss/len(val_loader):.4f}")
model.train()  # Switch back for next training epoch
    

torch.no_grad() disables gradient computation during validation — you're not updating weights, just measuring performance, so there's no reason to build the computational graph. This makes validation faster and uses less memory.
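
The difference is visible on the output tensor itself. This sketch runs the same made-up layer and input with and without torch.no_grad() and inspects whether a graph was recorded.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 1)
x = torch.randn(4, 8)

tracked = layer(x)               # normal forward pass: graph is recorded
with torch.no_grad():
    untracked = layer(x)         # same numbers, no graph

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
print(torch.equal(tracked, untracked))
```

The predictions are identical either way; only the bookkeeping needed for backward() is skipped.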

What the Training Loop Doesn't Tell You

The training loop will run without errors on complete garbage. If your labels are scrambled, your features are in the wrong order, or you have a data leak, the loop will happily run 50 epochs, print decreasing loss numbers, and produce a model that is worthless. The training loop is not a sanity check — you have to build those separately. Always verify one batch of data looks correct before running the full loop.
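
A batch check does not need to be elaborate to be useful. The helper below is a hypothetical sketch (check_batch is not a PyTorch function) that tests a few properties a binary classification batch should have; the data is random stand-in data so the snippet runs by itself. It cannot catch scrambled labels or leakage, which still need a manual look at the data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def check_batch(X_batch, y_batch, num_features):
    """Return a list of problems found in one batch (empty list means it looks sane)."""
    problems = []
    if X_batch.ndim != 2 or X_batch.shape[1] != num_features:
        problems.append(f"unexpected feature shape {tuple(X_batch.shape)}")
    if X_batch.shape[0] != y_batch.shape[0]:
        problems.append("features and labels have different batch sizes")
    if torch.isnan(X_batch).any():
        problems.append("NaN values in features")
    if not set(y_batch.unique().tolist()) <= {0, 1}:
        problems.append("labels other than 0 and 1")
    return problems

# Hypothetical stand-in data so the check can run by itself.
X = torch.randn(64, 8)
y = torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=32)

X_batch, y_batch = next(iter(loader))
print(check_batch(X_batch, y_batch, num_features=8))
print(check_batch(X_batch, y_batch + 5, num_features=8))
```

The first call returns an empty list; the second, with deliberately corrupted labels, reports the problem before any training time is spent.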

Quiz — Lesson 3

Training your model: the loop that makes it learn

1. Jordan's training loop shows loss decreasing from 0.69 to 0.35 over 50 epochs, but validation loss stays at 0.68 throughout. What is the most likely diagnosis?

Exactly. When training loss falls significantly but validation loss stays flat, the model is learning the training data specifically rather than underlying patterns. Common fixes: add dropout, reduce model size, add regularization, or get more training data.

This is a classic overfitting signature — train loss drops, val loss stagnates. The model is memorizing rather than generalizing. This is not a learning rate problem; the training loss is falling smoothly. Increase dropout, reduce model complexity, or add more data.

2. Why must optimizer.zero_grad() be called at the beginning of each training step?

Right. Gradient accumulation is a feature in PyTorch (intentionally used in some training techniques), but for a standard training loop it's a bug. Forgetting zero_grad is one of the most common silent errors — the model will appear to train but weights will be updated incorrectly.

PyTorch accumulates gradients across backward passes intentionally — it's a design choice. For a standard training loop, this means you must manually zero them before each step with optimizer.zero_grad(), otherwise old and new gradients stack up and corrupt your updates.

3. You're training a model and the loss plot shows wild oscillations — it goes from 0.6 down to 0.2 and back up to 0.8 repeatedly. The most likely cause and fix is:

Correct. Wild oscillation in loss is the hallmark symptom of a learning rate that's too high. The optimizer steps are so large that the weights jump past the optimal values and land somewhere worse. Dropping lr from 1e-3 to 1e-4 often fixes this immediately.

Oscillating loss is almost always a learning rate issue. The steps are too large — each update overshoots the minimum and lands on the other side. Fix: reduce learning rate (try 10x lower than current). The number of layers is unlikely to cause this specific pattern.

4. In the validation loop, why is torch.no_grad() used?

Right. torch.no_grad() tells PyTorch not to build the computational graph during the forward pass. Since you're not calling loss.backward() during validation, you don't need the graph — and skipping it reduces memory usage and speeds up inference.

Weights don't get updated unless you explicitly call optimizer.step() — there's no accidental update risk. torch.no_grad() is about efficiency: when you're not backpropagating, there's no reason to track all the operations needed to compute gradients. It speeds up validation and saves GPU memory.

5. You train a model for 50 epochs and the final training loss is 0.12. Validation loss is 0.13. A classmate says "your model isn't overfitting at all since the losses are so close." Is this correct?

Nuanced and correct. Close train/val loss suggests the model generalizes — but loss isn't the end goal. A model can have low loss and poor accuracy if the data is imbalanced, or if the threshold for classification is miscalibrated. Always evaluate with task-appropriate metrics.

Close train/val loss is genuinely a good sign for generalization — so the classmate is partially right. But loss alone doesn't tell you if the model works. You still need accuracy, precision, recall, or AUC on a held-out test set to make that call. Loss and task performance aren't the same thing.

Lab 3: Training Loop Debugger

Your co-op technical screen. The training loop has bugs. Fix it live.

The scenario

You're doing the Jordan scenario for real. It's your co-op technical screen. You've been given a training loop that has three subtle bugs — the kind that won't cause immediate crashes but will produce a model that doesn't actually learn. Your AI peer is acting as a technical interviewer who will help you find and explain the bugs — but you need to identify them, not just be told.

Here's the broken training loop:

for epoch in range(50):
    for X_batch, y_batch in train_loader:
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)
        loss.backward()
        optimizer.zero_grad()
        optimizer.step()

Find at least two bugs and explain what each one causes. Your peer will probe your reasoning.

AI Peer — Technical Interview

Lab 3

Alright, you've got the broken loop in front of you. Walk me through what you see. Don't just list the bugs — explain what each one actually does to the training process. I'm going to push back on your reasoning, so be specific.

Lesson 4 · Module 3

Evaluating What You Built: Metrics That Actually Matter

Accuracy, precision, recall, and the confusion matrix — reading your model's report card honestly.

Your model trained. Your loss looks good. How do you know if it's actually any good?

Aaliyah built a model to predict whether a loan applicant will default. She's a finance major with a data science minor at NYU, and this is her capstone project. She trained it on 10,000 records. Accuracy on the test set: 94%. She was thrilled.

Her advisor looked at her confusion matrix and asked: "How many actual defaults did you correctly catch?" Aaliyah didn't know what that meant. She opened the matrix. Out of 600 true defaults in the test set, her model caught 12. It predicted "no default" for 588 real defaulters — and still got 94% accuracy, because 94% of the test set had no default and the model learned to just predict that.

Her model was a sophisticated way to do nothing useful. Accuracy had completely misled her. This lesson is about not making that mistake — knowing which metrics to use, when accuracy is actively deceptive, and how to read a confusion matrix like someone who understands what they built.

Why Accuracy Lies on Imbalanced Data

Aaliyah's situation isn't unusual — it's the default failure mode for anyone who measures only accuracy on imbalanced classification problems. If 94% of your samples belong to class 0, a model that always predicts class 0 achieves 94% accuracy without learning anything. This is called the majority class baseline, and it's the first thing you should check before celebrating any accuracy number.
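Checking the majority class baseline takes only a few lines. The sketch below is plain Python and assumes a hypothetical label list shaped like Aaliyah's test set (9,400 non-defaults, 600 defaults); the function name is made up for illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for a model that always predicts the most common label."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical test set: 9,400 non-defaults (0) and 600 defaults (1)
labels = [0] * 9400 + [1] * 600

majority_class, baseline_acc = majority_baseline(labels)
# An always-majority model never predicts the minority class, so its recall there is 0
print(f"Always predict {majority_class}: accuracy = {baseline_acc:.2%}, minority recall = 0.00")
```

If your trained model's accuracy is not clearly above this number, accuracy alone is telling you very little.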

The fix isn't a complicated algorithm change; it's measuring the right things. Three metrics and one table, taken together, give you a complete picture of how a binary classifier is actually performing:

Precision Of all the times the model predicted "positive," what fraction were actually positive? High precision means few false alarms. In loan default prediction: of all predicted defaults, how many were real?

Recall Of all the actual positives, what fraction did the model catch? High recall means few missed positives. In loan default: of all real defaults, how many did we flag?

F1 Score The harmonic mean of precision and recall. A single number that balances both — useful when you need one metric but care about both. Low if either precision or recall is low.

Confusion Matrix A 2×2 table showing true positives, true negatives, false positives, and false negatives. The most complete single picture of binary classifier behavior. Don't submit a classification project without one.
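To make "low if either precision or recall is low" concrete, here is a small sketch comparing F1 with a plain average. The precision and recall values are illustrative, not taken from any model in this lesson.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.90, 0.40   # illustrative values
arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.3f}")
print(f"F1 (harmonic):   {f1:.3f}")
```

The harmonic mean sits closer to the weaker of the two numbers, which is why a model cannot hide poor recall behind strong precision when you report F1.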

Computing Metrics in PyTorch (Without sklearn)

You can compute all of this directly from your model's predictions. First, you need to convert raw logits to binary predictions:

model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        logits = model(X_batch).squeeze(-1)   # (N, 1) -> (N,); stays 1-D even for a batch of one
        probs = torch.sigmoid(logits)         # Convert to 0-1 probabilities
        preds = (probs >= 0.5).float()        # Threshold at 0.5
        all_preds.append(preds)
        all_labels.append(y_batch)

all_preds = torch.cat(all_preds)
all_labels = torch.cat(all_labels)

# Compute metrics
TP = ((all_preds == 1) & (all_labels == 1)).sum().item()
TN = ((all_preds == 0) & (all_labels == 0)).sum().item()
FP = ((all_preds == 1) & (all_labels == 0)).sum().item()
FN = ((all_preds == 0) & (all_labels == 1)).sum().item()

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall    = TP / (TP + FN) if (TP + FN) > 0 else 0
f1        = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"\nConfusion Matrix:")
print(f"  TP: {TP}  FP: {FP}")
print(f"  FN: {FN}  TN: {TN}")
    

Practical Takeaway

Every time you evaluate a binary classifier, check the recall on the minority class. As a rough rule of thumb, recall below about 0.3 means the model is missing most of that class, whatever the overall accuracy says. In loan default, cancer detection, and fraud detection, the minority class is usually the one that actually matters. On imbalanced data, optimizing only for accuracy will mislead you.

The Precision-Recall Tradeoff and Threshold Choice

The 0.5 threshold in the code above is arbitrary. In most real problems, the optimal threshold isn't 0.5 — it depends on the cost of different types of mistakes.

In loan default prediction, a false negative (missed default) costs the bank far more than a false positive (flagging a good borrower for review). So the bank would set a lower threshold — say 0.3 — to catch more defaults at the cost of more false alarms. In a spam filter, you might prefer fewer false positives (blocking real email) and accept more false negatives (letting through some spam). These are business decisions, not math decisions.

You can visualize the full tradeoff by sweeping threshold from 0 to 1 and plotting precision vs. recall at each point — this is the precision-recall curve. The AUC-PR (area under that curve) is a threshold-independent metric that tells you how good your model is regardless of where you set the cutoff. A random classifier gets AUC-PR equal to the positive class frequency. A perfect classifier gets 1.0.
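One common way to summarize the area under the precision-recall curve is average precision. The sketch below is a plain-Python version under simplifying assumptions (binary labels, no tied scores); the function name and the tiny score list are made up for illustration, and a library implementation may handle edge cases differently.

```python
def average_precision(scores, labels):
    """Mean of the precision values at each rank where a true positive appears.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank   # precision with the cutoff just below this score
    return precision_sum / total_positives

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0, 0]

print(f"Average precision:        {average_precision(scores, labels):.3f}")
print(f"Positive class frequency: {sum(labels) / len(labels):.3f}")
```

Comparing the two printed numbers is the quick sanity check: a useful model should score well above the positive class frequency.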

Here's the practical way to understand what threshold to use:

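thresholds = torch.arange(0.1, 1.0, 0.1)

model.eval()
with torch.no_grad():
    # X_test must hold the same samples, in the same order, as all_labels above
    # (true when test_loader was built from X_test with shuffle=False)
    logits = model(X_test).squeeze(-1)
    probs = torch.sigmoid(logits)

for t in thresholds:
    preds = (probs >= t).float()
    tp = ((preds == 1) & (all_labels == 1)).sum().item()
    fp = ((preds == 1) & (all_labels == 0)).sum().item()
    fn = ((preds == 0) & (all_labels == 1)).sum().item()
    p = tp / (tp + fp + 1e-8)   # small epsilon avoids division by zero
    r = tp / (tp + fn + 1e-8)
    print(f"Threshold {t:.1f} → Precision: {p:.3f}, Recall: {r:.3f}")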
    

Run this and you'll see a table showing the tradeoff. Pick the threshold that best matches what you're optimizing for. This is something your AI can compute — but only you know which mistakes are worse in your specific context.
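If you can put even rough numbers on "which mistakes are worse," the choice becomes mechanical. The sketch below is plain Python with made-up probabilities, labels, and costs; the function names and the cost values are assumptions for illustration, not figures from a real bank or spam filter.

```python
def total_cost(probs, labels, threshold, fp_cost, fn_cost):
    """Total cost of the mistakes made at a given threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return fp * fp_cost + fn * fn_cost

def best_threshold(probs, labels, fp_cost, fn_cost):
    """Return the threshold in 0.1..0.9 with the lowest total cost (first one wins ties)."""
    thresholds = [i / 10 for i in range(1, 10)]
    return min(thresholds, key=lambda t: total_cost(probs, labels, t, fp_cost, fn_cost))

# Hypothetical predicted probabilities and true labels (1 = positive)
probs  = [0.95, 0.80, 0.65, 0.45, 0.35, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    0]

# Loan-like setting: a missed positive is assumed to cost 10x a false alarm
loan_like = best_threshold(probs, labels, fp_cost=1, fn_cost=10)
# Spam-like setting: a false alarm is assumed to cost 5x a missed positive
spam_like = best_threshold(probs, labels, fp_cost=5, fn_cost=1)

print(f"Loan-like costs -> threshold {loan_like}")
print(f"Spam-like costs -> threshold {spam_like}")
```

The model's predictions are identical in both runs. Only the cost assumptions change, and the preferred threshold moves with them.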

Saving Your Model and What Comes Next

You've built, trained, and evaluated a real neural network. Before this session ends, save it:

# Save model weights
torch.save(model.state_dict(), 'song_classifier.pt')

# Load it back later
new_model = SongClassifier()
new_model.load_state_dict(torch.load('song_classifier.pt'))
new_model.eval()
    

state_dict() saves the learned parameters (plus buffers such as batch norm running statistics), not the architecture. That means you need your class definition to load them back, so keep the model code alongside the weights file. Saving weights separately from architecture is the standard convention because it keeps files small and avoids tying the file to your exact Python environment. You can still refactor the class later, as long as the layer names and shapes stay the same.

What you've built in this module — the environment setup, the model definition, the training loop, the evaluation code — is the complete template for any tabular deep learning project. The specific problem changes. The data changes. The architecture might grow. But the structure you've built here is reusable. Every real model you build from here follows this exact pattern.

What We're All Navigating Together

The evaluation problem is everywhere right now. People build models, see a big accuracy number, post it, and move on without checking whether the model is actually doing what they think. This isn't malicious — it's a gap in how evaluation gets taught. Knowing to check recall on the minority class, to look at the confusion matrix, to think about threshold choice — these put you ahead of a lot of people calling themselves "ML practitioners." Use that honestly, not as gatekeeping.

Quiz — Lesson 4

Evaluating what you built: metrics that actually matter

1. Aaliyah's fraud detection model achieves 97% accuracy on test data. The fraud rate in the dataset is 3%. Without seeing a confusion matrix, what can you already reasonably suspect?

Right. When accuracy equals the majority class frequency, it's a red flag. A model that predicts "not fraud" for every single instance gets 97% accuracy on this dataset. Check recall on the fraud class immediately — if it's near zero, the model has learned nothing useful.

With 3% fraud rate, a model that always predicts "not fraud" gets 97% accuracy automatically. This is the majority class baseline trap. High accuracy on imbalanced data is almost meaningless without looking at recall on the minority class. Always check your confusion matrix.

2. A medical imaging model for cancer detection has precision 0.90 and recall 0.40. What does this mean in practical terms?

Correct. High precision, low recall: the model is conservative — when it flags something, it's usually right, but it misses most actual positives. In cancer detection, missing 60% of real cancers is catastrophic regardless of how accurate the positive flags are. This model needs its threshold lowered to improve recall, accepting more false alarms.

Precision = when we say yes, how often are we right? Recall = of all the real positives, how many did we catch? So precision 0.90 means our positive flags are 90% accurate; recall 0.40 means we only find 40% of actual cancers. In medical contexts, low recall (missing real cases) is usually the bigger problem.

3. You lower your classification threshold from 0.5 to 0.3. What happens to precision and recall?

Correct. Lowering the threshold makes the model more likely to predict "positive," so it catches more real positives (recall goes up) but usually flags more negatives incorrectly (precision tends to go down). This is the fundamental precision-recall tradeoff. Which direction you move depends on which mistake is costlier in your context.

A lower threshold means you predict "positive" for more cases. That catches more true positives (recall increases) but usually also lets in more false positives (precision tends to decrease). This is the core tradeoff: you generally can't improve both by changing the threshold alone.

4. What does model.state_dict() save, and what does it not save?

Correct. state_dict() is a dictionary of parameter names to tensors — the learned weights. The model architecture (class definition) lives in your Python code, not in this file. When loading, you instantiate the class first, then load the weights into it.

state_dict() contains only the learned parameters — the weight and bias tensors for each layer. It does not contain the architecture. This means you must always have access to your model class definition when loading saved weights, because PyTorch needs to know the structure before it can fill in the values.

5. You build a song chart prediction model. Your F1 score is 0.72 on the test set. A classmate says their model gets 0.68. Is your model definitively better?

Right. Metric comparisons are only valid when conditions are identical: same test set, same preprocessing, same class distribution, same threshold. A model trained on a cleaner dataset or evaluated on an easier split can look better on paper without being a better model. Reproducible comparison requires controlled conditions.

F1 comparisons require apples-to-apples conditions. If the test sets differ, the class distribution differs, or the thresholds differ, the comparison is meaningless. Two models evaluated on different splits of the same dataset can show reversed rankings on the full dataset. Control your comparison conditions before claiming anything.

Lab 4: Model Evaluation Consultant

The stakeholder just saw 94% accuracy and wants to ship. You have to tell them the real story.

The scenario

You're the ML intern at a fintech startup. The team just trained a loan default prediction model. The CEO saw "94% accuracy" in the notebook and sent a Slack message: "Amazing! Let's put this in production next week." Your manager pulled you aside and said: "Look at the confusion matrix before you let this happen."

The confusion matrix shows: TP=18, FP=45, FN=582, TN=9355. The dataset has 10,000 test records with 600 actual defaults.

Your AI peer is acting as a skeptical colleague who will help you think through what to tell the CEO — but you need to do the analysis and make the call.

Start by calculating precision, recall, and F1 from those confusion matrix numbers. Then tell me whether you'd recommend shipping this model and how you'd explain it to the CEO.

AI Peer — Evaluation Analyst

Lab 4

Okay, you've got the numbers: TP=18, FP=45, FN=582, TN=9355. The CEO wants to ship. Walk me through the math first — what are the precision, recall, and F1? Then tell me your recommendation. Don't hedge — make a call.

Module Test

Module 3 — Your First Model: Hands-On in 30 Minutes · 15 questions · Pass at 80%

1. You open a fresh Colab notebook and run import torch; print(torch.__version__). It works. Then you run torch.cuda.is_available() and it returns False. What is the most likely fix?

Correct. Colab defaults to CPU. GPU must be explicitly enabled under Runtime settings.

PyTorch is installed and working — the issue is the runtime type. Go to Runtime → Change runtime type → T4 GPU.

2. Why does setting torch.manual_seed(42) matter when training neural networks?

Correct. Fixed seeds = reproducible experiments. Essential for debugging and fair comparison.

The seed fixes the random number generator for reproducibility. The number 42 is just convention — any integer works.

3. You define a PyTorch model class but forget to call super().__init__() in __init__. What is the most likely result?

Right. super().__init__() initializes the parent nn.Module, which is what enables parameter tracking, GPU transfer, and the training machinery.

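Without super().__init__(), the nn.Module parent class isn't initialized, so the internal registries that track parameters and submodules don't exist. In practice PyTorch raises an AttributeError ("cannot assign module before Module.__init__() call") as soon as __init__ assigns the first layer, so the model can't be built, let alone trained.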

4. What is the mathematical role of ReLU in a neural network?

Correct. ReLU = max(0, x). Simple, but its non-linearity is what makes depth meaningful — without it, any stack of linear layers collapses into a single linear transformation.

ReLU applies max(0,x) — zeros out negatives, passes positives unchanged. Its purpose is non-linearity. Without activation functions between linear layers, depth adds nothing mathematically.

5. Your model has the layers: Linear(6, 32) → ReLU → Linear(32, 16) → ReLU → Linear(16, 1). How many total trainable parameters does it have?

The math: fc1: (6×32)+32=224. fc2: (32×16)+16=528. fc3: (16×1)+1=17. Total: 224+528+17=769. ReLU has no parameters. This is the standard formula for Linear layer parameter counting.

For each Linear(in, out): parameters = (in × out) + out (the bias). fc1: (6×32)+32=224. fc2: (32×16)+16=528. fc3: (16×1)+1=17. Total=769. ReLU layers have zero learnable parameters.
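The same count can be reproduced in a few lines of plain Python. This is a sketch of the formula only; the helper name is made up, and it does not inspect a real PyTorch model.

```python
def linear_params(in_features, out_features, bias=True):
    """Trainable parameters in one fully connected layer: weights plus optional biases."""
    return in_features * out_features + (out_features if bias else 0)

# (in, out) sizes for the three Linear layers in the question; ReLU contributes nothing
layer_sizes = [(6, 32), (32, 16), (16, 1)]
per_layer = [linear_params(i, o) for i, o in layer_sizes]
total = sum(per_layer)

print("Per layer:", per_layer)
print("Total:", total)
```

On a real model you can cross-check the result with sum(p.numel() for p in model.parameters() if p.requires_grad).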

6. In the training loop, which is the correct order of operations?

Correct order. Zero first (clear old gradients) → forward (get predictions) → loss (measure error) → backward (compute gradients) → step (update weights). Changing this order corrupts training.

The correct order is: zero_grad → forward → loss → backward → optimizer.step(). Zeroing must happen before backward, and stepping must happen after gradients are computed.

7. What does loss.backward() actually do in PyTorch?

Correct. backward() implements automatic differentiation — it uses the chain rule to compute ∂loss/∂θ for every parameter θ. These gradients are stored in param.grad tensors and used by the optimizer in the next step.

backward() computes gradients — it doesn't update weights (that's optimizer.step()). PyTorch builds a computational graph during the forward pass; backward() traverses it in reverse using the chain rule.

8. Your training loss curve shows rapid decrease in the first 10 epochs, then levels off completely for the remaining 40 epochs. What should you try first?

Right. A flat plateau after initial learning usually means convergence — which could be real (the model learned what it can) or a local minimum. First, check val loss. Then consider architecture changes, feature engineering, or learning rate scheduling.

Plateau after initial learning is often genuine convergence, not necessarily a bug. Check val loss — is it similar? If so, the model may have learned all it can from the current setup. Consider feature engineering or increased model capacity.

9. Why must model.eval() be called before running predictions on new data?

Correct. model.eval() switches behavior of dropout (disabled) and batch norm (uses running stats). Without it, predictions vary each run because dropout randomly zeros neurons — a silent bug that's hard to catch.

model.eval() deactivates dropout and switches batch norm to use running statistics rather than batch statistics. Without it, dropout keeps randomly zeroing neurons, making every inference call return slightly different results.

10. A dataset has 95% class-0 examples and 5% class-1. A model achieves 95.1% accuracy. What additional information is essential before concluding the model is useful?

Right. 95.1% is barely above the majority-class baseline of 95%. Recall on class-1 tells you whether the model actually learned anything about the minority class or is just predicting the majority class universally.

A model that predicts class-0 for every instance gets 95% accuracy here. 95.1% is not meaningfully above that baseline. You need class-1 recall to know if the model has learned anything at all about the minority class.

11. Recall = 0.85, Precision = 0.60. Which statement is true?

Correct. Recall 0.85 → catches 85% of true positives (misses 15%). Precision 0.60 → of all positive predictions, 60% are real (40% are false alarms). F1 = 2×(0.6×0.85)/(0.6+0.85) ≈ 0.703 (harmonic mean, not arithmetic).

Recall=0.85 means you catch 85% of real positives, missing 15%. Precision=0.60 means 60% of your positive predictions are correct; 40% are false alarms. F1 is the harmonic mean (not arithmetic mean): 2×(0.6×0.85)/(0.6+0.85)≈0.703.

12. You want to save your trained model and load it later to make predictions. What is the correct approach?

Correct. Saving state_dict() is the standard PyTorch pattern. You save just the weights dictionary, and load it into a freshly instantiated model class. It's portable, lightweight, and doesn't depend on your specific Python environment.

The recommended approach is saving state_dict() — just the weights — and using load_state_dict() to restore them. Saving the whole model object with torch.save(model) works but creates fragile dependencies on your Python environment and file paths.

13. In a Jupyter/Colab notebook, you run Cell 8 first, then Cell 3, then Cell 8 again. Cell 8 uses variables from Cell 3. What is the result of the second Cell 8 run?

Correct. Notebooks share a single Python memory space. Once Cell 3 has run, its variables are available to any subsequent cell. The execution order matters, not the cell position.

Notebooks share memory across all cells within a session. After running Cell 3, its variables exist in memory and are available to Cell 8. The order you run cells in is what matters, not their numbered position in the notebook.

14. What is the purpose of DataLoader with shuffle=True during training?

Right. Without shuffling, the model sees the same sequence of batches every epoch. If data is sorted by class or date, this creates patterns in the gradient updates that don't reflect the true data distribution. Shuffling breaks this.

Shuffling randomizes sample order. Without it, if your data is sorted (e.g., all class-0 first, then class-1), the model sees unrepresentative batches and receives biased gradient signals. Shuffling each epoch ensures each batch is a representative sample of the whole dataset.
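Here is a small plain-Python sketch of the difference. The make_batches helper is a hypothetical stand-in for DataLoader, and the sorted label list is made up. Unlike this fixed-seed version, DataLoader with shuffle=True draws a new order every epoch.

```python
import random

def make_batches(labels, batch_size, shuffle, seed=0):
    """Split labels into batches, optionally shuffling first (a stand-in for DataLoader)."""
    order = list(labels)
    if shuffle:
        random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Hypothetical dataset sorted by class: all class-0 samples first, then all class-1
sorted_labels = [0] * 8 + [1] * 8

for shuffle in (False, True):
    batches = make_batches(sorted_labels, batch_size=4, shuffle=shuffle)
    print(f"shuffle={shuffle}: class-1 count per batch = {[sum(b) for b in batches]}")
```

Without shuffling, every batch contains a single class, so each gradient step only ever "sees" one side of the problem.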

15. A startup wants to ship a model that predicts whether rental applicants will pay rent on time. Train accuracy: 91%, val accuracy: 90%, recall on "late payer" class: 0.08. What is your recommendation?

Right call. Recall 0.08 is near-zero predictive validity on the class that actually matters for the use case. The model is sophisticated-looking majority-class prediction. Shipping it means making consequential decisions about real people's housing based on a model that provides no meaningful signal — which raises both technical and ethical flags.

Recall of 0.08 means the model catches only 8% of actual late payers — it fails on 92% of them. High accuracy here is the imbalanced-data illusion. Making housing decisions using a model that has essentially no predictive power on the target class is harmful regardless of overall accuracy. Rebuild before shipping.