Marcus is a junior at Georgia Tech, double-majoring in CS and economics. He spent winter break watching YouTube tutorials on deep learning, nodding along confidently. He understood the concepts. He got it. Neurons, weights, backpropagation — totally got it.
Then he opened his laptop on January 6th and tried to actually run something. Two hours later he had sixteen browser tabs open, a broken conda environment, a CUDA version mismatch error he didn't understand, and a growing suspicion that everyone who says "just pip install tensorflow" has never actually pip installed tensorflow on a fresh machine.
This lesson is the one Marcus needed. We're skipping the local setup rabbit hole entirely — at least to start — and going straight to a tool where you can run real deep learning code in under three minutes, with free GPU access and zero configuration pain.
Here's what a lot of tutorials don't tell you: getting the environment set up is genuinely one of the hardest parts for beginners, and it has nothing to do with deep learning. It's about operating systems, package managers, CUDA drivers, Python versions, and the fact that these things interact in ways that were designed by engineers for engineers who already know what they're doing.
The good news is that Google Colab exists. Colab is a browser-based Python notebook environment (think Google Docs, but for code) that runs on Google's servers. No installation. No CUDA drivers. No version conflicts. You open a browser, go to colab.research.google.com, and you're in. The free tier typically gives you access to a T4 GPU, subject to availability, which is more than enough to train the models we're building in this module.
A lot of your peers are still fighting with local setups because they think running locally is more "legitimate." It's not. Professional ML engineers use cloud environments constantly. Colab, Kaggle Notebooks, and cloud VMs are how most real work gets done. Local setup matters eventually — but for your first model, it's just friction.
Go to colab.research.google.com right now, sign in with your Google account, and create a new notebook. In the first cell, type print("hello world") and press Shift+Enter. That's it: you have a working Python environment. New notebooks start on a CPU runtime, so when you want a GPU, open Runtime > Change runtime type and pick a GPU accelerator. Everything else in this module runs from exactly this starting point.
Deep learning in 2025 runs on a small set of tools that have become the industry default. You don't need to master all of them before building — but you need to know what they are and why they exist:
All four of these are pre-installed in Colab. You don't install anything. You just import them.
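A check cell for this could look like the sketch below. It assumes only that PyTorch is importable, as it is in Colab. The helper name describe_device is made up for illustration and is not part of PyTorch.

```python
import torch


def describe_device():
    """Return (device_type, description) for the hardware PyTorch can see."""
    if torch.cuda.is_available():
        # Index 0 is the first (and in Colab, normally the only) GPU.
        return "cuda", torch.cuda.get_device_name(0)
    return "cpu", "no GPU visible to PyTorch"


device_type, description = describe_device()
device = torch.device(device_type)
print(f"PyTorch {torch.__version__} | device: {device} | {description}")
```

Keeping the result in a device variable is a common habit: later cells can move the model and data with .to(device) without caring which hardware was allocated.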
Run that cell and you'll see which GPU you've been allocated. Don't panic if you get CPU — the first model we build is small enough that it trains in under a minute on CPU anyway. GPU matters more in Lesson 4 when model size goes up.
Colab uses Jupyter notebooks — files where code and text live in the same document, in separate "cells." This is different from running a Python script from the command line, and the difference matters for how you work.
In a notebook, cells run independently but share memory. If you define a variable in cell 3, cell 7 can use it, but only if cell 3 has been executed first. This trips up almost everyone at least once. The execution counter in square brackets next to each cell shows which cells have been run and in what order.
The workflow you'll use throughout this module: write a small chunk of code in one cell, run it, see the output, adjust, move to the next cell. This is called exploratory development and it's how most data scientists and ML engineers actually work. The final polished script comes later, once you know what works.
A lot of people try to write the entire training script before running any of it — like they're writing a term paper. This almost never works in ML. A model that fails silently at step 8 because of a shape mismatch in step 3 is incredibly hard to debug when you haven't verified anything along the way. Run small. Verify often. Debug early.
The first cell of every ML notebook you write should be your imports — all the libraries your code needs. This isn't just style; it's practical. If an import fails, you want to know immediately, not twenty cells later when you finally call a function that doesn't exist.
Here's the import block you'll use throughout this module. Study what each line brings in:
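As a rough sketch, an import cell of this kind might look like the following. The exact set of libraries is an assumption (a real notebook would likely also import plotting and data-frame libraries), and SEED is just an illustrative variable name.

```python
import random

import numpy as np                      # array math
import torch                            # tensors and autograd
import torch.nn as nn                   # layers, activations, loss functions
from torch.utils.data import DataLoader, TensorDataset  # batching helpers

# Fix every source of randomness we control so runs are comparable.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print("imports OK, seeds set to", SEED)
```

If any import fails, this cell fails right away, which is exactly where you want to find out.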
That last part, setting random seeds, deserves a mention. Neural networks initialize with random weights, and training involves randomness in how data is sampled. If you don't fix the seed, your model will produce slightly different results every time it runs, which makes it much harder to debug or compare experiments. (Even with a fixed seed, some GPU operations are not perfectly deterministic, so expect close rather than bit-identical results.) Seed 42 is the community convention, most likely a nod to The Hitchhiker's Guide to the Galaxy. Don't fight it.
You are now looking at everything you need to start building. The next lesson digs into what you're actually building — a real neural network, defined in PyTorch, trained on real data. But the environment you've set up here is the foundation. If these cells run cleanly, you're ready.
You run torch.cuda.is_available() and it returns False. What is the most likely explanation?
You define model = MyNet() in Cell 5. You then restart the kernel and run only Cell 7, which uses model. What happens?
What is the purpose of torch.manual_seed(42)?
What is the role of torch.nn in the PyTorch ecosystem?
torch.nn is the neural network module, where the architectural components live: layers (Linear, Conv2d), activation functions (ReLU, Sigmoid), and loss functions (CrossEntropyLoss). Data loading is handled by torch.utils.data, GPU management is lower-level, and visualization is matplotlib.
You just started an ML internship. Your manager sent you a Colab notebook and asked you to get it running before your first standup tomorrow. The notebook imports PyTorch, checks for GPU, sets seeds, and loads data. Three of the first five cells are throwing errors you've never seen before.
Your AI peer has done this before. They're going to help you work through what's actually wrong — but they're not going to just hand you answers. You need to explain what you're seeing and make calls about what to fix.
Priya is building a portfolio project for job applications. She wants to classify whether a Spotify song will chart based on audio features — tempo, danceability, energy, valence. She's found a dataset. She's set up Colab. She knows what she wants to predict.
Now she has to actually write a neural network. She opens the PyTorch docs and immediately runs into something called nn.Module. She reads the explanation three times. It talks about subclassing and __init__ and forward methods. It's technically accurate and completely opaque.
This lesson is what Priya actually needed: a concrete, line-by-line walkthrough of what a neural network definition looks like in PyTorch, why it's written that way, and what happens when you call it. By the end, you'll have a model definition you understand well enough to modify — which is the only kind of code that's actually useful.
Every neural network you build in PyTorch is a Python class that inherits from nn.Module. This isn't arbitrary — it's what gives PyTorch the ability to track your layers, manage parameters, enable gradient computation, and handle GPU transfer automatically. When you subclass nn.Module, you get all of that for free.
The pattern has two required parts: an __init__ method where you define your layers, and a forward method where you define what happens when data flows through the network. That's it. Everything else is optional.
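Pieced together from the details this lesson gives (a class called SongClassifier, layers fc1, fc2, fc3 with sizes 8, 64, 32, 1, ReLU on the hidden layers, and Dropout(0.3)), the definition could look roughly like this. Treat it as a sketch; the exact placement of the dropout layer is an assumption.

```python
import torch
import torch.nn as nn


class SongClassifier(nn.Module):
    def __init__(self):
        super().__init__()               # initialise nn.Module's bookkeeping first
        self.fc1 = nn.Linear(8, 64)      # 8 audio features -> 64 hidden units
        self.fc2 = nn.Linear(64, 32)     # 64 -> 32 hidden units
        self.fc3 = nn.Linear(32, 1)      # 32 -> 1 raw score (logit)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)              # only active in training mode
        x = self.relu(self.fc2(x))
        return self.fc3(x)               # no activation: the loss handles the logit


model = SongClassifier()
print(model)
```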
Let's be honest about what's confusing here. The super().__init__() line looks like boilerplate magic, and it kind of is. It calls the parent class's constructor, which sets up the internal bookkeeping nn.Module uses to register your layers and their parameters. If you forget it, PyTorch raises an AttributeError ("cannot assign module before Module.__init__() call") as soon as __init__ tries to assign the first layer. You'll just copy this line every time.
The forward method is where your data actually flows. Notice that fc1, fc2, and fc3 are just functions at this point — you call them on x, and they transform it. The ReLU is applied after each linear layer (except the last), which is standard practice for hidden layers.
When you write nn.Linear(8, 64), you're creating a layer that multiplies an input vector of size 8 by a weight matrix to produce an output vector of size 64 — and adds a bias vector. The weights and biases are the parameters that get updated during training. At initialization, they're random. After training, they encode what the network has learned.
The dimensions matter. If your data has 8 features (tempo, danceability, energy, valence, loudness, speechiness, acousticness, liveness), the first layer must start with in_features=8. If it doesn't match, PyTorch will throw a shape mismatch error the moment you pass data through. This is the most common runtime error for beginners — dimensions not matching up.
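To make the shapes concrete, here is a small sketch that inspects one nn.Linear(8, 64) layer and then deliberately feeds it the wrong number of features. The batch size of 4 and the 12-feature "bad" input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 64)
print(layer.weight.shape)   # torch.Size([64, 8]): stored as (out_features, in_features)
print(layer.bias.shape)     # torch.Size([64])

batch = torch.randn(4, 8)   # 4 songs, 8 features each
out = layer(batch)
print(out.shape)            # torch.Size([4, 64])

bad_batch = torch.randn(4, 12)  # 12 features: does not match in_features=8
try:
    layer(bad_batch)
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)
    print("RuntimeError:", mismatch_error)
```

The error appears on the very first forward pass, so a wrong in_features never survives long enough to waste training time.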
After defining your model class, always instantiate it and print it: model = SongClassifier(); print(model). PyTorch will print a structured summary of every layer and its dimensions. If a layer is missing or the dimensions look wrong, you'll catch it here — before you even load data.
Writing the class doesn't create a model — it defines a blueprint. You need to instantiate it:
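A sketch of that step, with a parameter count broken down by layer. The class is repeated here only so the cell runs on its own; in a notebook you would reuse the definition from the earlier cell. The per_layer dictionary is an illustrative helper, not a PyTorch feature.

```python
import torch.nn as nn


class SongClassifier(nn.Module):
    """Same layer sizes as the lesson's model: 8 -> 64 -> 32 -> 1."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.relu(self.fc2(x))
        return self.fc3(x)


model = SongClassifier()   # the blueprint becomes an actual object with weights
print(model)

# Group parameter counts by layer name ("fc1.weight" -> "fc1").
per_layer = {}
for name, param in model.named_parameters():
    layer_name = name.split(".")[0]
    per_layer[layer_name] = per_layer.get(layer_name, 0) + param.numel()

total_params = sum(p.numel() for p in model.parameters())
print(per_layer)
print("total parameters:", total_params)
```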
2,689 parameters. That's a small model by any measure; GPT-4 is rumored to have roughly 1.8 trillion (OpenAI has not published the figure). But those 2,689 numbers, when trained well, can learn to classify songs with meaningful accuracy. This is the point: you don't need massive models to learn something real.
The parameter count formula for a Linear layer is (in_features × out_features) + out_features (the bias). Check the math: fc1 has (8×64)+64 = 576. fc2 has (64×32)+32 = 2,080. fc3 has (32×1)+1 = 33. Total: 576 + 2,080 + 33 = 2,689. The activation and dropout layers add none.
Once your model exists, you can run data through it with a single call. In PyTorch, calling model(x) runs model.forward(x) for you. The call goes through nn.Module's __call__ mechanism, which also runs any registered hooks and other internal machinery. Always use model(x), not model.forward(x) directly.
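Here is a minimal sketch of that call on a fake batch of 4 songs. To keep the cell short it uses an nn.Sequential stand-in with the same layer sizes as the lesson's model rather than the class itself; fake_data is random noise, not real audio features.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Stand-in with the lesson's layer sizes (8 -> 64 -> 32 -> 1).
model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

fake_data = torch.randn(4, 8)   # 4 songs, 8 features each
output = model(fake_data)       # goes through __call__, which runs forward()

print(output.shape)             # one raw score per song
print(output)
```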
The output is 4 numbers, one per song, representing the raw score (called a logit) for whether that song charts. The numbers aren't probabilities yet — they're unbounded. We'll apply a sigmoid or use a loss function that handles this in Lesson 3.
What just happened, step by step: your random 4×8 input matrix was multiplied through fc1 to become 4×64, then ReLU zeroed out negatives, then dropout randomly zeroed about 30% of the values (and rescaled the rest, since a new model is in training mode by default), then fc2 compressed to 4×32 and ReLU was applied again, then fc3 compressed to 4×1. Each step transformed the data, and the final numbers encode the model's current (untrained, therefore meaningless) predictions. Training will make those numbers meaningful.
Most beginners spend too much time on architecture before training anything. The model above — three layers, ReLU, dropout — is a reasonable starting point for most tabular data problems. You will need to adjust it, but you can only know how after you've actually trained it and seen where it fails. Design → train → evaluate → redesign. Not design → perfect → train.
Your model has nn.Linear(10, 50) as the first layer. Your input data has shape (32, 12). What happens when you run the forward pass?
nn.Linear(10, 50) expects exactly 10 input features. A 12-feature input will throw a RuntimeError immediately (mat1 and mat2 shapes cannot be multiplied). This is actually helpful: it catches mistakes early. Dimension mismatch errors are the most common beginner error in PyTorch.
Priya's model has nn.Dropout(0.3) in the architecture. During inference on new songs, she wants predictions, not training. What should she do?
Call model.eval(). PyTorch models have two modes: training (dropout active) and evaluation (dropout disabled, batch normalization using its running statistics). Use model.train() to switch back for training. Forgetting model.eval() is a subtle bug: dropout will randomly change your predictions every run.
The lesson calls model(fake_data) instead of model.forward(fake_data). What is the difference in practice?
Always use model(x). It invokes __call__, which wraps your forward method with PyTorch's internal machinery and runs any registered hooks before and after forward(). Calling model.forward(x) directly skips all of that, which can silently break tools that rely on hooks, such as profilers.
To count a model's parameters, use sum(p.numel() for p in model.parameters()). model.parameters() returns an iterator over all learnable parameter tensors, and p.numel() returns the number of elements in each. Summing them gives the total parameter count. This is the standard pattern used everywhere in the PyTorch ecosystem. There is no built-in parameter_count() method or torch.count_params() function.
You're at a startup that's building a tool to predict whether a freelance job posting will get enough quality applicants within 48 hours, based on 12 features including pay rate, description length, required skills count, client rating, and category.
The founder wants a neural network. Your AI peer is a senior engineer who will help you think through the architecture, but they'll push back on choices that don't make sense. You need to propose a model design and defend it.
Jordan is applying for a data science co-op at a healthcare analytics company. The technical screen includes a take-home: "Build and train a binary classifier on the provided dataset. Submit your notebook with training loss curves." Twenty-four hours. Jordan has defined the model. The data is loaded. And now they're staring at the blank cell where the training loop should go.
They've watched the loss function get explained on YouTube. They understand gradient descent in the abstract — find the direction of steepest descent, take a step in that direction, repeat. But the actual code? Every tutorial either skips the details or buries them in a framework that hides all the interesting parts.
This lesson writes the training loop explicitly — no wrappers, no magic, just the raw PyTorch that runs every iteration. Once you understand what these 15 lines do, you can debug any training problem you encounter, because you'll know where to look.
Before you write a single line of the training loop, you need three things configured:
Here is the complete training loop. Every line matters. Every line is doing something specific.
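What follows is a sketch of such a loop, not necessarily the exact one this lesson was written around. It assumes the three prerequisites are a loss function (BCEWithLogitsLoss, as discussed in this lesson), an optimizer (Adam with lr=1e-3 is my choice here), and a DataLoader. The data is synthetic and the small nn.Sequential model is a stand-in so the cell runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic stand-in data: 256 rows, 8 features, label depends on two features.
X = torch.randn(256, 8)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.BCEWithLogitsLoss()                 # expects raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

train_losses = []
for epoch in range(20):
    model.train()                                  # make sure dropout etc. are active
    running_loss = 0.0
    for features, labels in train_loader:
        optimizer.zero_grad()                      # 1. clear gradients from the last step
        logits = model(features)                   # 2. forward pass
        loss = criterion(logits, labels)           # 3. measure the error
        loss.backward()                            # 4. compute gradients
        optimizer.step()                           # 5. update the weights
        running_loss += loss.item() * features.size(0)
    epoch_loss = running_loss / len(train_loader.dataset)
    train_losses.append(epoch_loss)
    if (epoch + 1) % 5 == 0:
        print(f"epoch {epoch + 1:2d}  loss {epoch_loss:.4f}")
```

The five numbered lines inside the inner loop are the part that never changes; everything around them is bookkeeping.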
Let's be specific about what's happening and why the order is non-negotiable:
zero_grad(): PyTorch accumulates gradients by default — each backward pass adds to existing gradients rather than replacing them. If you don't zero them at the start of each step, gradients from previous batches contaminate the current update. This is an intentional design choice (useful for gradient accumulation) but it means you have to be explicit.
loss.backward(): This is where the magic actually happens. PyTorch has been tracking every computation in your forward pass, building a computational graph. backward() traverses this graph in reverse, computing the gradient of the loss with respect to every parameter using the chain rule. You don't write this math — PyTorch does it automatically.
optimizer.step(): Uses the freshly computed gradients to nudge every weight slightly in the direction that reduces loss. The size of the nudge is controlled by the learning rate. Too large and the model oscillates or diverges. Too small and it learns agonizingly slowly.
After your training loop finishes, plot the loss curve: plt.plot(train_losses); plt.xlabel('Epoch'); plt.ylabel('Loss'); plt.show(). A healthy training curve falls steeply at first and gradually flattens. If it's still falling sharply at epoch 50, train longer. If it's oscillating wildly, your learning rate is too high. If it barely moves, your learning rate is too low or your data has problems.
The loss number the training loop prints is not accuracy. It's a scalar measuring prediction error according to the loss function. For BCEWithLogitsLoss, lower is better, and on a roughly balanced dataset a loss around 0.693 (which is -ln(0.5)) means the model is essentially guessing: it hasn't learned anything yet. Getting loss below 0.4 typically corresponds to meaningfully better-than-random predictions on a balanced binary task.
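The 0.693 figure is easy to verify by hand. The sketch below uses only the standard library; base_rate_loss is an illustrative helper that computes the average loss of a model that always predicts the positive-class frequency, which is the "knows nothing" baseline. The 6% example is an assumption chosen to show how the baseline shifts on imbalanced data.

```python
import math


def binary_cross_entropy(p_predicted, label):
    """Loss for one example, given a predicted probability of the positive class."""
    return -(label * math.log(p_predicted) + (1 - label) * math.log(1 - p_predicted))


def base_rate_loss(positive_rate):
    """Average loss when the model always predicts the positive-class frequency."""
    return -(positive_rate * math.log(positive_rate)
             + (1 - positive_rate) * math.log(1 - positive_rate))


coin_flip_loss = binary_cross_entropy(0.5, 1)
print(round(coin_flip_loss, 3))        # 0.693, i.e. ln 2
print(round(base_rate_loss(0.5), 3))   # 0.693 on balanced data
print(round(base_rate_loss(0.06), 3))  # 0.227 when only 6% of labels are positive
```

So on heavily imbalanced data, a loss well under 0.693 can still mean the model has learned nothing beyond the class frequencies.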
Your peers who are new to this will sometimes look at a loss of 0.35 and ask "is that good?" The only honest answer is: compared to what? Loss numbers are meaningful relative to where you started, relative to a baseline (like always predicting the majority class), and relative to your validation loss. A loss of 0.35 on training data with 0.8 on validation means you're overfitting badly. A loss of 0.35 on both means you have a real model.
Speaking of validation: every training loop should have a counterpart validation loop that runs on held-out data after each epoch. Here's the minimal version:
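One way to write it is sketched below. The function name validate, the random validation data, and the stand-in model are all assumptions made so the cell runs by itself.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def validate(model, val_loader, criterion):
    """Return the average loss over the validation set without touching weights."""
    model.eval()                      # dropout off, batch norm uses running stats
    total_loss = 0.0
    with torch.no_grad():             # no computational graph is built here
        for features, labels in val_loader:
            logits = model(features)
            total_loss += criterion(logits, labels).item() * features.size(0)
    model.train()                     # hand the model back in training mode
    return total_loss / len(val_loader.dataset)


# Tiny demo on random held-out data (hypothetical, just to exercise the function).
torch.manual_seed(42)
X_val = torch.randn(64, 8)
y_val = torch.randint(0, 2, (64, 1)).float()
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=32)

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 1))
val_loss = validate(model, val_loader, nn.BCEWithLogitsLoss())
print(f"validation loss: {val_loss:.4f}")
```

In a real run you would call validate once per epoch, right after the training batches, and append the result to a val_losses list next to train_losses.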
torch.no_grad() disables gradient computation during validation — you're not updating weights, just measuring performance, so there's no reason to build the computational graph. This makes validation faster and uses less memory.
The training loop will run without errors on complete garbage. If your labels are scrambled, your features are in the wrong order, or you have a data leak, the loop will happily run 50 epochs, print decreasing loss numbers, and produce a model that is worthless. The training loop is not a sanity check — you have to build those separately. Always verify one batch of data looks correct before running the full loop.
Why must optimizer.zero_grad() be called at the beginning of each training step?
PyTorch accumulates gradients across backward passes, so you have to clear them with optimizer.zero_grad(), otherwise old and new gradients stack up and corrupt your updates.
Why is torch.no_grad() used during validation?
torch.no_grad() tells PyTorch not to build the computational graph during the forward pass. Since you're not calling loss.backward() during validation, you don't need the graph, and skipping it reduces memory usage and speeds up inference. It is not about protecting the weights: validation never calls optimizer.step(), so there's no accidental update risk. torch.no_grad() is about efficiency.
You're doing the Jordan scenario for real. It's your co-op technical screen. You've been given a training loop that has three subtle bugs, the kind that won't cause immediate crashes but will produce a model that doesn't actually learn. Your AI peer is acting as a technical interviewer who will help you find and explain the bugs, but you need to identify them, not just be told.
Here's the broken training loop:
Find at least two bugs and explain what each one causes. Your peer will probe your reasoning.
Aaliyah built a model to predict whether a loan applicant will default. She's a finance major with a data science minor at NYU, and this is her capstone project. She trained it on 10,000 records. Accuracy on the test set: 94%. She was thrilled.
Her advisor looked at her confusion matrix and asked: "How many actual defaults did you correctly catch?" Aaliyah didn't know what that meant. She opened the matrix. Out of 600 true defaults in the test set, her model caught 12. It predicted "no default" for 588 real defaulters — and still got 94% accuracy, because 94% of the test set had no default and the model learned to just predict that.
Her model was a sophisticated way to do nothing useful. Accuracy had completely misled her. This lesson is about not making that mistake — knowing which metrics to use, when accuracy is actively deceptive, and how to read a confusion matrix like someone who understands what they built.
Aaliyah's situation isn't unusual — it's the default failure mode for anyone who measures only accuracy on imbalanced classification problems. If 94% of your samples belong to class 0, a model that always predicts class 0 achieves 94% accuracy without learning anything. This is called the majority class baseline, and it's the first thing you should check before celebrating any accuracy number.
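You can reproduce the majority class baseline in a few lines of plain Python. The numbers below mirror this lesson's scenario (10,000 test records, 600 of them defaults); the helper accuracy_and_recall is illustrative.

```python
def accuracy_and_recall(labels, predictions):
    """Accuracy over all rows, and recall on the positive (label 1) class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    actual_positives = sum(labels)
    return correct / len(labels), true_positives / actual_positives


labels = [1] * 600 + [0] * 9400          # 6% defaults, 94% non-defaults
always_no_default = [0] * len(labels)    # a "model" that never predicts default

accuracy, recall = accuracy_and_recall(labels, always_no_default)
print(f"accuracy {accuracy:.2%}, recall on defaults {recall:.2%}")
```

Any real model has to beat this do-nothing baseline on the metric you actually care about, not just match its accuracy.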
The fix isn't a complicated algorithm change — it's measuring the right things. There are four basic metrics that, together, give you a complete picture of how a binary classifier is actually performing:
You can compute all of this directly from your model's predictions. First, you need to convert raw logits to binary predictions:
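A sketch of that conversion and the metrics built on top of it. It assumes the four metrics in question are accuracy, precision, recall, and F1; the function name binary_metrics and the six toy logits are made up for illustration.

```python
import torch


def binary_metrics(logits, labels, threshold=0.5):
    """Turn raw logits into 0/1 predictions and compute confusion-matrix metrics."""
    probs = torch.sigmoid(logits)              # squash logits into (0, 1)
    preds = (probs >= threshold).long()        # apply the decision threshold
    labels = labels.long()

    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    tn = int(((preds == 0) & (labels == 0)).sum())

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 6 predictions, 3 true positives in the labels.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0, 1.5, -0.2])
labels = torch.tensor([1, 0, 0, 0, 1, 1])
metrics = binary_metrics(logits, labels)
print(metrics)
```

Here the model catches 2 of the 3 positives (recall 2/3) and raises one false alarm (precision 2/3), which the single accuracy number of 4/6 would not tell you.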
Every time you evaluate a binary classifier, check the recall on the minority class. If it's below 0.3, your model is effectively ignoring that class regardless of what the overall accuracy says. For loan default, cancer detection, fraud detection — the minority class is often the one that actually matters. Optimizing only for accuracy will betray you every time.
The 0.5 threshold in the code above is arbitrary. In most real problems, the optimal threshold isn't 0.5 — it depends on the cost of different types of mistakes.
In loan default prediction, a false negative (missed default) costs the bank far more than a false positive (flagging a good borrower for review). So the bank would set a lower threshold — say 0.3 — to catch more defaults at the cost of more false alarms. In a spam filter, you might prefer fewer false positives (blocking real email) and accept more false negatives (letting through some spam). These are business decisions, not math decisions.
You can visualize the full tradeoff by sweeping threshold from 0 to 1 and plotting precision vs. recall at each point — this is the precision-recall curve. The AUC-PR (area under that curve) is a threshold-independent metric that tells you how good your model is regardless of where you set the cutoff. A random classifier gets AUC-PR equal to the positive class frequency. A perfect classifier gets 1.0.
Here's the practical way to understand what threshold to use:
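A threshold sweep can be as simple as the sketch below. The ten probabilities and labels are invented toy values, and sweep_thresholds is an illustrative helper; in practice you would feed in the sigmoid outputs of your model on the validation set.

```python
def sweep_thresholds(probabilities, labels, thresholds):
    """Return one (threshold, precision, recall, tp, fp, fn) row per threshold."""
    rows = []
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probabilities]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall, tp, fp, fn))
    return rows


# Toy predicted probabilities and true labels (5 positives, 5 negatives).
probabilities = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels        = [0,    0,    0,    1,    1,    0,    0,    1,    1,    1]

rows = sweep_thresholds(probabilities, labels, thresholds=[0.3, 0.5, 0.7])
print("threshold  precision  recall  TP  FP  FN")
for t, precision, recall, tp, fp, fn in rows:
    print(f"{t:9.1f}  {precision:9.2f}  {recall:6.2f}  {tp:2d}  {fp:2d}  {fn:2d}")
```

On this toy data, lowering the threshold to 0.3 catches every positive at the cost of two false alarms, while raising it to 0.7 removes the false alarms but misses three positives.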
Run this and you'll see a table showing the tradeoff. Pick the threshold that best matches what you're optimizing for. This is something your AI can compute — but only you know which mistakes are worse in your specific context.
You've built, trained, and evaluated a real neural network. Before this session ends, save it:
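A minimal save-and-reload sketch is shown below. The file name, the temporary directory, and the build_model helper (a small stand-in architecture) are assumptions for illustration; in Colab you would more likely save to your mounted Drive or download the file.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    """Stand-in architecture; the same code must exist when you load the weights."""
    return nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))


torch.manual_seed(42)
model = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, "song_classifier_weights.pt")

    # Save only the learned tensors, not the Python class.
    torch.save(model.state_dict(), save_path)

    # Later (or in another notebook): rebuild the architecture, then fill in weights.
    restored = build_model()
    restored.load_state_dict(torch.load(save_path))
    restored.eval()   # switch to evaluation mode before making predictions

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print("weights restored correctly:", weights_match)
```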
state_dict() returns only the learned values (weights, biases, and any buffers), not the architecture. That means you need your class definition to load them back, which is fine as long as you keep the code alongside the weights file. Saving weights separately from architecture is the standard convention because the file doesn't depend on how or where your class is defined, which leaves you free to reorganize your code later.
What you've built in this module — the environment setup, the model definition, the training loop, the evaluation code — is the complete template for any tabular deep learning project. The specific problem changes. The data changes. The architecture might grow. But the structure you've built here is reusable. Every real model you build from here follows this exact pattern.
The evaluation problem is everywhere right now. People build models, see a big accuracy number, post it, and move on without checking whether the model is actually doing what they think. This isn't malicious — it's a gap in how evaluation gets taught. Knowing to check recall on the minority class, to look at the confusion matrix, to think about threshold choice — these put you ahead of a lot of people calling themselves "ML practitioners." Use that honestly, not as gatekeeping.
What does model.state_dict() save, and what does it not save?
state_dict() is a dictionary of parameter names to tensors: the learned weight and bias tensors for each layer. It does not contain the architecture, which lives in your Python code as the class definition. This means you must always have access to your model class definition when loading saved weights: you instantiate the class first, then load the weights into it, because PyTorch needs to know the structure before it can fill in the values.
You're the ML intern at a fintech startup. The team just trained a loan default prediction model. The CEO saw "94% accuracy" in the notebook and sent a Slack message: "Amazing! Let's put this in production next week." Your manager pulled you aside and said: "Look at the confusion matrix before you let this happen."
The confusion matrix shows: TP=18, FP=45, FN=582, TN=9355. The dataset has 10,000 test records with 600 actual defaults.
Your AI peer is acting as a skeptical colleague who will help you think through what to tell the CEO — but you need to do the analysis and make the call.
You run import torch; print(torch.__version__). It works. Then you run torch.cuda.is_available() and it returns False. What is the most likely fix?
Why does torch.manual_seed(42) matter when training neural networks?
You forget to call super().__init__() in __init__. What is the most likely result?
super().__init__() initializes the parent nn.Module, which is what enables parameter tracking, GPU transfer, and the training machinery. Without it, the parent class isn't initialized, so layers can't be registered as submodules and parameters. In practice PyTorch raises an AttributeError as soon as __init__ assigns the first layer, so the model can't be constructed, let alone trained.
What does loss.backward() actually do in PyTorch?
backward() computes gradients; it doesn't update weights (that's optimizer.step()). PyTorch builds a computational graph during the forward pass, and backward() traverses it in reverse, using the chain rule to compute ∂loss/∂θ for every parameter θ. This is automatic differentiation. The gradients are stored in the param.grad tensors and used by the optimizer in the next step.
Why should model.eval() be called before running predictions on new data?
model.eval() deactivates dropout and switches batch norm to use running statistics rather than batch statistics. Without it, dropout keeps randomly zeroing neurons, so every inference call returns slightly different results. It is a silent bug that's hard to catch.
Saving state_dict() (just the weights) and using load_state_dict() to restore them is the standard PyTorch pattern. You save just the weights dictionary and load it into a freshly instantiated model class. It's portable, lightweight, and doesn't depend on your specific Python environment. Saving the whole model object with torch.save(model) works but creates fragile dependencies on your Python environment and file paths.
Why use a DataLoader with shuffle=True during training?