Module 7 · Lesson 1

PyTorch: The Framework That Actually Makes Sense

Why the whole field migrated away from TensorFlow, and what that tells you about picking tools
If the best researchers all switched to the same framework, shouldn't you know why?

Priya is three weeks into a machine learning internship at a mid-sized biotech startup. Her manager hands her a GitHub repo and says, "Get familiar with the training loop — we ship a new model version Friday." She opens the code. It's PyTorch. She's spent the last semester doing homework in TensorFlow because that's what her professor used. The syntax feels alien. She spends two hours reading docs instead of doing actual work.

On Slack, she DMs a friend who graduated a year ahead: "Did you ever learn PyTorch or did you just use TF?" The reply comes back fast: "TF is basically dead in research. PyTorch is what everyone uses. You can pick it up in a week, honestly — the mental model is way more intuitive."

That word — intuitive — is doing a lot of work. What does it actually mean for a framework to be intuitive? And why did the entire research community essentially vote with their commits?

The Framework Wars: What Actually Happened

In late 2015, Google open-sourced TensorFlow with enormous institutional momentum. It was fast, production-ready, and backed by the most powerful tech company in AI. By most metrics, it should have won. And for a while, it did, at least at companies. But in academia and research labs, something else happened.

Facebook AI Research released PyTorch in 2017, and within two years it had majority share in academic papers. By 2022, the split at major ML conferences like NeurIPS was roughly 75% PyTorch, 25% TensorFlow. As of 2024, that gap has only grown. Google's own DeepMind lab migrated significant work to JAX (a PyTorch cousin in spirit), and TensorFlow 2.x made eager execution the default, largely a reactive redesign that followed what PyTorch did first.

This isn't just trivia. It's a case study in how developer experience beats institutional muscle when the community is technical enough to choose. The researchers who built the best models chose PyTorch because it let them think and debug faster. And the models those researchers built became the foundations everyone else builds on. So the toolchain propagated.

The Real Reason PyTorch Won

TensorFlow (v1) used a "define-and-run" static graph — you described a computation graph, compiled it, then executed it. Errors were cryptic because execution was separate from definition. PyTorch used "define-by-run" (eager execution) — the graph builds dynamically as your Python code runs. This means you can use a Python debugger, print tensors mid-computation, and write loops that actually behave like Python loops. For researchers iterating on novel architectures, this was transformative.
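A minimal sketch of what define-by-run allows, assuming a recent PyTorch install. The helper name repeated_double and the threshold value are made up for illustration. The point is that an ordinary Python while loop, whose number of iterations depends on the tensor's actual values, can sit in the middle of a differentiable computation, with a print statement inside it:

```python
import torch

def repeated_double(x, limit):
    """Double x until its norm reaches limit (hypothetical helper for illustration)."""
    steps = 0
    while x.norm() < limit:      # plain Python control flow on a live tensor value
        x = x * 2
        steps += 1
        print(f"step {steps}: norm = {x.norm().item():.2f}")  # inspect mid-computation
    return x, steps

x = torch.tensor([1.0, 1.0], requires_grad=True)
y, steps = repeated_double(x, limit=10.0)

# The graph was recorded as the loop ran, so backward() follows the same path.
y.sum().backward()
print(steps, x.grad)
```

Here the loop runs three times, so y equals 8 * x and the gradient of y.sum() with respect to each element of x is 8. A different starting value would take a different number of iterations, and the recorded graph would change with it.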

The PyTorch Mental Model: Tensors and Autograd

Everything in PyTorch revolves around two concepts. Once you have these, the rest is API surface.

Tensor An n-dimensional array — the generalization of scalars (0D), vectors (1D), and matrices (2D) to arbitrary dimensions. PyTorch tensors work like NumPy arrays but can live on GPU and track gradients.
Autograd PyTorch's automatic differentiation engine. Operations on a tensor with requires_grad=True are recorded in a computational graph. Call .backward() on a scalar result (typically the loss) and autograd computes the gradient with respect to every such tensor that contributed to it, storing each gradient in that tensor's .grad attribute. This is how learning happens.
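As a small illustration of the two definitions above, here is a sketch that differentiates a one-variable function. The function y = x² + 2x is chosen only because its derivative, 2x + 2, is easy to check by hand:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)   # leaf tensor that autograd will track
y = x ** 2 + 2 * x                          # each operation is recorded in the graph

y.backward()                                # computes dy/dx and stores it in x.grad
print(x.grad)                               # dy/dx = 2x + 2, evaluated at x = 3, is 8
```

In a real model, x would be a weight matrix and y would be the loss, but the mechanism is the same.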

Here's the core training loop in PyTorch, stripped down to essentials:

for each batch: zero the gradients → run a forward pass → compute loss → call loss.backward() → optimizer.step()

That's it. Five steps, repeated thousands of times. Everything else — model architecture, data loading, evaluation — is scaffolding around this loop. Once you see that the loop never changes, PyTorch stops feeling complicated and starts feeling like a very clean contract.
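The five steps can be sketched in runnable form as follows. Everything specific here is an assumption made for illustration: the dummy data (a noise-free line y = 3x + 1), the one-layer model, the learning rate, and the batch size. Only the order of the five calls inside the inner loop is the part that carries over to real code:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Dummy regression data: y = 3x + 1 (made up for illustration)
inputs = torch.linspace(-1, 1, 64).unsqueeze(1)   # shape (64, 1)
targets = 3 * inputs + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.no_grad():                             # measure the starting point
    initial_loss = loss_fn(model(inputs), targets).item()

batch_size = 16
for epoch in range(50):
    for start in range(0, len(inputs), batch_size):
        x = inputs[start:start + batch_size]
        y = targets[start:start + batch_size]

        optimizer.zero_grad()          # 1. zero the gradients
        prediction = model(x)          # 2. forward pass
        loss = loss_fn(prediction, y)  # 3. compute loss
        loss.backward()                # 4. backpropagate
        optimizer.step()               # 5. update the weights

with torch.no_grad():
    final_loss = loss_fn(model(inputs), targets).item()

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Real projects typically replace the manual slicing with a DataLoader and add evaluation and checkpointing, but the inner five lines stay recognizably the same.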

torch.nn: Building Models Without Writing Math by Hand

The torch.nn module is where you define neural network architectures. The core class is nn.Module — every model you write is a subclass of it. You define two things: the layers in __init__, and what happens on a forward pass in forward().

PyTorch ships with everything you'd need: linear layers (nn.Linear), convolutions (nn.Conv2d), attention mechanisms, batch normalization, dropout, and dozens more. You compose these like LEGO blocks. A simple two-layer classifier is maybe 10 lines. A transformer encoder is maybe 40, if you're building it from scratch — and you probably won't be, because Hugging Face (Lesson 2) ships those pre-built.

What matters right now: understand that nn.Module is the abstraction that lets you treat any neural network — from a 3-layer MLP to a 70-billion-parameter LLM — as a Python object with a forward() method. This uniformity is what makes the ecosystem composable.
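A sketch of the nn.Module pattern described in this section, assuming a small two-layer classifier. The class name TinyClassifier and the layer sizes are illustrative choices, not anything prescribed by PyTorch:

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    def __init__(self, in_features, hidden, num_classes):
        super().__init__()
        # Layers assigned as attributes are registered, so their parameters are tracked.
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # How an input flows through the layers.
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyClassifier(in_features=4, hidden=8, num_classes=3)
logits = model(torch.randn(5, 4))    # calling the model runs forward()

num_params = sum(p.numel() for p in model.parameters())
print(logits.shape, num_params)
```

The parameter count works out to (4 × 8 + 8) + (8 × 3 + 3) = 67. The same model.parameters() call is what you hand to an optimizer, whether the model has 67 parameters or billions.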

Practical Takeaway

Before your next project or interview, spend 45 minutes writing a bare PyTorch training loop from scratch — no tutorials, just the docs. Define a tiny model (2 linear layers), use a simple loss (MSELoss or CrossEntropyLoss), and train it on dummy data. You'll understand 80% of what real production code does. Most people skip this and feel confused forever.

What Your Peers Are Getting Wrong

The most common mistake people in our age range make with PyTorch isn't the syntax — it's the abstraction level. Lots of folks jump straight to PyTorch Lightning or Keras wrappers because they want to "write less code." That's fine for projects. But it means they can't read or debug the underlying training loop when something breaks. And things always break.

There's a real career difference between "I use PyTorch Lightning" and "I understand what Lightning is abstracting." The second person can debug a NaN loss. The first person opens a Stack Overflow tab. Both things are navigable — but knowing which one you are is important. If you're still learning the fundamentals, go deeper before you go higher.

Another common gap: not understanding the device abstraction. PyTorch tensors live on a specific device, CPU or GPU. An operation between tensors on different devices raises a RuntimeError. The fix is usually a line or two (tensor.to(device) for the data, model.to(device) for the model), but you have to understand why it's necessary. This is also why free GPU access matters, which is Lesson 3's whole topic.
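A minimal sketch of the usual device pattern, assuming a single optional CUDA GPU. The layer and batch sizes are arbitrary. The code picks a device once, then moves both the model and the data to it, so it runs unchanged on a laptop CPU or a cloud GPU:

```python
import torch
from torch import nn

# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)      # moves the model's parameters
batch = torch.randn(8, 4).to(device)    # moves the input data

output = model(batch)                   # both operands now live on the same device
print(output.device)
```

Forgetting either .to(device) call is the typical source of the device-mismatch error when a GPU is present.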

Lesson 1 Quiz

PyTorch: The Framework That Actually Makes Sense · 5 questions
1. What was the primary technical reason researchers preferred PyTorch over TensorFlow v1?
Correct. Eager execution meant the computation graph built dynamically as Python ran — so you could use print statements, breakpoints, and standard debuggers. TF1's static graph approach made debugging a separate, painful process.
Not quite. The distinction wasn't about inference speed or mobile support — it was about the development experience for researchers building novel architectures. TF1's define-and-run graph made iteration slow and debugging cryptic.
2. You're training a model and your loss suddenly becomes NaN after epoch 3. You're using PyTorch Lightning. What's the core problem with only knowing Lightning but not the underlying training loop?
Exactly. Abstractions are great until they fail. If you don't understand that the loop has distinct steps — forward pass, loss computation, backward pass, optimizer step — you can't narrow down where the numerical instability is originating.
Lightning doesn't magically prevent NaN losses, and it definitely supports GPUs. The issue is that without understanding the underlying loop, you don't have a mental model for isolating where in the computation the problem originates.
3. What does requires_grad=True do when set on a PyTorch tensor?
Right. This is the flag that opts the tensor into PyTorch's automatic differentiation system. Any operation on a requires_grad tensor creates a node in the computational graph, so that .backward() can propagate gradients all the way back through it.
requires_grad is specifically about gradient tracking, not device placement or precision. Setting it to True tells autograd to build a computational graph for this tensor so .backward() can compute derivatives with respect to it.
4. By approximately 2022, what was the rough share of PyTorch vs TensorFlow usage in major ML research conference papers?
That's the reported split at venues like NeurIPS around 2022. PyTorch had clearly won research mindshare, though TensorFlow retained significant production/enterprise use, particularly in older codebases and Google-ecosystem teams.
The actual split was roughly 75/25 in PyTorch's favor at major research conferences around 2022. TF wasn't extinct — it still had enterprise presence — but the research community had clearly moved.
5. In PyTorch's nn.Module system, where do you define the model's layers, and where do you define how data flows through them?
Correct. __init__ is where you instantiate your layers (nn.Linear, nn.Conv2d, etc.) as attributes. forward() is where you define the computation — how an input tensor flows through those layers. PyTorch calls forward() automatically when you call the model like a function.
It's the other way around: layers go in __init__() and the computation logic goes in forward(). This separation is intentional — __init__ registers layers so PyTorch can track their parameters; forward() defines the computation graph.

Lab 1: PyTorch Architecture Consultant

You're advising a team on their PyTorch setup. Take a position.

Your Role

You're a junior ML engineer at a 12-person startup. A product manager just forwarded you a message from a new hire asking whether the team should migrate their training code from "raw PyTorch" to PyTorch Lightning to "save time." Your tech lead asked you to draft a recommendation. The lab AI will play your tech lead — opinionated, direct, willing to push back on weak reasoning.

Start by telling the AI: what's your recommendation on the PyTorch vs Lightning question, and why? Be specific about trade-offs. The AI will challenge your reasoning and ask follow-up questions. You need at least 3 exchanges to complete the lab.
Tech Lead — Maya Chen
AI LAB PARTNER
Hey — glad you're thinking about this. The new hire sent a pretty persuasive message. Before I push back on anything, let me hear your take: should we move to Lightning or stay in raw PyTorch? And I want your actual reasoning, not "it depends." Make a call.
Module 7 · Lesson 2

Hugging Face: The GitHub of AI Models

How one platform became the distribution layer for the entire open-source AI ecosystem
When the best models in the world are free to download, what exactly are you competing on?

Marcus is a senior studying computer science at DePaul, trying to build a portfolio project that stands out. He has an idea: a tool that reads Chicago city council meeting transcripts and automatically flags when members contradict their previous positions. Civic accountability, built with AI. He thinks he'll need six months to train a model capable of this.

His roommate, who just got a job at a data science consultancy, asks him: "Have you looked at Hugging Face? There are like 500,000 models on there. Someone probably already made something that handles legislative text."

Marcus searches. In 20 minutes he finds a fine-tuned BERT model trained on political speech. In 40 minutes he has it running locally. In three hours he has a working prototype. The six-month project became a weekend project — not because the AI got easier, but because the distribution of AI got way better.

What Hugging Face Actually Is

Hugging Face started as an NLP chatbot company in 2016. By 2018, they'd pivoted to building developer tools around transformer models, releasing the Transformers library. That library became the de facto standard for working with pretrained language models. Then they launched the Hub — a repository for sharing models, datasets, and demo apps — and everything changed.

As of mid-2024, the Hugging Face Hub hosts over 750,000 models and 150,000 datasets, with thousands being uploaded weekly. The platform is community-run in the same way GitHub is — anyone can upload, fork, and build on others' work. But unlike GitHub, it's optimized for model artifacts: versioned model weights, inference APIs, and standardized metadata that makes it searchable.

The company has raised over $235 million in funding and is valued around $4.5 billion. But its real leverage is softer than that: it's the place where open-source AI momentum lives. When Meta releases Llama, it goes on Hugging Face. When Stability AI releases a new image model, it goes on Hugging Face. When a grad student fine-tunes something useful, it probably goes on Hugging Face.

The Transformers Library: Your Shortcut to State-of-the-Art

The transformers Python library is the technical core of Hugging Face's value. It standardizes how you load and use pretrained models across hundreds of architectures: BERT, GPT-2, T5, Llama, Whisper, CLIP, and many more. The API is consistent enough that switching from a sentence classifier to a text generator is essentially changing two lines of code.

Pipeline The highest-level abstraction in the Transformers library. One function call that handles tokenization, model inference, and output decoding. Many task types are built in: text-classification, text-generation, summarization, translation, question-answering, image-classification, and more.
AutoClass AutoModel, AutoTokenizer, AutoConfig — classes that automatically detect the correct model architecture from a model name or path. You don't need to know if a model is BERT or RoBERTa; AutoModel loads the right class.
Model Hub ID A string like "bert-base-uncased" or "mistralai/Mistral-7B-v0.1" that uniquely identifies a model on the Hub. Pass it to from_pretrained() and the library handles download, caching, and instantiation.

The from_pretrained() pattern is what unlocks everything. Pass a Hub ID, get back a fully initialized model with weights trained on billions of tokens. You're not starting from scratch — you're starting from a very good starting point and adapting it to your specific problem.
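A hedged sketch of the Hub-ID pattern using the pipeline abstraction, which calls from_pretrained() internally. The model ID below is one publicly available sentiment classifier picked as an example, and the helper names classify and format_predictions are hypothetical. The transformers import sits inside the function so the file still loads on a machine without the library installed:

```python
def format_predictions(texts, predictions):
    """Pair each input with its predicted label and rounded score."""
    return [
        (text, pred["label"], round(pred["score"], 3))
        for text, pred in zip(texts, predictions)
    ]

def classify(texts, model_id="distilbert-base-uncased-finetuned-sst-2-english"):
    """Download (or load from cache) a Hub model and run it on a list of strings."""
    # Imported here so this file can be loaded without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return format_predictions(texts, classifier(texts))

# Example call (downloads the model weights on first use):
# classify(["This library saved me a week.", "The docs confused me."])
```

Swapping in a different model for the same task should only require changing the model_id string, which is the consistency the library is known for.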

Fine-Tuning vs. Prompting vs. RAG: Knowing Which Tool to Reach For

Once you have a pretrained model from Hugging Face, you have three main ways to make it useful for your specific problem:

Prompting: Just write better inputs. Works immediately, costs nothing, requires no training. Best when the model is already capable of the task and you just need to guide it. Worst when you need consistent structured outputs or domain-specific vocabulary the model doesn't know.

RAG (Retrieval-Augmented Generation): Connect the model to a vector database of your own documents. The model retrieves relevant chunks before generating. Best when you have a specific knowledge base the model wasn't trained on. Doesn't update the model's weights — it just gives it better context.

Fine-tuning: Actually train the model further on your data, adjusting weights. Best when you need the model to change its style, learn new facts persistently, or specialize deeply. Most expensive in time and compute. Hugging Face's PEFT library (Parameter-Efficient Fine-Tuning) lets you fine-tune massive models by only updating a small subset of parameters — LoRA adapters being the most popular technique.
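To make "a small subset of parameters" concrete, here is a back-of-the-envelope sketch. LoRA represents the update to a weight matrix as the product of two thin matrices of rank r, so the trainable count for one layer is r × (d_out + d_in) instead of d_out × d_in. The layer size and rank below are illustrative values, not taken from a specific model:

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8     # illustrative sizes, not tied to a specific model
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)

print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

For this one layer, the adapter trains 65,536 values against 16,777,216 in the full matrix, which is under half a percent.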

Common Peer Mistake

A lot of people in the "I'm learning AI" crowd immediately jump to fine-tuning because it sounds more technical and impressive. Then they spend two weeks setting up a training pipeline for a task that good prompting would have solved in an hour. Fine-tuning is a last resort, not a first move. Exhaust prompting and RAG first — they're faster, cheaper, and often good enough.

Spaces: Free Demos That Actually Get You Noticed

Hugging Face Spaces is a hosting platform for ML demos, built on Gradio or Streamlit. It's free for CPU-tier apps and cheap for GPU-backed ones. More importantly, it's where the community discovers what's possible — good Spaces get linked in newsletters, discussed in Discord servers, and sometimes go viral in the ML Twitter/X community.

If you're building a portfolio, a working Space is worth five GitHub repos of training scripts. Recruiters and researchers can run it without cloning anything. It demonstrates that you can go from model to deployed product, which is the gap most people can't cross. Put your project on a Space. Use Gradio — it takes 10 lines of Python.
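A sketch of what a minimal Gradio app can look like, assuming the gradio package is installed where it runs. The word_count function is a toy stand-in for a real model call, and build_demo is a hypothetical helper name:

```python
def word_count(text):
    """Toy stand-in for a model: report how many words the input has."""
    n = len(text.split())
    return f"{n} word{'s' if n != 1 else ''}"

def build_demo():
    """Wrap word_count in a web UI. Requires `pip install gradio`."""
    import gradio as gr   # imported lazily so the helper above works without Gradio

    return gr.Interface(fn=word_count, inputs="text", outputs="text",
                        title="Word counter")

# In a Space's app.py you would finish with:
# build_demo().launch()
```

Replacing word_count with a function that calls your model is the only change needed to turn this into a real demo.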

Practical Takeaway

This week, search the Hugging Face Hub for a model relevant to something you care about — your field of study, a hobby, a problem you've seen. Download it, run it locally on three examples, and think about what you'd change. You don't have to build anything. Just form a relationship with the tool. The people who are comfortable with Hugging Face before they need it are the ones who ship things fast when they do.

Lesson 2 Quiz

Hugging Face: The GitHub of AI Models · 5 questions
1. What does the from_pretrained() method do in the Hugging Face Transformers library?
Exactly. Pass a Hub model ID like "bert-base-uncased" and from_pretrained() handles the download, caches it locally, and returns a fully initialized model object with all the weights from pretraining. This is the core pattern for using Hugging Face models.
from_pretrained() specifically loads models with their pretrained weights already in place. It's the opposite of training from scratch — you're starting from a model that's already learned from a huge corpus, then optionally adapting it.
2. You need a model that can answer questions about your company's internal documentation (500 PDFs). The model should use only those documents as its source. Which approach fits best?
RAG is the right call here. You have a specific, bounded knowledge base that the model wasn't trained on. RAG lets the model retrieve actual document chunks at inference time — so answers are grounded in your specific content without the cost and complexity of fine-tuning.
Fine-tuning to embed facts in model weights is unreliable (models hallucinate around fine-tuned facts), and prompting can't fit 500 PDFs in a context window. RAG is purpose-built for this use case: a defined external knowledge base that the model queries at inference time.
3. What is a Hugging Face Space, and what is its primary practical value for someone building an AI portfolio?
Correct. Spaces host your ML demo as a live web app. The portfolio value is real: a working demo someone can interact with in 30 seconds is worth more than a GitHub repo they'd have to clone and configure. It proves you can ship, not just train.
Spaces are the public-facing demo layer — live apps built with Gradio or Streamlit that anyone can use in a browser. For portfolio purposes, this matters because it lets recruiters and collaborators actually experience your work rather than just read about it.
4. Hugging Face's PEFT library enables fine-tuning large models more efficiently. What does PEFT's LoRA technique actually do?
Right. LoRA (Low-Rank Adaptation) freezes the pretrained model weights and injects small trainable "adapter" matrices alongside specific layers. Only the adapters train — the base model stays frozen. This means you can fine-tune a 7B-parameter model on a consumer GPU because you're only updating millions of parameters, not billions.
LoRA works by freezing the base model and training small additional low-rank matrices that sit alongside the original weights. You get most of the benefit of fine-tuning while training a fraction of the parameters — which is why it's become the standard for adapting large models on limited hardware.
5. Marcus from the lesson story found a working pretrained model for his civic accountability project in 20 minutes instead of spending six months training one. What fundamental shift in the AI ecosystem made this possible?
That's the key insight. The models didn't get easier to train — distribution got dramatically better. When 750,000+ models are publicly available, searchable, and runnable with one function call, the question shifts from "can I build this model?" to "has someone already built something close?" Usually the answer is yes.
The story's point is about distribution, not model simplicity or funding. Hugging Face created infrastructure for sharing trained model weights — the community equivalent of "why write this code when someone else already open-sourced it?" Applied to model weights, that changes what's feasible for an individual builder entirely.

Lab 2: Hugging Face Model Selector

A startup needs a model recommendation. Make the call.

Your Role

You're consulting for a small edtech startup. They want to build a feature that automatically generates quiz questions from uploaded textbook chapters (PDF → text already handled). They have a $200/month compute budget and a one-developer team. The AI plays their CTO — skeptical, budget-conscious, and asking you to justify your Hugging Face model recommendation specifically.

Tell the AI which Hugging Face model or model type you'd recommend for this task, and your reasoning. Consider: task type, model size, compute cost, and whether they should use a pipeline or fine-tune. Be specific — "a T5-based model" is a position; "a large language model" is not. Minimum 3 exchanges.
CTO — Jordan Park
AI LAB PARTNER
Okay, I've got 20 minutes. We need to pick a Hugging Face model for quiz generation from textbook text. I've seen people throw GPT-4 at everything, but we can't afford the API costs at scale. What's your actual recommendation, and why that specific model type over the alternatives?
Module 7 · Lesson 3

Free GPUs: Colab, Kaggle, and the Art of Not Paying for Compute

Understanding the actual limits of free tiers — and how to squeeze the most out of them before you need to spend
If you can train real models for free, why do so many people never ship anything?

Keisha is trying to fine-tune a small language model on a dataset of her own journaling entries — a personal project, nothing commercial, just curiosity about whether a model trained on her own writing might feel different from ChatGPT. She doesn't have a GPU. Her laptop has integrated graphics. She posts in an ML Discord asking how people run training without spending money.

The responses are a mess. One person says Google Colab is fine. Another says Colab disconnects too often to be useful. A third recommends Kaggle. A fourth says she should just get a Lambda Labs instance. Someone says Colab Pro is worth it. Nobody agrees, and nobody explains why any of these apply to her specific situation.

Here's what nobody told her clearly: each free tier has specific constraints that matter depending on what you're doing. The person who says Colab is fine is running 10-minute inference jobs. The person who says it disconnects is running 6-hour training runs. Both are correct. The framework for choosing isn't "which is best" — it's "what does my workload look like?"

Google Colab: What You Actually Get

Google Colab free tier gives you a T4 GPU with 15GB of VRAM and 12–16GB of RAM. The catch everyone mentions is the disconnection policy: a session runs for at most 12 hours before it terminates, and the free tier can cut it shorter depending on availability. If your browser tab is idle for too long, it also disconnects. Your files don't persist beyond the session unless you mount Google Drive or save explicitly.

For the right workloads, this is genuinely great. Running inference on a medium-sized model? 20 minutes, done. Fine-tuning a small model for a few epochs? Totally feasible in a single session. Experimenting with a new architecture or debugging a training loop? Perfect. The T4 is a real GPU — not a toy.

Where Colab fails: anything requiring more than one continuous session. Long training runs where you're saving checkpoints and resuming. Jobs that exceed 12 hours. Multi-GPU workloads (free tier is single GPU). Large model inference where 15GB VRAM isn't enough (a 7B-parameter model in float16 needs ~14GB — tight).
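The 14GB figure is just parameter count times bytes per parameter. A small sketch of that arithmetic follows, with the caveat that it covers the weights only. Activations, the KV cache during generation, and optimizer state during training all add to it. The helper name is hypothetical:

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="float16"):
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"7B params in {dtype}: {weight_memory_gb(7_000_000_000, dtype):.0f} GB")
```

This prints 28, 14, and 7 GB for the three precisions, which is why quantization is the usual route to fitting larger models on a 15GB card.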

Colab Pro / Pro+ Paid tiers ($10/$50/month) that add priority GPU access, A100 availability, longer runtimes, and more RAM. If you're doing serious work and don't want a cloud bill, Pro+ is often cheaper than AWS for moderate usage. Not free, but cheap.

Kaggle Notebooks: The Underrated Alternative

Kaggle's free GPU tier is consistently underrated. You get 30 hours of GPU per week (T4 or P100) with sessions up to 9 hours. Unlike Colab, Kaggle notebooks save their state and outputs — your files persist as dataset artifacts. You can also schedule notebooks to run without a browser window open, which Colab free tier doesn't support.

The 30-hours-per-week limit sounds restrictive but is actually generous if you're thoughtful. 30 hours is enough to fine-tune a 1B-parameter model with LoRA on a reasonable dataset. Many competition winners do serious work entirely on Kaggle free tier. The platform also has a massive dataset library built in, so if your project uses public data, you might not even need to upload anything.

The real Kaggle advantage: session persistence and scheduled runs. You can kick off a training job, close your laptop, and come back to results. This fundamentally changes what's feasible compared to Colab free, where you need to babysit the browser.

Head-to-Head Summary

Use Colab when: you need fast iteration, you're debugging or experimenting, your job fits in one 12-hour session, or you want Drive integration for files.

Use Kaggle when: your training run exceeds one session, you need persistent outputs, you want to walk away and come back, or you're working with public datasets already on the platform.

Paid Cloud: When and Why You'd Actually Spend Money

Eventually, free tiers hit walls. Here's when it becomes worth paying:

You need a specific GPU: A100s come with 40GB or 80GB of VRAM, enabling models that simply can't fit in 15GB. If you're working with 13B+ parameter models without quantization, you might genuinely need one. Lambda Labs and Vast.ai offer A100s for $1–3/hour, often dramatically cheaper than AWS or GCP for the same hardware.

Your training run takes days: Anything over 30 hours of total GPU time per week exceeds what Kaggle gives you for free. If you're training a model seriously — not just experimenting — budget for it. A multi-day run on a $1/hour GPU might cost $20–50 total. That's reasonable for a project that matters.

You need reliability: Free tiers are interrupted. If you're running something for a deadline — a paper submission, a product launch, a client demo — pay for dedicated compute where you control the runtime.

Practical Takeaway

Before spending any money on compute, do this: estimate your GPU-hours. How many training steps? At what batch size? What's your model's forward-pass time per batch? Multiply it out. If the total comes to under 25 hours, Kaggle free tier handles it. Under 12 hours and it's a clean Colab run. Over 50 hours? Budget $30–80 on Lambda Labs or Vast.ai — it's cheaper than Colab Pro+ if it's just one job. The people who waste money on compute are the ones who didn't estimate first.
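The estimate described above can be sketched as a few lines of arithmetic. Several things here are assumptions for illustration: the rule of thumb that a backward pass roughly doubles the forward-pass time, the optional overhead multiplier, the example workload numbers, and the decision to treat anything over 25 hours as a candidate for paid compute. The function names are hypothetical:

```python
import math

def estimate_gpu_hours(n_examples, batch_size, epochs, forward_sec,
                       backward_factor=2.0, overhead_factor=1.0):
    """Rough GPU-hours: batches x time per batch, scaled for backward pass and overhead."""
    batches = math.ceil(n_examples / batch_size) * epochs
    seconds = batches * forward_sec * backward_factor * overhead_factor
    return seconds / 3600

def suggest_platform(gpu_hours):
    """Map an estimate onto the thresholds from the takeaway (simplified)."""
    if gpu_hours < 12:
        return "Colab free tier (single session)"
    if gpu_hours < 25:
        return "Kaggle free tier"
    return "consider paid compute"

# Hypothetical workload: 20,000 examples, batch size 16, 3 epochs, 0.5 s per forward pass
hours = estimate_gpu_hours(20_000, 16, 3, 0.5)
print(f"{hours:.2f} GPU-hours -> {suggest_platform(hours)}")
```

Timing a handful of real batches on the target GPU gives a better forward_sec than guessing, and raising overhead_factor accounts for data loading, evaluation, and reruns.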

The Environment Problem Nobody Warns You About

Here's the issue that trips up most people new to cloud notebooks: your environment is ephemeral. Every time you start a new Colab session, you get a fresh machine. Your pip installs from last session? Gone. Your downloaded model weights? Gone (unless you saved them to Drive). Your environment variables? Gone.

The fix is to treat setup as code. At the top of every notebook: install your dependencies, mount your storage, and load your data. Make the setup cell idempotent — running it twice shouldn't cause errors. This is the same discipline production engineers use for containerized environments, and the habit transfers directly.

Kaggle is better here because it lets you add packages to a "persisted" environment that survives sessions. But even there, getting into the habit of explicit, script-level environment setup is valuable. When you eventually move to a real server or a Docker container, you'll already think this way.
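One way to make a setup cell idempotent is to check what is already importable before installing anything. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes each import name matches its pip package name, which is not always true (for example, sklearn is installed as scikit-learn):

```python
import importlib.util
import subprocess
import sys

def missing_packages(import_names):
    """Return the names that cannot currently be imported."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

def ensure_packages(import_names):
    """Install only what is missing, so re-running the cell is a no-op."""
    to_install = missing_packages(import_names)
    if to_install:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *to_install])
    return to_install

# Standard-library modules are always present, so nothing is installed here.
installed_now = ensure_packages(["json", "pathlib"])
print("installed this run:", installed_now)
```

In a real notebook the list would name your project's dependencies, and the same cell would also mount storage and load data.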

Lesson 3 Quiz

Free GPUs: Colab, Kaggle, and the Art of Not Paying for Compute · 5 questions
1. Google Colab's free tier provides a T4 GPU with approximately how much VRAM, and what is the maximum session runtime?
Correct. The T4 has 15GB VRAM, which accommodates medium-sized models. The 12-hour session limit is the main practical constraint — after that, the runtime terminates and any unsaved state is lost.
Colab free tier gives you a T4 GPU with about 15GB VRAM and sessions that run up to 12 hours before automatic termination. These specific limits determine what workloads are actually feasible on the free tier.
2. You're fine-tuning a model on a dataset that will take about 18 hours of GPU time. You want to run it without keeping a browser window open. Which free platform is better suited, and why?
Kaggle is the right call. Its sessions can run without browser supervision, outputs persist as dataset artifacts, and 18 hours fits within two Kaggle sessions (9hr + 9hr) within the weekly 30-hour limit. Colab free disconnects if the tab goes idle, and terminates at 12 hours regardless.
Colab's 12-hour hard limit and browser-dependency make it wrong for unattended 18-hour runs. Kaggle allows scheduled, unattended runs up to 9 hours per session with persistent outputs — so two consecutive sessions would cover 18 hours within a single week's free allocation.
3. Why does Colab free tier require you to reinstall pip packages and reload files at the start of every new session?
Exactly. Each Colab session spins up a fresh container. Nothing installed or downloaded in the previous session exists — the disk is wiped. This is why you should treat setup code as a script that runs every time, not a one-time installation. Mount Drive for persistence.
The reason is purely infrastructure: Colab gives you a fresh virtual machine per session. When the session ends, that VM is reclaimed. There's no persistence of the filesystem by default — it's not a privacy policy, it's just how ephemeral cloud environments work.
4. You're estimating GPU-hours for a project. Your model does a forward pass in 0.2 seconds per batch, batch size 32, and you have 50,000 training examples, planning 10 epochs. Roughly how many GPU-hours do you need for training?
Good estimation. 50,000 examples ÷ 32 per batch ≈ 1,563 batches per epoch × 10 epochs ≈ 15,625 batches × 0.2 seconds each ≈ 3,125 seconds of forward-pass time in total. Backward passes roughly double this to ~6,250 seconds, or about 1.7 hours of pure forward/backward compute. Budgeting generously for data loading, evaluation, checkpointing, and reruns brings the estimate to ~8–9 GPU-hours total. Colab free tier handles it fine, with time to spare for debugging.
Work it out: 50,000 ÷ 32 ≈ 1,563 batches/epoch × 10 epochs ≈ 15,625 batches. At 0.2 sec/batch forward pass, that's 3,125 seconds. Backward pass roughly doubles this → ~6,250 seconds ≈ 1.7 hours just for forward/backward. Budget generously for data loading, evaluation, checkpointing, and reruns and you're around 8–9 GPU-hours. One Colab session handles it.
5. When does it make sense to pay for compute on Lambda Labs or Vast.ai instead of using free tiers?
Right. The decision framework is: can free tiers handle the hardware requirement, time requirement, and reliability requirement? If any of the three fails, budget for it. A $30 Lambda Labs job for a project that matters is a reasonable expense. Treating compute as always-free or always-paid is wrong either way.
The right mental model is to exhaust free tiers first, but recognize when specific constraints — VRAM, total compute hours, or reliability for a deadline — make paid compute the correct call. The lesson's advice: estimate GPU-hours first, then decide. Training absolutely can exceed free tier limits for large models.

Lab 3: Compute Budget Advisor

A project needs a compute plan. Don't let them overspend or underprepare.

Your Role

A friend is working on a capstone project: fine-tuning a 3B-parameter language model on a 50,000-example dataset for 5 epochs. They have no GPU, no budget, and a deadline 10 days away. They want your advice on which free platform to use and whether they'll need to spend money. The AI plays your friend — eager but inexperienced, with follow-up questions about specifics.

Advise your friend on whether Colab, Kaggle, or paid compute is right for this job. Be specific about: estimated GPU-hours, VRAM requirements for a 3B model with LoRA, and which platform matches those needs. Minimum 3 exchanges — the AI will push you on your estimates.
Friend — Alex (capstone project)
AI LAB PARTNER
Okay so I'm kind of panicking. I need to fine-tune a 3B-parameter model on 50k examples for 5 epochs and I have literally no GPU and no money. My advisor said just use "the cloud" but gave zero specifics. I've heard of Colab and Kaggle but I don't know if they can handle this. Can you actually help me figure out if I need to spend money, and where to run this thing?
Module 7 · Lesson 4

Putting It Together: A Real Project Workflow

How PyTorch, Hugging Face, and free compute actually combine — and the decisions that make or break a build
The tools exist. The models exist. The compute is (mostly) free. So why do most projects never ship?

Darius has been "learning deep learning" for eight months. He's completed three courses, read two textbooks, and watched countless YouTube tutorials. He can explain backpropagation at a dinner party. He has not shipped a single project.

He posts in an online forum: "I feel like I know the theory but I can't turn it into anything real. Every time I try to start a project I get stuck on setup before I even get to the model."

The replies are split between "just build something" (useless advice) and "you need to learn X first" (the same trap he's in). What nobody says clearly is: the gap between understanding tools and using them is entirely about workflow, not knowledge. Darius doesn't need to learn more. He needs a repeatable process that gets him from idea to running code in under 30 minutes.

The Five-Step Real-Project Workflow

This is the actual workflow that experienced ML practitioners use for new projects — not the idealized version, but the one that accounts for time constraints, imperfect data, and free compute limits.

Step 1 — Define the task precisely. Not "I want an AI that understands text" but "I want a model that classifies customer support tickets into 8 categories with at least 85% accuracy on our test set." The more specific, the easier every subsequent decision becomes. Task type determines model family. Model family determines data format. Data format determines preprocessing. Vague tasks create vague projects that never ship.

Step 2 — Search the Hub before writing code. Go to huggingface.co/models. Filter by task. Look for models with the most downloads and recent updates. Check the model card — does it describe training data similar to your domain? Read the example code. Can you get a baseline running in under 15 lines? If yes, that's your starting point. If nothing close exists, now you know you're in fine-tuning territory.

Step 3 — Get something running on CPU first. Seriously. On your laptop, in whatever environment you have. Run a forward pass. Feed it a single example and inspect the output. Does it make sense? Is the output shape correct? Are you getting reasonable logits? This step costs you 20 minutes and saves you 3 hours of debugging on GPU where the iteration cycle is slower.

Step 4 — Move to Kaggle/Colab for real training. Set up your notebook with the setup-as-code discipline from Lesson 3. Install dependencies in the first cell. Load your data from the Hub or Drive. Confirm your GPU is active (torch.cuda.is_available()). Run one epoch, inspect the loss curve. Is it decreasing? Are you getting vanishing gradients? Now you're iterating on the real problem.

Step 5 — Ship something, even if small. A Hugging Face Space with your model. A GitHub repo with a working inference script. A demo notebook with clear output. The threshold should be: can someone else run this and see a result? Perfectionism at this stage is just a different name for not shipping. A working prototype beats a perfect plan every time.
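
Step 3 of the workflow (one forward pass on CPU, then inspect the output) might look like the sketch below. It assumes a tiny stand-in model rather than a real Hub checkpoint so that it runs anywhere with PyTorch installed; the class name `TinyClassifier`, the 16 input features, and the 8 output classes (echoing the support-ticket example in Step 1) are all hypothetical.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a real model: 16 features in, 8 classes out."""

    def __init__(self, num_features=16, num_classes=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
model = TinyClassifier().eval()      # stays on CPU; no .to(device) yet
example = torch.randn(1, 16)         # a single fake example with a batch dimension

with torch.no_grad():                # inference only, no gradients needed
    logits = model(example)

probs = torch.softmax(logits, dim=-1)

print("output shape:", tuple(logits.shape))              # expect (batch, num_classes)
print("all finite:", bool(torch.isfinite(logits).all()))
print("predicted class:", int(probs.argmax(dim=-1)))
```

With a real Hub model the same three checks apply: output shape, finite values, and a prediction that looks plausible for the input.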

The Debugging Toolkit: What Actually Goes Wrong

Knowing the common failure modes saves hours. These are the real ones, not the textbook ones:

Loss stays flat: Most likely causes are a learning rate that is too low, the wrong loss function for your task type, labels not correctly formatted (0-indexed vs 1-indexed), or a bug in your data loader returning the same batch repeatedly. Check the data loader first; it's almost always the data.
NaN loss: Learning rate too high (gradients explode), numerical instability in a custom loss function, or a divide-by-zero in preprocessing. Add gradient clipping (torch.nn.utils.clip_grad_norm_) as a first fix. Then check your loss function inputs for zeros or infinities.
CUDA out of memory: Reduce batch size (halving it frees roughly half the activation memory). Enable gradient checkpointing (model.gradient_checkpointing_enable() on Hugging Face models). Use mixed precision (torch.autocast(device_type='cuda'); older code uses torch.cuda.amp.autocast()). If you're still out of memory, you need a model with fewer parameters or more VRAM.
Tensor device mismatch: You have tensors on CPU and GPU in the same operation. Fix: establish a device variable at the top (device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')), then consistently call .to(device) on every tensor and the model.
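
Two of the fixes listed above, the single device variable and gradient clipping, fit into one training step. The following is a minimal sketch under stated assumptions: the linear model, random inputs, and `max_norm` of 1.0 are placeholders chosen for illustration, not recommended settings. Mixed precision and gradient checkpointing are left out because they depend on the GPU and the model class.

```python
import torch
from torch import nn

# One device variable at the top; everything else is moved to it explicitly.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
model = nn.Linear(10, 2).to(device)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(4, 10).to(device)               # placeholder batch
labels = torch.tensor([0, 1, 1, 0]).to(device)       # class indices start at 0

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)

# Fail loudly on NaN or inf instead of training on garbage.
if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss: {loss.item()}")

loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm measured before clipping.
max_norm = 1.0                                       # assumption: illustrative value
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

optimizer.step()

print(f"device: {device}, loss: {loss.item():.4f}")
print(f"grad norm before clip: {float(norm_before):.4f}, after: {float(norm_after):.4f}")
```

Logging the pre-clip norm for every step is a cheap way to see an explosion coming before the loss turns into NaN.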
On Peer Habits

Most people we know in this space spend too much time reading about ML and not enough time with error messages. Error messages are where the real learning happens — they force you to understand the actual failure mode, not the idealized explanation. Every error you've debugged and understood is worth five tutorials you've passively watched. This is the discipline that separates people who ship from people who study indefinitely.

Connecting the Tools: A Real Architecture Decision

Here's how the three tools of this module actually connect in a real project. Say you want to build a system that identifies whether a Reddit post is asking for advice versus venting (a real classification task used in mental health research).

Hugging Face: Search the Hub, find mental-health-research/roberta-base-mental-health or similar, load it with AutoModelForSequenceClassification.from_pretrained(). You have a pretrained base with domain-relevant pretraining. Add a classification head for 2 classes (it's built in when you specify num_labels=2).

PyTorch: Write your training loop. DataLoader feeds batches of tokenized text. Forward pass produces logits. Cross-entropy loss. Backward pass. AdamW optimizer step. You're doing this in raw PyTorch so you can inspect the loss at each step, plot it, and catch issues immediately. The loop is 30 lines.

Kaggle: Your dataset has 20,000 examples, and you're running 3 epochs with LoRA on a 125M-parameter model. Estimated GPU time: ~4 hours. Kaggle free tier handles it in one session. You push the trained model back to the Hub with model.push_to_hub("your-username/reddit-advice-classifier"). Ship a Space. Done.
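
The raw PyTorch loop described in the PyTorch step above can be sketched as follows. To stay self-contained, this version assumes synthetic numeric features in place of tokenized Reddit posts and a small feed-forward network in place of the pretrained classifier; the shapes, learning rate, and epoch count are illustrative assumptions. The loop structure (DataLoader, forward pass, cross-entropy, backward pass, AdamW step) is the part that carries over.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for tokenized text: 256 examples, 20 features, 2 classes.
features = torch.randn(256, 20)
labels = (features[:, 0] > 0).long()       # a learnable rule, so the loss can fall
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(5):
    model.train()
    running_loss = 0.0
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()               # clear gradients left over from the last step
        logits = model(batch_x)             # forward pass
        loss = loss_fn(logits, batch_y)     # cross-entropy on raw logits
        loss.backward()                     # backward pass
        optimizer.step()                    # AdamW update
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(loader))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

Keeping the per-epoch losses in a list makes it easy to plot them or to stop early when the curve goes flat.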

The Actual Skill: Judgment Under Constraint

Knowing PyTorch syntax is table stakes. Knowing how to search the Hub efficiently is learnable in a weekend. Knowing when free compute is enough is just arithmetic. The actual skill — the one that makes someone genuinely useful in an ML context — is judgment under constraint: given a real problem, a real deadline, and a real resource limit, what is the fastest path to something working?

That judgment develops through doing, not reading. The people around you who are shipping things aren't smarter — they've just built the habit of starting with inadequate information and iterating. The tools in this module are specifically designed to lower the cost of starting: Hugging Face gives you a standing start with pretrained models, PyTorch gives you a debuggable loop, and free compute gives you a GPU without a credit card.

The only thing left is to start. Specifically — tonight, if you can.

Final Practical Takeaway

Pick one small, specific problem you actually care about. Find a Hugging Face model related to it. Load it in a Kaggle notebook. Run it on three examples. Inspect the output. That's all. You don't need to train anything, fine-tune anything, or ship anything today. Just form a working relationship with the full pipeline — model to inference — in an environment with real GPU access. Most people never do this first step, which is why most people stay in tutorial mode indefinitely.

Lesson 4 Quiz

Putting It Together: A Real Project Workflow · 5 questions
1. According to the five-step workflow, why should you run your model on CPU locally before moving to cloud GPU?
Exactly right. CPU debugging is fast-iteration territory — you can test data loading, model instantiation, forward pass shapes, and output format without waiting on GPU queues or worrying about session timeouts. Catching a data loader bug on CPU takes 2 minutes; catching it mid-GPU training run costs you much more.
The reason is iteration speed and cost. Running a quick forward pass on CPU catches bugs in your code logic — wrong tensor shapes, bad data formatting, model instantiation errors — before those bugs consume GPU time. GPU iteration cycles are slower, and on free tiers, GPU time is a limited resource.
2. Your training loss is completely flat after 500 steps. You've checked your learning rate and it seems reasonable. What should you check next, according to the debugging section?
The data loader is almost always the culprit when loss is suspiciously flat. A bug that feeds the same batch every step means the model is repeatedly fitting one batch: gradients look consistent and loss eventually stops changing. Print your first 5 batches and verify they're different before anything else.
Flat loss after ruling out learning rate should make you suspicious of the data pipeline first. A data loader returning the same batch repeatedly produces exactly this symptom — the model iterates on identical data, gradient updates become trivially small, and loss plateaus. Always check data before changing model architecture.
3. You're getting a CUDA out of memory error. You've already halved your batch size once. What are two additional things you could try before concluding you need more VRAM?
Correct. Gradient checkpointing trades compute for memory: it recomputes activations during the backward pass instead of storing them all, which substantially reduces activation memory. Mixed precision runs most operations in fp16 instead of fp32, roughly halving the memory footprint of activations; the weights themselves usually stay in fp32 unless you load the model in half precision. These two together often recover enough headroom to continue on the same GPU.
The two standard memory-reduction techniques in PyTorch are gradient checkpointing (recomputes activations during backward, reducing stored activation memory) and mixed precision training (fp16 instead of fp32 for most operations, roughly halving activation memory). Neither reduces model quality significantly for most tasks, and both are small additions to your training loop.
4. Darius from the lesson's story has been studying deep learning for 8 months without shipping anything. The lesson's diagnosis is that his problem is primarily what?
This is the lesson's central argument. Darius's problem isn't theory or ideas or hardware — the tools exist, the compute is free, the models are on the Hub. The gap is a repeatable workflow: a concrete sequence of steps he can execute on any new idea to get something running fast. Knowledge without process produces indefinite studying.
The lesson's explicit diagnosis is that Darius's gap is workflow, not knowledge. He can explain backpropagation. He knows the theory. What he lacks is a reliable process — a sequence of steps he can execute on any new project to reach running code without getting stuck in setup. That's a different problem with a different solution.
5. In the Reddit advice/venting classifier example, push_to_hub() is called on the trained model. What does this accomplish?
Correct. push_to_hub() is the distribution step — it uploads your trained model weights to a Hub repository under your username. Anyone can then load your model with from_pretrained("your-username/model-name"). This is how the open-source AI ecosystem grows: training a model and sharing it is now as simple as a GitHub push.
push_to_hub() uploads your model weights and configuration to Hugging Face Hub. No review queue, no API conversion, no safety evaluation — it's a direct upload. Once pushed, anyone with your model ID can load it with from_pretrained(). This is the community distribution mechanism that makes the Hub work.
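
The check suggested in question 2 (look at the first 5 batches and confirm they differ) can be automated. The sketch below is framework-free and uses hypothetical names: `count_distinct_batches`, `buggy_loader`, and `fixed_loader` are illustrations, not library functions. It compares batches by their `repr`, which works for small Python lists; for tensors you would compare contents directly (for example with `torch.equal`), since a tensor's printed form is abbreviated for large sizes.

```python
def count_distinct_batches(loader, num_batches=5):
    """Return how many of the first `num_batches` batches are distinct."""
    seen = set()
    for i, batch in enumerate(loader):
        if i >= num_batches:
            break
        seen.add(repr(batch))   # assumption: repr is a faithful fingerprint for small lists
    return len(seen)


def buggy_loader(data, batch_size):
    # Bug: the slice never advances, so every step yields the first batch again.
    for _ in range(0, len(data), batch_size):
        yield data[0:batch_size]


def fixed_loader(data, batch_size):
    # Correct version: the slice start moves forward with each step.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = list(range(20))
buggy_count = count_distinct_batches(buggy_loader(data, 4))
fixed_count = count_distinct_batches(fixed_loader(data, 4))

print("distinct batches (buggy):", buggy_count)
print("distinct batches (fixed):", fixed_count)
```

A count of 1 from the first five batches is a strong sign that the loader, not the model, is the problem.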

Lab 4: Project Architecture Review

Walk the AI through your full project plan. Defend every decision.

Your Role

You're pitching a small AI project to a senior ML engineer at a company you want to intern at. They've agreed to review your project plan before you start building. The AI plays the engineer — experienced, direct, will push hard on vague answers. They want to see that you can connect tools to requirements, not just name-drop frameworks.

Describe a project you'd genuinely want to build (or use the Reddit classifier from the lesson). Tell the AI: (1) the precise task definition, (2) which Hugging Face model you'd start with and why, (3) whether you'd fine-tune or prompt-engineer, (4) which compute platform you'd use and why, and (5) what "done" looks like. The AI will challenge each choice. Minimum 3 exchanges.
Senior ML Engineer — Sam Rivera
AI LAB PARTNER
Alright, I've got 30 minutes. Walk me through your project plan — and I mean all of it: what you're building, which model you're starting from and why not something else, whether you're fine-tuning or not and why, what compute you're planning to use, and how you'll know when it's done. I'll stop you when I hear something vague. Start whenever.

Module 7 Test

Tools of the Trade: PyTorch, Hugging Face, and Free GPUs · 15 questions · Pass at 80%
1. What is "eager execution" in PyTorch, and what problem does it solve compared to TensorFlow v1?
Correct. TF1 required defining the entire computation graph before any execution — making debugging a separate, indirect process. PyTorch's eager mode runs each operation as Python encounters it, so you can inspect tensors, use breakpoints, and write normal Python control flow.
Eager execution means operations execute immediately as Python runs them — no separate compilation step. This makes PyTorch behave like standard Python, enabling normal debugging tools. TF1's static graph approach separated graph definition from execution, which made debugging cryptic.
2. In PyTorch's training loop, why must you call optimizer.zero_grad() before loss.backward()?
Right. PyTorch was deliberately designed to accumulate gradients, because sometimes you want that behavior (e.g., simulating larger batch sizes by accumulating over multiple steps). But for standard training, you need to zero them at the start of each batch or your gradients compound incorrectly across batches.
PyTorch accumulates gradients by default — each call to .backward() adds to the .grad attribute of each tensor rather than replacing it. This is a design choice (gradient accumulation is sometimes useful), but for normal training you must zero them before each backward pass or prior gradients corrupt the current update.
3. What is the difference between nn.Module's __init__ and forward methods?
Correct. __init__ is where you instantiate layers like nn.Linear or nn.Conv2d as instance attributes — this registers their parameters with PyTorch's parameter tracking. forward defines the computation: how input tensors flow through those layers. PyTorch calls forward automatically when you call the model like a function.
They serve distinct purposes: __init__ is for layer definition and parameter registration; forward is for the computation graph. PyTorch needs this separation because it tracks model parameters (for gradient computation and saving/loading) separately from the computation itself.
4. How did PyTorch's market share in ML research conference papers change between its 2017 release and approximately 2022?
That's the documented trajectory. PyTorch's research adoption was swift — within two years of launch it was competitive, and by 2022 it had a decisive majority at venues like NeurIPS. TensorFlow retained enterprise and production use but lost the research community that drives architectural innovation.
PyTorch's rise was real and rapid, reaching approximately 75% of research conference papers by 2022. TensorFlow wasn't discontinued — it retained enterprise users — but the research community, which drives new architectures and techniques, had clearly migrated to PyTorch.
5. What does Hugging Face's pipeline() function abstract away from the user?
Correct. pipeline() is the highest-level inference abstraction — you specify a task like "text-classification" or "summarization," optionally a model, and pass your raw input. It handles tokenization, runs inference, decodes outputs, and returns results in a human-readable format. Great for quickly testing a model's capabilities.
pipeline() is the inference abstraction. Pass a task name and raw input; get back a usable result. It internally handles three things you'd otherwise write yourself: tokenizing input, running the model forward pass, and decoding the output tokens or logits back into something meaningful.
6. A startup wants to build a chatbot that only answers questions about their product documentation (200 pages). They have no ML engineers, only developers. Which approach is most appropriate?
RAG is the right fit: bounded knowledge base, no ML engineers, developers can implement it with existing tools. The docs become searchable, retrieved chunks ground the LLM's answers in actual product content, and the system updates when docs update without retraining anything.
Training from scratch requires enormous data and resources. Fine-tuning to embed facts in weights is unreliable (models hallucinate around fine-tuned facts). Including all docs in every prompt quickly exceeds context windows. RAG — vector database + retrieval at inference time — is purpose-built for this bounded knowledge base scenario.
7. Kaggle's free GPU tier provides how many GPU-hours per week, and what is the maximum single session length?
Correct — 30 hours per week with up to 9 hours per session. Combined with Kaggle's persistent outputs and support for unattended runs, this is genuinely sufficient for many real fine-tuning jobs on smaller models.
Kaggle free gives you 30 GPU-hours/week with sessions up to 9 hours. The key advantages over Colab free tier are: you can run without a browser window open, and outputs persist as dataset artifacts across sessions.
8. What does LoRA (Low-Rank Adaptation) do to reduce the compute cost of fine-tuning large models?
Right. LoRA injects small trainable matrices alongside the frozen pretrained weights. Only these adapters train. Because their dimensionality is low-rank, the parameter count is tiny relative to the base model — enabling fine-tuning of a 7B model on hardware that couldn't train even a fraction of its original parameters.
LoRA freezes the base model entirely and inserts small trainable "adapter" matrices at specific layers. You train only those adapters — millions of parameters instead of the model's billions. At inference time, the adapter weights can be merged back in with no latency cost. This is why it became the standard approach for consumer-hardware fine-tuning.
9. Why does the lesson recommend establishing a single device variable at the top of your training script and calling .to(device) consistently?
Correct. PyTorch doesn't broadcast across devices — a CPU tensor and a GPU tensor can't be used in the same operation. A single device variable and consistent .to(device) calls prevent the class of "RuntimeError: Expected all tensors to be on the same device" errors that catch people off guard.
The reason is device consistency enforcement. PyTorch raises a RuntimeError if you try to operate on tensors from different devices (e.g., CPU input with GPU model weights). A device variable at the top of your script ensures every tensor — your model, your data, your loss — is explicitly placed consistently.
10. In the five-step workflow, what is Step 2 and why does it come before writing any code?
Exactly. Searching the Hub before writing code answers a critical question: what am I actually building, and has someone already built the hard part? Finding a relevant pretrained model changes your plan from "train a model" to "adapt a model" — a fundamentally cheaper and faster project.
Step 2 in the lesson workflow is Hub search. The logic: if a pretrained model already handles 80% of your task, you shouldn't plan around building from scratch. Hub search sets your realistic starting point and determines whether you need fine-tuning or can get away with prompting a good base model.
11. You're running fine-tuning and your loss suddenly spikes to NaN at step 847. What's the most likely immediate fix to try?
Right. NaN loss mid-training almost always means gradients have exploded — the loss surface sent updates into a region where values became infinite, which propagates as NaN. Gradient clipping caps the gradient norm before the optimizer step, preventing explosion. Then investigate your learning rate and loss function numerics.
Sudden NaN loss is the signature of gradient explosion — large gradients compound and produce infinite values that propagate as NaN. The immediate mitigation is gradient clipping, which caps gradient norms before they update weights. Then investigate: is your learning rate too high? Are there zeros or infinities in your input data that could destabilize the loss function?
12. What distinguishes Hugging Face Spaces from a GitHub repository for showcasing ML projects?
Exactly. The portfolio value of a working Space is that it's interactive — someone can use your model in 30 seconds without any setup. A GitHub repo of training scripts requires cloning, environment setup, and assuming the person has compatible hardware. Interactivity removes friction and demonstrates end-to-end capability.
The key distinction is interactivity. A GitHub repo is code to read; a Space is a product to use. The portfolio signal is different too: a Space shows you can take a trained model through to a deployed user experience — the gap most ML learners never cross.
13. When is it appropriate to pay for GPU compute on Lambda Labs or Vast.ai rather than using Kaggle or Colab free tiers?
Right. The decision should be based on three constraints: hardware (does the job need VRAM that free tiers don't have?), compute hours (does it exceed 30 hours/week?), and reliability (is there a deadline where interruptions are unacceptable?). If any of the three applies, paid compute is justified.
The lesson's framework: exhaust free tiers first, but pay when you hit specific constraints. A 1B-parameter model with LoRA might run fine on Kaggle's T4. The threshold isn't model size — it's whether the actual hardware, time, and reliability requirements can be met for free. Estimate GPU-hours before deciding.
14. The lesson argues that "judgment under constraint" is the actual core ML skill. What does this mean in practice?
That's it. Knowing that fine-tuning, RAG, and prompting all exist isn't the skill. The skill is: here's a real thing I need to build, here's my actual time and compute, here's my goal — which approach gets me there fastest? That decision-making under real constraints is what separates people who ship from people who study indefinitely.
Judgment under constraint is the decision-making skill, not the technical implementation skill. It's about knowing when to fine-tune vs. prompt, when to use Kaggle vs. pay for compute, when to start from a Hub model vs. build something custom — and making those calls correctly with incomplete information under time pressure.
15. In the debugging toolkit described in Lesson 4, why is gradient checkpointing mentioned as a potential technique?
Right. Gradient checkpointing trades extra compute for less memory — during the backward pass, it recomputes intermediate activations instead of having stored them all during forward. The result is that models which would hit CUDA OOM errors can train on the same GPU by using ~30-40% more compute but dramatically less VRAM.
Gradient checkpointing is specifically a memory optimization for training. Instead of storing all intermediate activations during the forward pass (which is what consumes most GPU memory), it recomputes them during backward. You pay extra compute time but get lower peak VRAM — allowing larger models or batch sizes on constrained hardware like the T4.
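
The parameter savings behind question 8 come down to simple arithmetic. For one weight matrix of shape d_out × d_in, a rank-r LoRA adapter adds two small matrices (d_out × r and r × d_in), so it trains r × (d_in + d_out) parameters instead of d_in × d_out. The sketch below works this through for a single layer; the 4096 × 4096 layer size and rank 8 are hypothetical values chosen for illustration, not figures from the lesson.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in a rank-`rank` adapter for that matrix."""
    return rank * (d_in + d_out)


d_model = 4096   # assumption: illustrative layer width
rank = 8         # assumption: illustrative LoRA rank

frozen = full_params(d_model, d_model)
trainable = lora_trainable_params(d_model, d_model, rank)
fraction = trainable / frozen

print(f"frozen weights per layer:   {frozen:,}")
print(f"trainable adapter weights:  {trainable:,}")
print(f"trainable fraction:         {fraction:.4%}")
```

The fraction shrinks further as layers get wider, because the adapter grows linearly with width while the full matrix grows quadratically.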