Deep Learning: Build Real Things

1. Transfer learning means:

Transfer learning is the dominant applied ML paradigm — leverage a pretrained model's learned representations and fine-tune on your specific task. You don't need the data or compute to train from scratch.

Transfer learning: start from pretrained weights, which encode general features learned from large datasets, then fine-tune on your specific task with much less data. This is how most production AI applications are built today.
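A minimal sketch of the freeze-then-fine-tune pattern in PyTorch. The `backbone` here is a hypothetical stand-in for a real pretrained network (a tiny randomly initialized layer, so nothing is downloaded), and the layer sizes and batch are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained feature extractor. In practice this
# would be a real pretrained network with learned weights.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())

# Freeze the "pretrained" weights so fine-tuning leaves them unchanged.
for p in backbone.parameters():
    p.requires_grad_(False)

# New task-specific head: the only part that trains.
head = nn.Linear(32, 3)
model = nn.Sequential(backbone, head)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# One fine-tuning step on a small made-up batch.
x, y = torch.randn(8, 16), torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(trainable)  # only the head's weight and bias
```

Freezing everything but the head is the cheapest variant. Unfreezing some backbone layers with a small learning rate is a common alternative when you have more data.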

2. You're building a loan approval model (positive = approve). False positives (approving a bad loan) cost the company $10,000. False negatives (denying a good applicant) cost $500 in lost business. Which metric priority is correct?

Right. A false positive costs 20x a false negative in this scenario. You want to be selective about approvals — high precision means when you approve a loan, it's very likely a good one. Some good applicants will be denied (lower recall), but that's the cheaper error.

The asymmetric costs drive the metric choice. False positives cost $10,000 vs. $500 for false negatives — a 20x difference. You want high precision (when you say yes, you're right) even at the cost of lower recall (some good applicants get denied). Accuracy ignores this cost structure entirely.
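To make the cost argument concrete, here is a small worked comparison. The two operating points and the 1,000-applicant population are hypothetical numbers chosen for illustration. Only the $10,000 and $500 costs come from the question.

```python
COST_FP = 10_000  # approving a bad loan
COST_FN = 500     # denying a good applicant

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def total_cost(fp, fn):
    return fp * COST_FP + fn * COST_FN

# Hypothetical results on the same 1,000 applicants (800 good, 200 bad).
lenient = {"tp": 780, "fp": 60, "fn": 20}   # approves freely
strict = {"tp": 640, "fp": 5, "fn": 160}    # approves selectively

for name, m in [("lenient", lenient), ("strict", strict)]:
    print(
        name,
        f"precision={precision(m['tp'], m['fp']):.3f}",
        f"recall={recall(m['tp'], m['fn']):.3f}",
        f"cost=${total_cost(m['fp'], m['fn']):,}",
    )
```

With these made-up numbers the strict model has lower recall (0.800 vs. 0.975) but costs $130,000 instead of $610,000, because each avoided false positive is worth twenty false negatives.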

3. A startup wants to build a chatbot that only answers questions about their product documentation (200 pages). They have no ML engineers, only developers. Which approach is most appropriate?

RAG is the right fit: bounded knowledge base, no ML engineers, developers can implement it with existing tools. The docs become searchable, retrieved chunks ground the LLM's answers in actual product content, and the system updates when docs update without retraining anything.

Training from scratch requires enormous data and resources. Fine-tuning to embed facts in weights is unreliable (models hallucinate around fine-tuned facts). Including all 200 pages in every prompt is slow and costly per query, and can exceed the context window of many models. RAG (a vector database plus retrieval at inference time) is purpose-built for this bounded knowledge base scenario.
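A toy sketch of the retrieval step, under loud assumptions: word-count vectors and cosine similarity stand in for a real embedding model and vector database, the three doc chunks are invented, and `build_prompt` is a hypothetical helper whose output you would pass to whatever LLM API you use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Invented documentation chunks; a real system would split the 200 pages.
chunks = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first day of each billing cycle.",
    "The API rate limit is 100 requests per minute per key.",
]
# The "index": each chunk stored next to its word-count vector.
index = [(chunk, Counter(tokenize(chunk))) for chunk in chunks]

def retrieve(question, k=1):
    q = Counter(tokenize(question))
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question):
    # Ground the model by pasting only the retrieved chunks into the prompt.
    context = "\n".join(retrieve(question))
    return f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"

print(build_prompt("How do I reset my password?"))
```

Updating the docs only means rebuilding `index`. No model weights change, which is the property that makes this approach manageable for a team without ML engineers.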

4. High-confidence wrong predictions (model confident but incorrect) are especially valuable for error analysis because:

Right. High-confidence failures reveal systematic model biases — the model has learned something wrong and is very sure about it. These are more informative than low-confidence errors, which might just be ambiguous examples.

High-confidence wrong predictions are your model's systematic failures — not random noise. The model "knows" something wrong, which means a specific pattern in the data or architecture is causing it. These are the most diagnostic examples to examine manually.
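One way to pull these examples out for manual review, sketched in plain Python. The predictions list and the 0.9 confidence cutoff are made up for illustration.

```python
# Each record: (true_label, predicted_label, model_confidence)
predictions = [
    ("cat", "cat", 0.97),
    ("dog", "cat", 0.96),
    ("dog", "dog", 0.55),
    ("cat", "dog", 0.52),
    ("dog", "cat", 0.91),
]

# Keep only the mistakes, then rank them by how sure the model was.
errors = [p for p in predictions if p[0] != p[1]]
errors.sort(key=lambda p: p[2], reverse=True)

# Hypothetical cutoff: review anything wrong at 90% confidence or more.
high_conf_errors = [p for p in errors if p[2] >= 0.9]

for true_label, predicted, confidence in high_conf_errors:
    print(f"true={true_label} predicted={predicted} confidence={confidence:.2f}")
```

Here both high-confidence errors are dogs predicted as cats, which is the kind of repeated pattern worth investigating. The 0.52 error is more likely just an ambiguous example.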

5. What is the difference between nn.Module's __init__ and forward methods?

Correct. __init__ is where you instantiate layers like nn.Linear or nn.Conv2d as instance attributes — this registers their parameters with PyTorch's parameter tracking. forward defines the computation: how input tensors flow through those layers. PyTorch calls forward automatically when you call the model like a function.

They serve distinct purposes: __init__ is for layer definition and parameter registration; forward is for the computation graph. PyTorch needs this separation because it tracks model parameters (for gradient computation and saving/loading) separately from the computation itself.
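A minimal sketch of that split. `TinyClassifier` and its layer sizes are hypothetical names chosen for this example.

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    def __init__(self, in_features=4, hidden=8, num_classes=3):
        super().__init__()
        # Layer definition: assigning layers as attributes registers
        # their parameters with the module.
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # Computation: how an input tensor flows through the layers.
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyClassifier()

# Calling the model like a function runs forward() for you.
logits = model(torch.randn(5, 4))

param_names = [name for name, _ in model.named_parameters()]
print(logits.shape, param_names)
```

Note that the code calls `model(x)` rather than `model.forward(x)`. Going through the call syntax is the usual convention because it also runs any hooks attached to the module.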

6. You open a fresh Colab notebook and run import torch; print(torch.__version__). It works. Then you run torch.cuda.is_available() and it returns False. What is the most likely fix?

Correct. Colab defaults to CPU. GPU must be explicitly enabled under Runtime settings.

PyTorch is installed and working — the issue is the runtime type. Go to Runtime → Change runtime type → T4 GPU.
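A common defensive pattern, sketched here, is to pick the device once and fall back to CPU so the same notebook runs either way. After switching the runtime type, `device` should report `cuda`.

```python
import torch

# Use the GPU when the runtime has one, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models, via .to(device)) are placed on the chosen device.
x = torch.ones(2, 2, device=device)
print(device, x.device)
```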

7. How did PyTorch's market share in ML research conference papers change between its 2017 release and approximately 2022?

That's the documented trajectory. PyTorch's research adoption was swift — within two years of launch it was competitive, and by 2022 it had a decisive majority at venues like NeurIPS. TensorFlow retained enterprise and production use but lost the research community that drives architectural innovation.

PyTorch's rise was real and rapid, reaching approximately 75% of research conference papers by 2022. TensorFlow wasn't discontinued — it retained enterprise users — but the research community, which drives new architectures and techniques, had clearly migrated to PyTorch.

8. What does a reliability diagram (calibration plot) with points well above the diagonal indicate?

Correct. If points are above the diagonal, actual outcomes occur more frequently than the model's predicted probability suggests. A model that says "30% chance" for events that happen 60% of the time is underconfident — it's not confident enough given the true rates.

In a reliability diagram, the diagonal is perfect calibration. Points above the diagonal mean actual rates exceed predicted probabilities — the model is underconfident. Points below the diagonal mean predicted probabilities exceed actual rates — that's overconfidence.
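The points on a reliability diagram can be computed by binning predictions and comparing each bin's mean predicted probability with its observed positive rate. This is a plain-Python sketch. The function name, the bin count, and the toy data (built to mirror the "says 30%, happens 60%" example) are assumptions for illustration.

```python
def reliability_bins(probs, outcomes, n_bins=5):
    """Return (mean predicted probability, observed rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed))
    return rows

# Toy data: events predicted at 30% happen 60% of the time,
# and events predicted at 70% happen 90% of the time.
probs = [0.3] * 10 + [0.7] * 10
outcomes = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 1

for mean_pred, observed in reliability_bins(probs, outcomes):
    side = "above" if observed > mean_pred else "on or below"
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} ({side} the diagonal)")
```

Both bins land above the diagonal, so this toy model is underconfident across its range.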

9. Kaggle's free GPU tier provides how many GPU-hours per week, and what is the maximum single session length?

Correct — 30 hours per week with up to 9 hours per session. Combined with Kaggle's persistent outputs and support for unattended runs, this is genuinely sufficient for many real fine-tuning jobs on smaller models.

Kaggle free gives you 30 GPU-hours/week with sessions up to 9 hours. The key advantages over Colab free tier are: you can run without a browser window open, and outputs persist as dataset artifacts across sessions.

10. To confirm your training loop is mechanically correct before running a full training job, you should:

Correct. The "overfit 10 examples" test is the most reliable quick sanity check for a training loop. If you can drive loss near zero on 10 examples, the mechanics — data batching, loss function, backward pass, optimizer step — are all working.

Overfitting 10 examples is the canonical training loop sanity check. If you can't overfit 10 examples, there's a fundamental issue with your loop — not your model capacity or data quantity.
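A sketch of the test in PyTorch, assuming a small regression setup. The random data, the model size, the learning rate, and the step count are all arbitrary choices for illustration, not recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Ten made-up examples with random targets: there is nothing to generalize,
# so the only way to lower the loss is to memorize them.
x = torch.randn(10, 4)
y = torch.randn(10, 1)

model = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(model(x), y).item()

for step in range(1000):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass and loss
    loss.backward()                # backward pass
    optimizer.step()               # parameter update

final_loss = loss_fn(model(x), y).item()
print(f"initial={initial_loss:.4f} final={final_loss:.6f}")
```

If the final loss does not fall far below the initial loss, suspect the loop itself first: a missing `zero_grad()`, a loss computed on the wrong tensors, or an optimizer built over the wrong parameters.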

11. A 3×3 convolutional filter sliding across an image computes, at each position:

Right. Convolution is a dot product (element-wise multiplication then sum) between filter weights and the patch of image pixels at each position. The result is one number — the filter's activation at that location.

At each position, the filter computes a weighted sum — its weights multiplied element-wise by the underlying pixel patch, then summed. That's a dot product.
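The same computation written out in plain Python, as a sketch with no padding and stride 1. The 4×4 image and the vertical-edge filter are made-up values. Like most deep learning frameworks, this slides the filter without flipping it.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1); one dot product per position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply filter weights by the patch under them, then sum.
            total = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
            row.append(total)
        output.append(row)
    return output

# Dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds when pixels get brighter from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

print(conv2d_valid(image, vertical_edge))  # [[3, 3], [3, 3]]
```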

12. Which of the following is NOT a valid technique to reduce overfitting?

Correct. Adding more parameters increases model capacity, which makes overfitting worse, not better. Dropout, augmentation, and early stopping all directly counteract overfitting.

More parameters = more capacity = more potential to memorize training data. That makes overfitting worse. The other three options (dropout, augmentation, early stopping) are all standard anti-overfitting tools.
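Of the three valid techniques, early stopping is the easiest to show without a framework. This is a sketch of the patience rule only. The function name, the patience value, and the validation losses are made up, and the list is assumed to be non-empty.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stopped_at) for a non-empty list of validation losses."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_epoch, epoch

# Validation loss improves, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
best_epoch, stopped_at = early_stopping_epoch(val_losses)
print(best_epoch, stopped_at)  # 2 4
```

In a real loop you would also save the weights at `best_epoch` and restore them after stopping.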

13. Adam optimizer outperforms SGD on most first attempts primarily because:

Correct. Adam's key advantage is per-parameter adaptive learning rates. Each weight gets updates sized appropriately to its own gradient history, making convergence more reliable without manual learning rate tuning.

Adam's signature feature is adaptive per-parameter learning rates. Instead of applying the same learning rate to every weight, it tracks gradient history for each weight individually and adjusts accordingly. That's why it typically needs less tuning and often converges faster than vanilla SGD on a first attempt.
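A single-parameter sketch of the Adam update rule with its usual default hyperparameters, written in plain Python to show the per-parameter effect. The two gradient values are made up to represent one weight with huge gradients and one with tiny gradients.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

grads = {"w_large": 100.0, "w_small": 0.01}
lr = 0.001

# Plain SGD: update size is proportional to the raw gradient.
sgd_updates = {name: lr * g for name, g in grads.items()}

# Adam: update size is normalized by each parameter's own gradient scale.
adam_updates = {}
for name, g in grads.items():
    new_param, _, _ = adam_step(0.0, g, m=0.0, v=0.0, t=1, lr=lr)
    adam_updates[name] = 0.0 - new_param

print(sgd_updates)
print(adam_updates)
```

The SGD steps differ by a factor of 10,000 between the two weights, while both Adam steps come out at roughly `lr`. That normalization is what makes a single learning rate workable across very differently scaled parameters.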

14. You define a PyTorch model class but forget to call super().__init__() in __init__. What is the most likely result?

Right. super().__init__() initializes the parent nn.Module, which is what enables parameter tracking, GPU transfer, and the training machinery.

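Without super().__init__(), the nn.Module parent class isn't initialized, so its internal parameter and submodule registries don't exist. Layers can't be registered, and in practice PyTorch raises an AttributeError as soon as __init__ assigns the first layer, so gradient-based training never gets started.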
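A quick way to see the failure for yourself. `Broken` and `Fixed` are hypothetical class names, and the exact error text may vary between PyTorch versions, so the sketch only relies on the exception type.

```python
from torch import nn

class Broken(nn.Module):
    def __init__(self):
        # super().__init__() is deliberately missing here.
        self.fc = nn.Linear(4, 2)

class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

try:
    Broken()
    error_message = None
except AttributeError as exc:
    # Raised when the layer is assigned before the parent is initialized.
    error_message = str(exc)

print(error_message)

# With the parent initialized, the layer's weight and bias are tracked.
num_params = sum(p.numel() for p in Fixed().parameters())
print(num_params)  # 4 * 2 weights + 2 biases = 10
```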

15. Which loss function is standard for regression tasks, and why does it amplify large errors?

Right. MSE's squaring property means a 10x bigger error costs 100x more in loss. That makes the optimizer prioritize fixing the worst mistakes first.

MSE squares the error term. A small error squared is tiny; a large error squared is very large. That amplification is intentional.
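The amplification is easy to check by hand. This plain-Python sketch compares MSE with mean absolute error (MAE), which is included only as a contrast and is not part of the question.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# One prediction off by 1, one prediction off by 10.
small_mse, large_mse = mse([0.0], [1.0]), mse([0.0], [10.0])
small_mae, large_mae = mae([0.0], [1.0]), mae([0.0], [10.0])

print(large_mse / small_mse)  # 100.0: a 10x error costs 100x under MSE
print(large_mae / small_mae)  # 10.0: MAE grows only linearly
```

The same squaring that prioritizes large errors also makes MSE sensitive to outliers, which is the usual reason to consider alternatives on noisy targets.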

16. The "curse of dimensionality" in one-hot encoding refers to:

Right. In very high-dimensional spaces, distances between random points tend to converge — nothing is meaningfully closer or farther than anything else. Sparse one-hot vectors exacerbate this: all word pairs have the same Euclidean distance (√2), making similarity computations meaningless.

The curse of dimensionality is fundamentally about distance metrics degrading in high-dimensional sparse spaces. When every word vector has exactly one non-zero dimension and is otherwise identical in structure, there's no geometric signal to exploit for similarity computations.
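The √2 claim can be verified directly. The four-word vocabulary below is made up. Any vocabulary gives the same result, because two distinct one-hot vectors always differ in exactly two positions.

```python
import math
from itertools import combinations

vocab = ["cat", "kitten", "dog", "carburetor"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distances = {
    (a, b): euclidean(one_hot(a), one_hot(b)) for a, b in combinations(vocab, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 4))
```

"cat" is exactly as far from "kitten" as it is from "carburetor". Learned dense embeddings exist to fix this by placing related words closer together.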

17. When is F1 score a better metric than accuracy?

Right. F1 score balances precision and recall, making it much more informative than accuracy when classes are imbalanced or when both types of errors matter. Accuracy collapses meaningful performance differences into one number that can be gamed by the majority class.

F1 is most valuable when class imbalance makes accuracy misleading, or when both precision (not too many false alarms) and recall (not too many misses) matter for your application. Accuracy alone can be gamed by predicting the majority class for everything.
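A small worked example of the majority-class problem, in plain Python. The 5% positive rate and the always-negative "model" are made up for illustration.

```python
def scores(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, f1

# 5 positives among 100 examples, and a "model" that always predicts negative.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

accuracy, f1 = scores(y_true, always_negative)
print(accuracy, f1)  # 0.95 0.0
```

Accuracy reports 95% for a model that never finds a single positive. F1 reports 0, which matches how useful the model actually is.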

18. What does LoRA (Low-Rank Adaptation) do to reduce the compute cost of fine-tuning large models?

Right. LoRA injects small trainable matrices alongside the frozen pretrained weights, and only these adapters train. Because the adapters are low-rank, their parameter count is tiny relative to the base model, which makes it possible to fine-tune a 7B model on hardware that could not hold the gradients and optimizer state for full fine-tuning.

LoRA freezes the base model entirely and inserts small trainable "adapter" matrices at specific layers. You train only those adapters — millions of parameters instead of the model's billions. At inference time, the adapter weights can be merged back in with no latency cost. This is why it became the standard approach for consumer-hardware fine-tuning.
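A back-of-the-envelope sketch of both claims, in plain Python. It assumes the common formulation where a frozen weight matrix W gets a trainable update B·A with a small rank. The 4096×4096 layer size, the rank of 8, and the tiny 2×2 example are illustrative choices, and scaling factors are omitted.

```python
def lora_param_counts(d_out, d_in, rank):
    full = d_out * d_in              # parameters in the frozen weight matrix W
    adapter = rank * (d_in + d_out)  # parameters in B (d_out x r) and A (r x d_in)
    return full, adapter

full, adapter = lora_param_counts(4096, 4096, 8)
print(full, adapter, adapter / full)  # 16777216 65536 0.00390625

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

# Tiny rank-1 example showing that merging adds no inference cost.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weights
B = [[0.5], [-1.0]]            # trainable, 2 x 1
A = [[2.0, 0.0]]               # trainable, 1 x 2
x = [1.0, 1.0]

# During training: base path plus adapter path.
unmerged = [w + d for w, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# For inference: fold B·A into W once, then use a single matrix.
BA = matmul(B, A)
merged_W = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]
merged = matvec(merged_W, x)

print(unmerged, merged)  # [4.0, 5.0] [4.0, 5.0]
```

For this one layer the adapter trains about 0.4% as many parameters as the full matrix, and the merged weights give the same output as running the adapter separately.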

19. EfficientNet's key innovation over earlier architectures like VGG and plain ResNet was:

Right. EfficientNet's compound scaling approach balances width (channel count), depth (layer count), and resolution simultaneously — rather than arbitrarily scaling one dimension — resulting in better accuracy for a given computational budget.

EfficientNet's innovation is compound scaling: systematically scaling width, depth, and resolution together using a single compound coefficient and fixed per-dimension ratios, rather than just making one dimension bigger.
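A sketch of the scaling rule in plain Python. The base ratios (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values reported in the EfficientNet paper as best recalled here, so treat them as an assumption. `compound_scale` is a hypothetical helper name.

```python
# Per-dimension base ratios (assumed from the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    """Multipliers for each dimension at compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }

# Compute grows roughly with depth * width^2 * resolution^2, so the ratios
# are chosen to make each +1 in phi cost about 2x the FLOPs.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 3))  # 1.92

for phi in range(4):
    print(phi, {k: round(v, 2) for k, v in compound_scale(phi).items()})
```

One knob (phi) moves all three dimensions together, which is the contrast with earlier practice of only deepening or only widening a network.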

20. Which task does aspect-based sentiment analysis (ABSA) perform that standard sentiment analysis does not?

Right. ABSA decomposes the heterogeneous sentiment in real reviews into attribute-specific signals. "Food was excellent but service was terrible" is a mixed-sentiment document. ABSA surfaces that the food aspect has strong positive sentiment and the service aspect has strong negative sentiment — which produces actionable operational insight that document-level sentiment loses entirely.

ABSA's distinguishing feature is aspect decomposition, not sarcasm detection or continuous scoring. The practical value is that real customer feedback rarely has uniform sentiment — ABSA preserves the variation across product dimensions that document-level analysis collapses.

Final Exam