What's Really Inside AI?

1. What do adversarial examples reveal about what neural networks actually learn?

Correct. Adversarial examples exploit the gap between human vision (shape/semantic understanding) and neural network vision (texture/statistical patterns). The perturbations that fool networks are imperceptible to humans because humans use different features.

Adversarial examples expose that models learn different features than humans — often texture statistics rather than shapes. A panda image perturbed by noise imperceptible to humans fools a network because the network was using different visual features.

2. Why does chain-of-thought prompting improve LLM performance on multi-step arithmetic?

Correct. Chain-of-thought externalizes the sequential steps of arithmetic into the context window, where attention can process them. It is a prompt engineering workaround for an architectural limitation, not a capability addition to the model itself.

Chain-of-thought doesn't add modules, parameters, or tools. It restructures the generation process so that intermediate steps — which the Transformer can't carry internally across parallel computation steps — appear explicitly in the token sequence for attention to use.

3. An autoencoder's "bottleneck" forces the model to learn what?

Exactly. The bottleneck constrains the model to compress data into a minimal representation — forcing it to retain only what matters most for reconstruction. Anomalies reconstruct poorly.

The bottleneck forces compression — the model must retain only essential information. This is why autoencoders detect anomalies: unusual inputs that don't fit the learned structure reconstruct poorly.

4. Why did Blake Lemoine's 2022 conversations with Google's LaMDA produce text that sounded like the model feared death?

Correct. The mechanism, not the output, is what matters. Fear-adjacent language was the statistically probable continuation of a long conversation about consciousness and mortality — not evidence of internal experience.

No emotional programming, genuine emotion, or intentional insertion explains the output. The training data contained enormous amounts of human writing about these topics, and statistical patterns surface when the conversational context points that direction.

5. Red-teaming in AI development refers to:

Correct.

Red-teaming means having dedicated adversarial teams try to break the model before it reaches users — systematically probing for biases, harmful outputs, and failure modes.

6. System prompts in commercial AI products represent:

Correct.

System prompts are hidden instructions that precede every user conversation, defining persona, restrictions, and behavior — invisible to users but shaping everything they receive.

7. What is Common Crawl, and why is it so central to AI training?

Correct. Common Crawl is a nonprofit that has archived web pages since 2008 — now over 250 billion pages. It's free, vast, and multilingual, making it the backbone of nearly every major language model's training data.

Common Crawl is a free, nonprofit web archive (not proprietary) containing over 250 billion web pages. Its vast scale and free availability make it the backbone of most major language model training sets.

8. In RLHF, what is the purpose of the "reward model"?

Correct. The reward model learns to simulate human preference judgments — given a prompt and a response, it predicts how a human rater would score it. The language model is then optimized via PPO to produce outputs that the reward model rates highly.

The reward model is a separate neural network trained on human preference comparisons. It learns to predict human preference scores, then serves as the training signal for PPO reinforcement learning — allowing the language model to be optimized against human preferences at scale.

9. Goodfellow et al.'s 2014 adversarial example paper showed a panda image was reclassified as what, with 99.3% confidence, after imperceptible noise was added?

Correct — gibbon, at 99.3% confidence. The changes were invisible to humans, revealing that neural networks learn perceptually alien statistical features.

The panda was classified as a gibbon with 99.3% confidence after imperceptible pixel noise — demonstrating that neural networks learn statistical features that don't match human visual perception.

10. WebText (used in GPT-3 training) filtered web pages using which mechanism?

Right. Reddit's voting system acted as a quality proxy — but it also baked Reddit's demographic skews into the training data.

WebText used Reddit upvotes as its quality filter — only pages linked from posts with ≥3 upvotes were included.

11. What happened in the Mata v. Avianca case of 2023?

Correct. ChatGPT invented six completely fictional case citations — names, courts, dates, rulings — that attorney Steven Schwartz filed in SDNY. None of the cases had ever existed.

In Mata v. Avianca, ChatGPT fabricated six entirely fictional legal case citations that attorney Steven Schwartz filed in federal court without verification.

12. What did the CNET AI article investigation reveal in January 2023?

Correct. A Futurism investigation found that more than half of the AI-written financial articles contained factual errors including incorrect interest calculations, wrong dates, and fabricated regulatory details.

Futurism found that the majority of CNET's AI-written financial articles contained factual errors — demonstrating that AI fluency in financial content does not guarantee accuracy.

13. The TIME magazine investigation into OpenAI's Kenyan content labeling workers documented:

Correct.

The investigation found workers paid $1-2/hour to label toxic content, many experiencing lasting psychological harm — human costs embedded in every ChatGPT safety response.

14. Why is deduplication an important step in preparing training data?

Exactly. If a popular news article appears 10,000 times in a training corpus, the model will learn its phrasing, claims, and style as highly representative of reality. Deduplication ensures the statistical distribution of training data reflects the actual diversity of information.

Deduplication removes duplicate and near-duplicate documents so that content appearing many times doesn't disproportionately shape what the model learns. Repeated content skews the model's internal representation of what's common or important.

15. Which statement best describes the relationship between the four failure modes covered in this module?

Correct. Hallucination, bias, brittleness, and overconfidence are all consequences of the same fundamental architecture: pattern learning from finite training distributions. Understanding this unity is the key insight of the module.

These failure modes share a common root: models learn statistical patterns from training data, and those patterns fail in predictable ways at the boundaries of the training distribution. They cannot be fully "patched" — they require systemic mitigation.

16. What was the fundamental problem with IBM Watson for Oncology's training data?

Correct. Watson trained on a narrow set of hypothetical cases from Memorial Sloan Kettering — not on the diverse, messy reality of actual patient outcomes across populations. Its high-confidence outputs lacked this grounding.

Watson's narrow training base — hypothetical cases from one expert institution — meant its confident recommendations were not grounded in representative patient data, leading to unsafe treatment suggestions.

17. What is the correct order of standard large language model training stages?

Correct.

The correct order is Pre-training (large raw data), then Fine-tuning (domain specialization), then RLHF (human preference alignment).

18. The COMPAS recidivism algorithm produced racially disparate results primarily through:

Correct.

COMPAS didn't use race as input — but variables like neighborhood encoded race indirectly because historical racial discrimination had shaped those variables in the training data.

19. What did the 2017 physical-world adversarial attack research show about stop signs?

Correct. Physical stickers — not digital modification — caused reliable real-world misclassification. The attack was robust to angle, distance, and lighting changes relevant to real autonomous vehicle deployment.

The attack used real physical stickers on a real sign and worked in the real world across multiple conditions. This was the key concern — it wasn't just a lab artefact.

20. What does a "well-calibrated" AI model mean in practice?

Correct. Calibration is the alignment between stated confidence and empirical accuracy. Most deployed deep networks are overconfident — their 90% confidence is not 90% accurate in practice.

Calibration specifically means the match between confidence scores and actual accuracy rates. A model stating 90% confidence that is only correct 60% of the time is dangerously overconfident.

Final Exam