Module 5 · Lesson 1

Anatomy of an AI Paper

Every landmark result in AI — from AlexNet to GPT-4 — arrived in a structured document. Learning to read that structure is the first skill researchers develop.

What are the standard sections of an ML research paper, and what does each one actually tell you?

In June 2017, Ashish Vaswani, Noam Shazeer, and six colleagues, most of them at Google Brain and Google Research, posted a manuscript titled "Attention Is All You Need" to arXiv. It went on to become one of the most-cited papers in computing. The ideas it contained were not hidden: they were laid out in plain sections anyone could read. But you had to know which section to look at first.

The Standard Sections

Machine-learning papers follow a remarkably consistent structure regardless of venue — NeurIPS, ICML, ICLR, arXiv. Once you recognize the scaffold, you can extract the key contribution of any paper in under ten minutes.

Abstract (100–250 words)

States the problem, the proposed solution, and the headline result. Read this first. If the abstract doesn't excite you, the paper probably won't either. In "Attention Is All You Need," the abstract states up front that the model is "based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."

Introduction

Motivates the problem with citations, previews the contribution, and outlines the paper. Often contains the clearest English statement of what is new. Many papers close the introduction with a bulleted list of contributions; "Attention Is All You Need" instead states its contribution in the introduction's final paragraph.

Related Work

Places the paper in the landscape of prior work. Tells you who the authors see as competitors or predecessors. Skimming this section tells you which earlier papers you should also read.

Method / Architecture

The technical core. Equations, diagrams, pseudocode. In Transformer papers this is where multi-head attention and positional encoding are defined. You don't need to understand every equation on first read — focus on the block diagram.

Experiments

Shows that the method works. Contains benchmark comparisons, ablation studies (what happens when you remove each component), and training details. The most important table is usually labeled "Main Results" or "State of the Art Comparison."

Ablation Studies

A subset of experiments that systematically disables or varies individual components. Tells you which parts of the method actually matter. In "Attention Is All You Need," the ablations varied the number of attention heads: single-head attention scored 0.9 BLEU below the best setting, and quality also dropped with too many heads.

Conclusion & Limitations

Summarizes findings and (in well-written papers) honestly states where the method fails. The 2021 DALL-E paper's limitations section acknowledged that the model had difficulty binding attributes to the correct objects in complex scenes.

Appendix

Supplementary proofs, hyperparameter tables, additional figures. Usually skipped on first read unless you are reproducing results. GPT-3's appendix ran to dozens of pages of task-by-task performance tables.

A Reading Strategy That Works

Experienced researchers rarely read a paper front-to-back on first pass. A widely used approach is the three-pass method, described by S. Keshav of the University of Waterloo in his 2007 essay "How to Read a Paper":

Pass 1 (5–10 minutes): Read the abstract, introduction, section headings, and conclusion only. Decide whether the paper is worth a deeper read.

Pass 2 (30–60 minutes): Read the full paper, skipping proofs and dense derivations. Pay attention to all figures and tables — they contain the densest information per square inch of any section.

Pass 3 (several hours): Reconstruct the paper's logic from scratch. Try to re-derive key equations. This pass is reserved for papers you need to implement or build upon.

Real Case — AlexNet 2012

The 2012 ImageNet paper "ImageNet Classification with Deep Convolutional Neural Networks" by Krizhevsky, Sutskever, and Hinton is nine pages. Its key contribution, using ReLU activations and dropout to train a deep CNN on two GPUs, appears in sections 3.1 (ReLU), 3.2 (multi-GPU training), and 4.2 (dropout). A reader who only read the abstract and skimmed those sections could identify the core innovation in under fifteen minutes. That paper launched the modern deep learning era.

What the Title and Abstract Signal

AI paper titles follow recognizable patterns. A title like "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018) contains the acronym, the method type, and the application domain. Titles ending in "Is All You Need" or "Without [Something]" signal that the authors are claiming to remove a prior assumption. Titles with colons typically put the acronym before the colon and the plain-English description after.

Abstract parsing skill is the single highest-leverage reading skill. A well-written abstract answers five questions: What problem? Why hard? What did we do? How did we measure it? What did we find? Practice finding each answer in under thirty seconds.

Key Terms

Ablation study: An experiment that removes one component at a time to measure its individual contribution to overall performance.

Benchmark: A standardized dataset and evaluation protocol used to compare methods. Examples: ImageNet, GLUE, MMLU.

arXiv: A free preprint server at arxiv.org where AI researchers post papers before (or instead of) formal peer review. Most landmark AI papers appear here first.

Contribution: The specific new thing a paper adds to the field, such as a new architecture, algorithm, dataset, or theoretical insight.

Lesson 1 Quiz

Anatomy of an AI Paper — 4 questions

Which section of an ML paper typically contains the clearest plain-English statement of what is new about the work?

Correct. The introduction motivates the problem and previews contributions, often in the clearest prose in the paper. The abstract is shorter and more compressed.

Not quite. While the abstract summarizes the paper, the introduction is where authors typically spell out contributions in the clearest language, often in a bulleted list.

An ablation study is best described as:

Correct. Ablation studies systematically disable individual components to reveal which parts of a method actually matter for performance.

Not quite. Ablation studies specifically test what happens when components are removed one at a time — they reveal which parts drive performance.

In the three-pass reading method, what is the main goal of the first pass?

Correct. The first pass — abstract, intro, headings, conclusion — takes 5–10 minutes and answers: is this paper relevant and interesting enough to read further?

Not quite. The first pass is a quick scan (5–10 min) to decide relevance. Deep dives into equations and figures happen in passes two and three.

The 2012 AlexNet paper by Krizhevsky, Sutskever, and Hinton is historically significant because it:

Correct. AlexNet's use of ReLU activations, dropout, and data augmentation on two GPUs produced a top-5 error rate of 15.3% on ImageNet — nearly 11 points better than the second-place entry — launching the deep learning era.

Not quite. The Transformer was introduced in 2017 by Vaswani et al. AlexNet (2012) showed deep CNNs with ReLU and dropout could dramatically beat prior computer vision methods on ImageNet.

Lab 1 — Paper Anatomy Practice

Practice identifying sections and reading strategies with your AI research coach

Your Task

You'll practice reading AI papers strategically. Ask your coach to walk you through how to read a specific paper, quiz you on what each section contains, or help you practice parsing an abstract. Try at least three exchanges.

Suggested start: "Walk me through how to do a first-pass read of the BERT paper" — or — "Quiz me on what an ablation study reveals."

Research Coach

Paper Reading

Welcome to Lab 1. I'm your paper-reading coach. Ask me to walk you through the anatomy of a real AI paper, quiz you on sections, or help you practice parsing an abstract quickly. What would you like to work on?

Module 5 · Lesson 2

Metrics, Benchmarks, and What They Actually Measure

A number without context is noise. Understanding benchmarks lets you judge whether a claimed breakthrough is real or an artifact of evaluation design.

How do you tell whether a benchmark result reflects genuine progress — or just clever benchmark selection?

In 2018, NYU researchers released GLUE — the General Language Understanding Evaluation benchmark — to create a unified test for NLP models. Within two years, BERT and its successors had saturated the benchmark, scoring above the human baseline. By 2019, GLUE was replaced by the harder SuperGLUE. By 2021, models were saturating that too. The benchmark arms race illustrates a critical lesson: a metric is only as good as the capabilities it actually probes.

Common ML Evaluation Metrics

Different tasks use different metrics. Confusing them — or failing to notice which one a paper uses — is a common pitfall for paper readers.

Accuracy: Percentage of predictions that are correct. Simple but misleading on imbalanced datasets.

F1 Score: Harmonic mean of precision and recall. Standard for classification tasks.

BLEU: n-gram overlap for translation. Correlates poorly with human judgment at high scores.

Perplexity: How well a language model predicts a test set. Lower is better. Used in GPT papers.

Top-1 / Top-5 Error: Image classification. Top-5 counts a prediction as correct if the true label is among the model's 5 best guesses.

MMLU Score: Percentage correct on multiple-choice questions spanning 57 subjects. Standard LLM reasoning benchmark since 2021.
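To make these definitions concrete, here is a minimal Python sketch on made-up data. The labels, probabilities, and function names are illustrative assumptions, not taken from any paper. It shows how accuracy can look strong on an imbalanced dataset while F1 exposes the problem, and how perplexity and top-k error are computed from model outputs.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each true token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def top_k_error(true_labels, ranked_guesses, k):
    """Fraction of examples whose true label is NOT among the k best guesses."""
    misses = sum(t not in guesses[:k] for t, guesses in zip(true_labels, ranked_guesses))
    return misses / len(true_labels)

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100          # a "model" that never predicts the positive class
acc = accuracy(y_true, always_negative)
f1 = f1_score(y_true, always_negative)

# A model that assigns probability 0.25 to every true token has perplexity of about 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Two hypothetical images with the model's guesses ranked best-first.
labels = ["cat", "dog"]
guesses = [["dog", "cat", "fox"], ["fox", "cat", "bird"]]
top1 = top_k_error(labels, guesses, k=1)
top2 = top_k_error(labels, guesses, k=2)

print(f"accuracy={acc:.2f}  F1={f1:.2f}  perplexity={ppl:.2f}  top-1={top1:.2f}  top-2={top2:.2f}")
```

The always-negative model reaches 95% accuracy while its F1 on the positive class is zero, which is why a paper reporting only accuracy on an imbalanced task deserves a second look.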

How to Read a Results Table

The main results table is the most information-dense element in any ML paper. To read it correctly, check four things before looking at the numbers:

1. What is the metric? Is higher better or lower better? BLEU and accuracy go up; perplexity and error rate go down.

2. What is the test set? Is it the standard held-out test split, or a custom set the authors created? Custom test sets should raise scrutiny.

3. Are the baselines fair? Are the competing methods trained on the same data, with comparable compute? In 2020, papers comparing to GPT-2 baselines while training on far more data were widespread and frequently misleading.

4. Is variance reported? Single-run results without confidence intervals or standard deviations across seeds are unreliable. The 2022 ML Reproducibility Challenge found that over 30% of submitted reproductions failed to match reported results within reported margins.
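The fourth check can be sketched in a few lines of Python. The scores below are invented for illustration, and comparing the gap to the standard deviation is a crude heuristic we are assuming here, not a formal significance test.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) over repeated runs with different seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical accuracy (%) from five random seeds per method.
baseline_runs = [81.2, 80.7, 81.9, 80.9, 81.5]
proposed_runs = [81.8, 80.9, 82.4, 81.1, 82.0]

base_mean, base_std = summarize(baseline_runs)
prop_mean, prop_std = summarize(proposed_runs)

# Difference between the two means.
gap = prop_mean - base_mean

# Crude check: is the gap larger than the seed-to-seed spread of either method?
gap_exceeds_noise = gap > max(base_std, prop_std)

print(f"baseline {base_mean:.2f} ± {base_std:.2f}")
print(f"proposed {prop_mean:.2f} ± {prop_std:.2f}")
print(f"gap {gap:.2f}, exceeds seed noise: {gap_exceeds_noise}")

# What a single cherry-picked run would have suggested instead.
best_single_run_gap = max(proposed_runs) - base_mean
print(f"best single run vs baseline mean: {best_single_run_gap:.2f}")
```

In this toy data the mean improvement is smaller than the spread across seeds, while the best single run looks like a clear win. A table that reports one number per method cannot distinguish these two situations.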

Real Case — "Stochastic Parrots" and BLEU

The 2021 critique of large language models "On the Dangers of Stochastic Parrots" (Bender et al.) argued that fluent, plausible-sounding text is easily mistaken for understanding. BLEU shows how a metric can reward the same illusion: it measures surface n-gram overlap, not meaning, so a model can score high on BLEU while being factually wrong. Rising BLEU scores therefore show gains in surface fluency, not necessarily in factual accuracy. Spotting this kind of mismatch between metric and capability is one of the most important critical reading skills.
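The gap between n-gram overlap and meaning can be seen in a short Python sketch. It computes only clipped n-gram precision, the building block of BLEU; full BLEU combines several n-gram orders and adds a brevity penalty. The sentences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the reference.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference  = "the new bridge opened to traffic in 1932"
wrong_fact = "the new bridge opened to traffic in 1923"      # fluent, but the year is wrong
paraphrase = "traffic first crossed the new bridge in 1932"  # correct, but worded differently

wrong_unigram = clipped_precision(wrong_fact, reference, 1)
wrong_bigram = clipped_precision(wrong_fact, reference, 2)
para_unigram = clipped_precision(paraphrase, reference, 1)
para_bigram = clipped_precision(paraphrase, reference, 2)

print(f"wrong fact : unigram {wrong_unigram:.3f}, bigram {wrong_bigram:.3f}")
print(f"paraphrase : unigram {para_unigram:.3f}, bigram {para_bigram:.3f}")
```

The factually wrong sentence outscores the correct paraphrase on both measures, because overlap rewards matching word sequences rather than matching facts.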

Benchmark Saturation and Leaderboard Gaming

When models score above the human baseline on a benchmark, it does not necessarily mean AI is superhuman at the underlying task. It often means the benchmark is too narrow. The ImageNet benchmark was "solved" in 2015 when ResNet achieved lower top-5 error than the measured human rate — yet models trained on ImageNet fail on rotated, cropped, or adversarially perturbed images that humans handle effortlessly.

Leaderboard gaming (releasing many model variants and cherry-picking the best results for publication) is a persistent enough problem that many benchmark organizers keep test labels private and limit the number of submissions, to prevent overfitting to held-out test data.

When reading a paper, always ask: Is this a new benchmark the authors created specifically to show their method in the best light? If yes, look for a second set of results on standard public benchmarks.

Key Terms

Benchmark: A fixed dataset + evaluation protocol used to compare models. Only valid when the test set is not used during training.

Held-out test set: Data the model has never seen during training or validation. Results on the training or validation set are not reliable for comparison.

Overfitting to benchmark: Improving a metric without improving the underlying capability. Can happen through hyperparameter tuning on the test set or cherry-picking architectures.

Confidence interval: A range expressing uncertainty in a measurement. Results that lack confidence intervals or error bars are harder to evaluate critically.

Lesson 2 Quiz

Metrics, Benchmarks, and What They Actually Measure — 4 questions

A model achieves top-5 accuracy above the reported human baseline on ImageNet. This most accurately means:

Correct. Benchmark performance is specific to the benchmark's distribution and task framing. ImageNet-trained models fail on rotated or adversarially perturbed images humans handle easily.

Not quite. "Solving" a benchmark only means outperforming on that specific set and metric — not achieving general superhuman vision capability.

BLEU score is a metric primarily used for:

Correct. BLEU measures how much n-gram overlap exists between a model's output and reference human translations. It does not directly measure factual accuracy or meaning.

Not quite. BLEU (Bilingual Evaluation Understudy) is a translation metric that measures n-gram overlap with human reference translations.

When reading a paper's results table, which of these should raise the most scrutiny?

Correct. Custom benchmarks designed by the authors can be constructed — intentionally or not — in ways that favor the proposed method. Standard public benchmarks with independent test sets are more trustworthy.

Not quite. The highest-scrutiny flag is relying solely on a custom benchmark without any public benchmark comparison — authors may be selecting evaluation conditions that favor their method.

GLUE was replaced by SuperGLUE in 2019 primarily because:

Correct. Benchmark saturation — when models match or exceed human baselines — means the benchmark can no longer meaningfully rank new models. GLUE saturated within two years of release.

Not quite. The core issue was saturation: BERT-era models had exceeded the human baseline on GLUE, rendering it useless for distinguishing current model capabilities.

Lab 2 — Benchmark Analysis

Practice evaluating metrics and results tables critically

Your Task

Practice critically reading results tables and benchmarks. Ask your coach to walk you through a real paper's results section, quiz you on what makes a fair comparison, or help you identify red flags in benchmark design. Aim for at least three substantive exchanges.

Suggested start: "What are the red flags I should look for in an LLM benchmark comparison?" — or — "Walk me through how to read the main results table in a typical NLP paper."

Research Coach

Benchmarks & Metrics

Ready to practice benchmark analysis. I can walk you through reading results tables from real papers, explain when a metric is misleading, or quiz you on evaluation pitfalls. What would you like to explore?

Module 5 · Lesson 3

Spotting Hype, Limitations, and Missing Baselines

Press releases and Twitter threads describe the best-case interpretation of a paper. Critical readers learn to find what's missing — and why that matters as much as what's there.

What patterns of omission and framing should alert a careful reader that a claimed result deserves extra scrutiny?

In May 2018, Google CEO Sundar Pichai demonstrated Google Duplex at Google I/O — an AI system that called a hair salon and a restaurant and made appointments in natural conversation. The demonstration audio was flawless and the audience gasped. Subsequent reporting by The New York Times and others found that the demo calls had been carefully selected from a larger set; many calls required human operator intervention not shown on stage. There was no published paper, no test set, no metric. The limitations section was a press conference.

Red Flags in AI Papers and Announcements

Learning to distinguish genuine progress from well-packaged hype is one of the most valuable skills for anyone working in or adjacent to AI research. The following patterns, when present, warrant additional scrutiny.

⚑ No public test set or code. If results cannot be reproduced independently, they cannot be verified. The 2020 GPT-3 paper was released without public weights, though OpenAI provided API access. The 2022 Chinchilla paper by Hoffmann et al. at DeepMind similarly reported benchmark results without releasing model weights for independent evaluation.
⚑ Comparisons to weak or outdated baselines. Comparing a 2024 model to a 2020 baseline without including 2023 state-of-the-art is misleading. Look for the phrase "competitive baselines": when present, it signals that the authors chose the baselines deliberately, so check which ones and why.
⚑ Compute asymmetry. If the proposed model used 10x more compute than the baselines, the gain may reflect resources rather than a better algorithm. The 2019 Megatron-LM paper acknowledged this explicitly; many papers do not.
⚑ Cherry-picked examples in qualitative results. Example outputs shown in figures are always selected to look good. A figure showing four impressive image-generation outputs tells you nothing about failure rate. Look for failure case sections; their absence is itself informative.
⚑ Missing error analysis. Where does the model fail, and why? The 2021 DALL-E paper included a limitations section noting difficulty binding attributes to the correct objects. Papers without any failure mode discussion are rarely being fully honest about the method's scope.
⚑ Single-dataset evaluation. A result that holds on one dataset but has not been tested on others may be specific to the data distribution rather than the method. The 2022 ML Reproducibility Challenge confirmed this pattern repeatedly across NLP papers.
⚑ "Emergent abilities" framing without precise definition. The 2022 paper "Emergent Abilities of Large Language Models" (Wei et al.) prompted a significant debate. A 2023 follow-up by Schaeffer et al. argued that many "emergent" abilities were artifacts of the choice of metric: scoring a task with a nonlinear metric (like exact-match accuracy) while the model improves smoothly creates the illusion of a discontinuous jump.
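The metric-artifact argument in the last flag can be illustrated with a toy calculation. The sketch below assumes that each token of an answer is correct independently with the same probability, and that per-token accuracy rises smoothly as models scale; both are simplifying assumptions made for illustration, and the numbers are invented.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance of getting every token right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for six successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
answer_length = 10  # exact match requires all 10 tokens to be correct

# Pair each smooth per-token accuracy with the exact-match rate it implies.
curve = [(p, exact_match_rate(p, answer_length)) for p in per_token]

for p, em in curve:
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy climbs in even steps, yet exact match stays near zero for the first few models and then rises steeply. Plotted against scale, the second curve looks like a sudden jump even though the underlying improvement is gradual.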

How to Find the Limitations Section

NeurIPS has required a broader impact statement since 2020 and, since 2021, a checklist that asks authors whether they discuss limitations; many other venues now request explicit limitation statements. But limitations are sometimes buried or minimized. Search the paper for the word "limitation"; if it doesn't appear, the conclusion section typically contains the authors' most candid assessment. If neither contains honest limitations, that itself is a signal about the paper's quality.

The strongest papers in AI research tend to have the most detailed limitations sections. The 2021 paper "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (Lin et al.) devoted substantial space to discussing where its own benchmark could mislead; this level of self-critique is a marker of rigorous research culture.

Real Case — The Reproducibility Crisis in ML

A 2019 study by Dodge et al. ("Show Your Work: Improved Reporting of Experimental Results") found that many NLP results in top venues depended critically on hyperparameter tuning that was not reported. Independently reproducing these results required hundreds of GPU-hours the original papers did not mention. The Papers With Code Reproducibility Challenge, running annually since 2019, has systematically documented that approximately 25–30% of reproduced ML papers fail to replicate within reported margins when the original hyperparameters are not provided.
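One way to make an unreported tuning budget visible is to ask what score you would expect from the best of n randomly sampled hyperparameter configurations. The Python sketch below computes that from a set of validation scores, treating them as an empirical distribution sampled with replacement. The scores and the function name are invented, and we are assuming this calculation is in the spirit of the reporting the study argues for rather than reproducing its exact procedure.

```python
def expected_best(scores, n):
    """Expected maximum of n draws (with replacement) from the observed scores.

    With the N scores sorted ascending, the chance that the maximum of n draws is
    at most the i-th value is (i/N)**n, so the i-th value is the maximum with
    probability (i/N)**n - ((i-1)/N)**n.
    """
    ordered = sorted(scores)
    total = len(ordered)
    return sum(
        value * ((i / total) ** n - ((i - 1) / total) ** n)
        for i, value in enumerate(ordered, start=1)
    )

# Hypothetical validation accuracies from eight random hyperparameter configurations.
scores = [0.71, 0.74, 0.68, 0.80, 0.73, 0.69, 0.77, 0.72]

# Expected best score as a function of how many configurations were tried.
curve = {n: expected_best(scores, n) for n in (1, 2, 4, 8, 16)}

for n, value in curve.items():
    print(f"budget of {n:2d} configurations -> expected best {value:.3f}")
```

With a budget of one configuration the expected score is simply the mean; as the budget grows it approaches the single best run. A paper that reports only its best run, without saying how many configurations were tried, is reporting the far end of this curve.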

Key Terms

Reproducibility: The ability of independent researchers to obtain the same results using the same method and data. A hallmark of scientific validity.

Compute asymmetry: When the proposed method uses significantly more computational resources than the baselines it outperforms, making fair comparison difficult.

Cherry-picking: Showing only the best examples from a model's outputs without reporting the failure rate or systematic evaluation.

Emergent ability: A capability that appears suddenly at a certain model scale without being explicitly trained. The concept is contested: some apparent emergences are metric artifacts.

Lesson 3 Quiz

Spotting Hype, Limitations, and Missing Baselines — 4 questions

A paper proposes a new language model and shows results only on a single custom dataset the authors created. The most appropriate critical response is to:

Correct. Single custom dataset results cannot be trusted without corroboration on standard public benchmarks, since custom sets may be designed to favor the proposed method.

Not quite. Custom datasets alone are insufficient evidence. Standard public benchmarks allow independent comparison and reduce the risk of evaluation being tailored to the method.

The 2023 paper by Schaeffer et al. argued that many "emergent abilities" in large language models were:

Correct. Schaeffer et al. showed that when a linear metric is substituted for the nonlinear exact-match metric, the apparent "emergence" disappears — performance improves smoothly with scale.

Not quite. Schaeffer et al. argued that nonlinear metrics (like exact-match accuracy) applied to tasks where models improve smoothly create the illusion of sudden emergent jumps.

Which of the following is the strongest signal of a rigorous, honest research paper?

Correct. Detailed, honest limitations sections are a hallmark of rigorous research. Strong papers like TruthfulQA (Lin et al., 2021) devoted significant space to their own methodology's weaknesses.

Not quite. Cherry-picked examples and no failures across every benchmark are actually red flags. A detailed limitations section showing where the method fails is the strongest signal of honest reporting.

The Papers With Code Reproducibility Challenge has found that approximately what fraction of reproduced ML papers fail to replicate within reported margins when hyperparameters aren't provided?

Correct. The annual reproducibility challenge consistently finds that roughly 25–30% of papers fail to replicate when standard hyperparameters are not published alongside results.

Not quite. The Papers With Code Reproducibility Challenge found approximately 25–30% failure rate — a significant minority that underscores why reproducibility matters in ML research.

Lab 3 — Hype Detection Practice

Practice identifying red flags and missing information in AI research claims

Your Task

Practice your critical reading skills. Describe an AI paper or news announcement to your coach and ask for a hype-detection analysis. Or ask the coach to present you with a realistic paper abstract and quiz you on what's missing. Aim for at least three substantive exchanges.

Suggested start: "Here's an abstract from an AI paper — tell me what critical questions I should ask about it." — or — "What are the top three signs that an AI paper is overstating its results?"

Research Coach

Critical Reading

Let's practice spotting hype and missing information. You can share an abstract or AI headline with me and I'll help you identify what to scrutinize, or I can quiz you on red flag patterns from real papers. What would you like to work on?

Module 5 · Lesson 4

Following the Research Landscape

Staying current with AI research is a professional skill. The field moves fast enough that a six-month-old paper can already be superseded — knowing where to look and how to filter is as important as reading itself.

What are the most reliable tools, venues, and habits for tracking AI research developments without being overwhelmed?

In January 2023, arXiv's cs.LG (machine learning) category received over 6,000 new submissions, roughly 200 per day. In January 2019, the same category received around 1,500. No researcher reads every paper. Every working AI professional has a triage system. Building yours is part of becoming a researcher.

Where AI Research Appears

AI research reaches the public through a hierarchy of venues with different levels of peer review, speed, and prestige.

arXiv

Preprints, days after completion. Most major AI papers appear here before formal peer review. No peer review — quality varies enormously. Essential for staying current.

NeurIPS

Neural Information Processing Systems. Premier venue for machine learning research. Submissions reviewed by 2–5 reviewers. Acceptance rate typically 20–25%. Annual conference in December.

ICML

International Conference on Machine Learning. Peer-reviewed. Strong theory and applied ML. Comparable prestige to NeurIPS. Annual conference in July.

ICLR

International Conference on Learning Representations. Open peer review — reviews are public on OpenReview.net. Strong for deep learning, LLMs, representation learning. Annual conference in May.

ACL / EMNLP

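NLP-focused venues. Association for Computational Linguistics and Empirical Methods in NLP. Many landmark NLP papers appeared at these venues or their sister conferences: BERT was published at NAACL 2019, while GPT-2 was released as a technical report without formal peer review.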

Nature / Science

Interdisciplinary high-impact journals. DeepMind's AlphaFold 2 result (2021) and AlphaGo (2016) both appeared in Nature. Papers here reach non-specialist audiences but are less common for pure ML work.

Practical Tools for Staying Current

Papers With Code (paperswithcode.com): Links papers to their code implementations and shows state-of-the-art rankings by benchmark. Free. Updated daily. The benchmark leaderboards are the fastest way to see which methods currently lead on any given task.

Semantic Scholar: Academic search engine with citation graphs, influence scores, and alerts. Owned by the Allen Institute for AI. Setting alerts on key authors means you are notified when they publish.

Hugging Face Daily Papers: Community-curated selection of 5–10 significant arXiv papers per day. Lower noise than following arXiv directly. The upvote system surfaces papers with broader community interest.

Connected Papers: Visualizes the network of related work around a seed paper. Useful for understanding how a method fits into the broader literature, and for finding closely related papers that do not cite each other directly.

Twitter/X and Bluesky researcher accounts: Many AI researchers share preprints and commentary publicly. Following the authors of papers you find important gives you the informal discussion layer that peer review doesn't capture.

Real Case — AlphaFold 2 and Open Review

DeepMind's AlphaFold 2 was announced at CASP14 in November 2020 with a median GDT score of 92.4 — near experimental accuracy for protein structure prediction. The full paper appeared in Nature in July 2021. DeepMind then released the model weights and structure database for free via the European Bioinformatics Institute. By 2023, over 200 million protein structures were publicly available. This case illustrates that the most impactful AI results often come with substantial open-access releases — and that a CASP competition result preceded the formal paper by eight months.

Building a Reading Habit That Scales

No one can read every paper. The researchers who stay effectively current use triage strategies. A common approach: spend 15 minutes each morning scanning Hugging Face Daily Papers or Papers With Code highlights. First-pass read (abstract + intro) any paper that intersects your work. Full second-pass read only papers you need to implement or directly compete with. Deep third-pass read fewer than 10 papers per year.

Citation tracking is equally important. When you find a paper that is foundational to your work, set a Google Scholar or Semantic Scholar alert to be notified when it is cited. New papers that cite foundational work are often the most relevant to your specific area.

Reading groups — formal or informal — compound this effort. Two people doing first-pass reads on different papers and sharing summaries weekly doubles your coverage. Most AI research teams at companies like Google DeepMind, Anthropic, and Meta FAIR run internal weekly reading groups.

Key Terms

Preprint: A paper posted to arXiv or similar server before formal peer review. Not officially peer-reviewed but often the fastest way to see new results.

Peer review: Evaluation of a paper by independent experts before publication. Conference reviews (NeurIPS, ICML) are typically 2–5 reviewers with a single round of revision.

Citation alert: An automated notification (via Google Scholar or Semantic Scholar) when a specified paper receives a new citation. Useful for tracking how a foundational result is being extended.

State of the art (SOTA): The best-performing method on a given benchmark at a given point in time. SOTA changes frequently in fast-moving areas like LLMs.

Lesson 4 Quiz

Following the Research Landscape — 4 questions

Which of the following best describes what Papers With Code provides to researchers?

Correct. Papers With Code connects papers to their implementations and maintains benchmark leaderboards — making it the fastest way to see which methods currently lead on any task.

Not quite. Papers With Code is specifically known for linking papers to code implementations and maintaining benchmark leaderboards. Citation networks are Connected Papers; community curation is Hugging Face Daily Papers.

DeepMind's AlphaFold 2 protein structure prediction result was first publicly announced at:

Correct. AlphaFold 2's breakthrough performance (median GDT 92.4) was revealed at the CASP14 competition in November 2020. The full methodology paper appeared in Nature in July 2021.

Not quite. The result was announced at CASP14 (Critical Assessment of Protein Structure Prediction) in November 2020 — a competition, not a traditional conference — eight months before the Nature publication.

ICLR differs from NeurIPS and ICML primarily in that it:

Correct. ICLR pioneered open peer review on OpenReview.net — reviews, ratings, and author responses are all publicly visible, making the review process itself a research artifact.

Not quite. ICLR's distinguishing feature is open peer review on OpenReview.net — the review process, scores, and discussions are publicly visible for all submissions.

According to the lesson, approximately how many papers per day was arXiv's cs.LG category receiving in January 2023?

Correct. cs.LG received over 6,000 submissions in January 2023 — approximately 200 per day — up from around 1,500 for the same month in 2019, illustrating why triage strategies are essential.

Not quite. The lesson cited over 6,000 submissions to cs.LG in January 2023 — roughly 200 per day. This volume makes systematic triage strategies essential for any working researcher.

Lab 4 — Research Landscape Navigation

Build your personal system for staying current with AI research

Your Task

Work with your coach to design a research-tracking workflow that fits your goals. Ask for recommendations on which venues, tools, and habits make sense for your specific interest area in AI. Aim for at least three substantive exchanges.

Suggested start: "I'm interested in LLM safety research — what venues and tracking tools should I prioritize?" — or — "Help me design a 20-minute daily research reading habit."

Research Coach

Research Navigation

Let's build your research-tracking system. Tell me which area of AI you're most focused on, and I'll help you identify the right venues, tools, and habits to stay current without being overwhelmed. What's your primary area of interest?

Module 5 Test

Reading AI Research — 15 questions · 80% to pass

1. Which section of an ML paper would you read to understand which prior work the authors consider most closely related to theirs?

Correct. The Related Work section explicitly positions the paper among prior approaches and tells you who the authors see as predecessors or competitors.

Not quite. Related Work is where authors survey the landscape of prior methods — it's the section that tells you which earlier papers to read next.

2. In the three-pass reading method, what distinguishes the third pass from the second?

Correct. The third pass involves reconstructing the paper from scratch — re-deriving equations, questioning every assumption. It is reserved for papers you need to implement or build upon.

Not quite. The third pass is the deepest: attempting to reconstruct the paper's logic independently, re-derive equations, and identify every assumption. It takes several hours.

3. The "Attention Is All You Need" paper (Vaswani et al., 2017) introduced:

Correct. The 2017 paper introduced the Transformer — a sequence-to-sequence architecture based entirely on attention mechanisms, abandoning recurrent and convolutional layers.

Not quite. "Attention Is All You Need" introduced the Transformer architecture — based purely on attention, with no recurrence or convolutions.

4. The GLUE benchmark was replaced by SuperGLUE in 2019. The primary reason was:

Correct. Benchmark saturation — models exceeding the measured human baseline — rendered GLUE unable to meaningfully rank new models, necessitating SuperGLUE.

Not quite. Benchmark saturation was the core issue: BERT-era models exceeded the human baseline on GLUE, so it could no longer differentiate model quality.

5. Perplexity as an evaluation metric for language models indicates:

Correct. Perplexity measures how well a language model predicts a held-out test set: lower perplexity means better prediction. It is commonly reported in language model papers, including the GPT series.

Not quite. Perplexity measures how surprised the model is by test text — lower perplexity means the model predicts the test distribution more accurately.
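
To make the metric concrete, here is a minimal sketch that computes perplexity as the exponential of the average negative log-probability per test token. The function name and the probability values are illustrative assumptions, not figures from any paper.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities two models assign to the same four test tokens.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))  # about 2: like guessing among 2 options
print(perplexity(uncertain_model))  # about 10: like guessing among 10 options
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.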

6. When a results table shows a new method outperforming baselines, which factor should make you most skeptical of the comparison?

Correct. Compute asymmetry — when the proposed method has substantially more resources than baselines — means the gain may reflect resources rather than a better algorithm. This is a key red flag.

Not quite. Compute asymmetry is the key concern: if the new method uses 10x more compute, the improvement may reflect resources rather than algorithmic progress.

7. The BLEU score metric is primarily criticized because:

Correct. BLEU measures n-gram overlap — a surface form match — not semantic accuracy or factual correctness. A fluent but factually wrong output can score high on BLEU.

Not quite. BLEU's core limitation is that n-gram overlap doesn't capture meaning. A model can score high on BLEU while being factually inaccurate.
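
The limitation can be seen in a simplified sketch. The code below computes only clipped n-gram precision, which is one ingredient of BLEU. It omits the brevity penalty and the geometric mean over n-gram orders, so it is not a full BLEU implementation. The sentences and function names are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the treaty was signed in 1990 in paris"
wrong_fact = "the treaty was signed in 1890 in paris"          # fluent, wrong year
paraphrase = "paris hosted the 1990 signing of the agreement"  # right facts, new wording

for name, cand in [("wrong_fact", wrong_fact), ("paraphrase", paraphrase)]:
    p1 = clipped_precision(cand, reference, 1)
    p2 = clipped_precision(cand, reference, 2)
    print(f"{name}: unigram={p1:.3f} bigram={p2:.3f}")
```

In this toy case the factually wrong sentence gets much higher overlap than the accurate paraphrase, which is the surface-form bias the criticism refers to.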

8. NeurIPS requires authors to include which section, added as a requirement from 2021 onward?

Correct. NeurIPS added a Broader Impacts requirement to encourage authors to consider societal implications and honestly address limitations of their work.

Not quite. NeurIPS introduced the Broader Impacts section requirement — addressing potential societal effects and limitations — as a mandatory component of submissions.

9. Connected Papers is a tool primarily used for:

Correct. Connected Papers creates a visual citation network around a seed paper — useful for discovering related work you haven't encountered and understanding how a method fits into the research landscape.

Not quite. Connected Papers visualizes citation networks. Citation alerts are a Semantic Scholar / Google Scholar feature; code is on Papers With Code.

10. AlphaFold 2's breakthrough protein structure prediction result was first made public at:

Correct. AlphaFold 2's CASP14 performance (median GDT 92.4) was announced at the competition in November 2020. The full Nature paper followed in July 2021.

Not quite. The result was announced at CASP14 (the protein structure prediction competition) in November 2020, ahead of the Nature paper in July 2021.

11. A paper that shows only cherry-picked qualitative examples without reporting systematic evaluation statistics:

Correct. A figure with four impressive outputs has almost always been selected from many more. Without systematic evaluation, you have no information about the failure rate, which is essential for assessing the method.

Not quite. Cherry-picked examples are uninformative about typical performance and failure rate — they are always selected from the best outputs of many.

12. The Schaeffer et al. (2023) critique of emergent abilities argued that many apparent emergence phenomena were caused by:

Correct. When nonlinear or discontinuous metrics such as exact-match accuracy are replaced with linear or continuous ones, the apparent "emergence" disappears and performance improves smoothly, suggesting the jump was a metric artifact.

Not quite. Schaeffer et al. showed that nonlinear metrics applied to smoothly-improving models create the illusion of discontinuous jumps — the emergence was a measurement artifact.
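
A toy calculation shows how a nonlinear metric can create an apparent jump. The sketch assumes per-token errors are independent, so exact match on an answer of k tokens is the per-token accuracy raised to the power k. The accuracies and answer length are hypothetical, not data from the paper.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for five successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90]
answer_length = 10  # assumed number of tokens in the target answer

for p in per_token:
    em = exact_match_rate(p, answer_length)
    print(f"per-token {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy rises in even steps, yet exact match stays near zero for the smaller models and then climbs steeply. Under these assumptions the underlying capability is the same; only the choice of metric changes.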

13. ICLR's open peer review on OpenReview.net means that:

Correct. OpenReview.net publishes all submissions, their reviews, ratings, and author rebuttals publicly — making the review process itself a transparent research artifact.

Not quite. ICLR's open review means all submissions, reviews, scores, and author responses are publicly visible on OpenReview.net — even for rejected papers.

14. The Papers With Code Reproducibility Challenge has found that roughly what percentage of reproduced ML papers fail to replicate within reported margins?

Correct. The annual challenge consistently finds approximately 25–30% of papers fail to replicate — a substantial fraction that makes code release and hyperparameter reporting essential.

Not quite. The reproducibility challenge finds roughly 25–30% failure rate — underscoring the importance of code release and complete experimental reporting.

15. Which of the following is the most reliable daily triage tool for staying current with important AI research without reading all of arXiv directly?

Correct. Hugging Face Daily Papers and Papers With Code provide community-curated highlights that surface the most significant papers, so you do not have to scan all 200+ daily cs.LG submissions on arXiv.

Not quite. At 200 papers/day, reading all of cs.LG is impossible. Hugging Face Daily Papers and Papers With Code highlights curate the most significant 5–10 papers — a practical daily triage strategy.