What's Coming Next · Introduction

We Have Been Here Before, and We Have Always Underestimated It

Every era has its transformative technology. Every era also produces confident predictions about what that technology cannot do.

In September 1878, the New York Sun ran a skeptical editorial about Thomas Edison's claims for electric light. The piece argued that practical incandescent illumination was "sheer nonsense" and that gas companies had nothing to worry about. Edison filed for the patent on his working bulb in November 1879, little more than a year later. The pattern is not confined to electricity: in 1943, IBM chairman Thomas Watson reportedly said the world market for computers was perhaps five machines. In 1981, Bill Gates reportedly said 640 kilobytes of memory ought to be enough for anyone, a remark he has since denied making. Smart people, close to the technology, repeatedly failed to see the trajectory they were standing on.

The same failure mode is operating right now around artificial intelligence. Between 2012 and 2023, AI systems went from barely recognizing cats in photos to writing legal briefs, generating photorealistic images, scoring around the 90th percentile on the bar exam, and predicting protein structures that had stumped biochemists for fifty years. Each of those milestones was declared impossible or "decades away" by credentialed experts just before it happened. The capability curve has not leveled off. Knowing how to read it (what counts as real progress, what is hype, and what mechanisms actually drive improvement) is the core skill this course builds.

This course, What's Coming Next, will not tell you which specific products will exist in 2030. Nobody knows that. What it will give you is a set of durable frameworks: how to evaluate benchmark claims, how to distinguish scaling gains from architectural breakthroughs, how to spot the difference between a genuine capability jump and a well-funded press release. Four modules, starting right here with how to read progress itself.

If you finish every module, here's who you become:

  • You'll understand why credentialed experts repeatedly failed to see AI's trajectory, and how to avoid the same pattern yourself.
  • You'll be able to distinguish a genuine capability jump from a well-funded press release using concrete signal-reading frameworks.
  • You'll know how the research pipeline actually works, from academic paper to shipped product, and how long each stage realistically takes.
  • When a major AI announcement drops, you'll know exactly which questions to ask before updating your view of what the technology can do.
  • You'll understand what agentic AI means in practice: how the shift from tools to actors changes risk, opportunity, and your own decisions.
  • You'll read infrastructure investments, geopolitical competition, and hardware bets as leading indicators of where the frontier is actually heading.
  • You'll engage with AI progress as an informed actor, not a passive observer swept along by each news cycle.
What's Coming Next · Module 1 · Lesson 1

Benchmarks, Curves, and What They Actually Mean

The language of AI progress is full of numbers, but numbers without context mislead as often as they inform.
When a headline says an AI "surpassed human performance," what question should you ask first?

On December 10, 2015, a paper from Microsoft Research announced that its ResNet model had achieved a 3.57% top-5 error rate on the ImageNet Large Scale Visual Recognition Challenge, beating the commonly cited human baseline of 5.1%. Tech headlines declared that AI could now "see better than humans." What the headlines did not mention: the human baseline had been measured on a random sample of 1,500 images by a single annotator working quickly. When researchers ran a more careful human test, trained annotators scored around 3.5%, essentially matching the machine. The AI had not surpassed human vision. It had matched a specific human, on a specific dataset, on a specific task. The distinction matters enormously.

That episode established a template repeated constantly since: a real capability advance gets wrapped in a misleading comparison, the comparison travels faster than the correction, and policy makers, investors, and the public build mental models on flawed foundations. Reading AI progress well starts with learning to disaggregate the claim from the benchmark from the underlying capability.

What a Benchmark Actually Measures

A benchmark is a standardized test: a fixed dataset with a scoring rule. It measures one thing: performance on that dataset under those rules. It does not measure general capability, robustness, real-world usefulness, or what the system will do on inputs outside the test set. Every AI benchmark has these limits, whether researchers acknowledge them or not.

The ImageNet challenge dataset (ILSVRC), first run in 2010, contains about 1.2 million training images across 1,000 categories. Achieving low error on it is genuinely impressive, but it tells you almost nothing about how well a system handles medical images, satellite photos, handwritten documents, or any image type underrepresented in the dataset. Likewise, when GPT-4 was reported to score around the 90th percentile on the Uniform Bar Examination in March 2023, that was a real achievement. It also tells you nothing about whether the model can reliably navigate a real client intake, maintain confidentiality across sessions, or recognize when a question is outside its competence.

Goodhart's Law is the central hazard: once a benchmark becomes the target, it ceases to be a good measure. AI labs optimize their training pipelines against benchmark datasets, sometimes inadvertently, sometimes deliberately. A model that scores 95% on a reading comprehension benchmark may score 60% on questions that test the same skill but are phrased differently. The benchmark has been solved; the underlying capability has not necessarily been acquired.

Critical Question

Every time you see "AI achieves human-level performance," ask: human-level on what specific task, measured by what specific test, compared against which humans doing the task under what conditions? All four answers change the meaning of the claim.

The Scaling Story: What the Curves Show

In 2020, OpenAI researchers published "Scaling Laws for Neural Language Models," demonstrating that language model performance improved predictably as researchers increased three variables: the number of model parameters, the volume of training data, and the amount of compute used for training. The relationship followed a power law: each order-of-magnitude increase in compute reduced the loss by a roughly fixed percentage.
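To make the "fixed percentage per order of magnitude" property concrete, here is a minimal sketch of a power-law loss curve. The functional shape follows the kind of relationship described above, but the exponent `ALPHA` and reference compute `C_REF` are illustrative assumptions chosen for readability, not fitted values from the paper.

```python
ALPHA = 0.05  # illustrative exponent (assumption, not a fitted value)
C_REF = 1.0   # hypothetical reference compute, in arbitrary units


def loss(compute: float, alpha: float = ALPHA, c_ref: float = C_REF) -> float:
    """Power-law loss: L(C) = (c_ref / C) ** alpha."""
    return (c_ref / compute) ** alpha


def relative_improvement(compute: float, factor: float = 10.0) -> float:
    """Fractional drop in loss when compute is multiplied by `factor`."""
    before = loss(compute)
    after = loss(compute * factor)
    return (before - after) / before


if __name__ == "__main__":
    # The gain per 10x is the same at every starting point on the curve.
    for c in (1e3, 1e6, 1e9):
        print(f"compute={c:.0e}  loss={loss(c):.4f}  "
              f"gain per 10x={relative_improvement(c):.2%}")
```

Under these assumed constants, every tenfold increase in compute trims the loss by the same fraction regardless of where on the curve you start, which is what makes the curve useful for planning and also why absolute gains shrink as the loss gets smaller.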

This was a genuinely important finding because it suggested that progress did not depend on new algorithmic breakthroughs: you could simply build bigger and predict roughly how much better your system would get. GPT-3 (175 billion parameters, released June 2020) and GPT-4 (architecture not fully disclosed, released March 2023) both followed this logic. So did Google's PaLM (540 billion parameters, April 2022) and Anthropic's Claude series.

But scaling laws have limits. The curves measure a specific loss metric (how well the model predicts the next token), which does not map cleanly onto all downstream tasks. A model can improve its training loss while making the same types of factual errors. More troublingly, some capabilities appear to emerge discontinuously: a model at one scale fails completely at a task, then at a larger scale succeeds reliably. These emergent capabilities are difficult to predict from the scaling curves alone, which complicates forecasting.

Scaling Law: A mathematical relationship showing that model performance improves predictably as compute, data, or parameter count increases, following a power-law function documented by Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla").
Emergent Capability: A task where performance is near-zero at smaller scales and jumps sharply at larger scales, apparently unpredictably. Wei et al. (2022) documented dozens of such transitions in the BIG-Bench evaluation suite.
Benchmark Saturation: When AI systems achieve near-ceiling scores on a benchmark, making it useless for distinguishing further progress. ImageNet saturated around 2021; GLUE saturated in 2019 and was replaced by SuperGLUE, which also saturated within two years.

Reading Progress Honestly

Two failure modes dominate public discourse about AI progress: uncritical hype and reflexive dismissal. Both produce equally wrong predictions. The hype failure mode latches onto every benchmark record and extrapolates to general intelligence. The dismissal failure mode latches onto every chatbot error and concludes the technology is fundamentally limited.

The evidence-based approach requires holding two things simultaneously: real, documented, substantial progress has occurred across many domains since 2012; and that progress has been uneven, benchmark-sensitive, and repeatedly mischaracterized. AlphaFold 2, released by DeepMind in July 2021, largely solved the protein structure prediction problem that had resisted biology for fifty years; that is a genuine scientific breakthrough with real consequences for drug discovery. At the same time, large language models routinely fail at simple counting tasks, spatial reasoning, and multi-step arithmetic that a ten-year-old handles without difficulty. Both facts are true.

The frameworks in the rest of this module (how compute translates to capability, how to read research papers about AI, and how to evaluate claims about what's coming) all depend on starting from this honest baseline: progress is real, substantial, and uneven, and reading it well requires specificity rather than either enthusiasm or cynicism.

Module Framing

This module builds four skills: reading benchmark claims critically (L1), understanding what drives capability jumps (L2), evaluating AI research papers and announcements (L3), and applying these frameworks to specific near-term AI developments (L4). Each lesson includes a hands-on lab to practice the skill against real examples.

Key sources for this lesson: Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge" (2015); He et al., "Deep Residual Learning for Image Recognition" (2015); Kaplan et al., "Scaling Laws for Neural Language Models" (2020); Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022); Wei et al., "Emergent Abilities of Large Language Models" (2022).

Lesson 1 Quiz: Benchmarks & Curves

Five questions · Select the best answer for each
1. In December 2015, headlines claimed AI had surpassed human vision on ImageNet. What was the primary flaw in that comparison?
Correct. The human baseline used a single annotator working quickly on 1,500 images. Careful retesting with trained annotators produced human error rates matching the AI; the claim of "surpassing" was an artifact of the measurement, not the capability.
Not quite. The core problem was the human baseline methodology: a single rushed annotator vs. a carefully optimized AI test. Retesting showed trained humans matched the machine.
2. What does Goodhart's Law predict will happen when a benchmark becomes the primary optimization target for AI development?
Correct. Goodhart's Law: once a measure becomes a target, it ceases to be a good measure. Models optimized against a benchmark solve the benchmark's particular patterns rather than the general skill the benchmark was meant to proxy.
Goodhart's Law specifically predicts the opposite of broad transfer. Optimizing against a target decouples the score from the underlying capability, so the benchmark stops measuring what it was designed to measure.
3. The 2020 OpenAI scaling laws paper demonstrated that language model performance improved predictably as researchers increased which three variables?
Correct. Kaplan et al. (2020) showed power-law relationships between performance and each of parameter count, dataset size, and compute, with each scaling roughly independently.
The three variables in Kaplan et al. (2020) were model parameters, training data volume, and training compute. Architecture complexity and fine-tuning were not the primary variables in the scaling laws framework.
4. What is "benchmark saturation" and why does it matter for tracking AI progress?
Correct. GLUE saturated in 2019, SuperGLUE in 2021, and ImageNet effectively saturated by the early 2020s. Each saturation forced researchers to create harder benchmarks, but each new benchmark also risks the same optimization pressure that caused saturation in the first place.
Saturation specifically means AI scores approach the ceiling of what the benchmark can measure, not issues of access or complexity. Once saturated, a benchmark can no longer distinguish between improvements in the underlying capability.
5. AlphaFold 2 (2021) is cited as a genuine scientific breakthrough rather than benchmark performance. What distinguishes it from typical AI benchmark achievements?
Correct. AlphaFold 2 solved the protein folding problem, a challenge that had defined structural biology for decades, and its predictions have been used directly in drug discovery research. This is distinct from scoring well on an artificial test; the capability has genuine scientific utility.
AlphaFold 2's significance is not primarily about the benchmark margin. It solved a real scientific problem with direct downstream consequences. The distinction from typical AI benchmarks is real-world applicability and genuine domain impact.

Lab 1: Dissecting an AI Benchmark Claim

Practice interrogating a real AI performance headline with the frameworks from Lesson 1

Your Task

You will interrogate AI benchmark claims the way a careful analyst would. The AI assistant has been primed with the Lesson 1 frameworks. Use it to work through the specific questions below, or bring your own benchmark claim to examine.

Complete at least 3 exchanges to mark this lab done. Push back, ask follow-ups, and try to find the limits of what a benchmark claim actually tells you.

Starter prompt: "A 2023 headline read: 'GPT-4 passes the bar exam at the 90th percentile: lawyers should be worried.' Walk me through what this claim actually tells us and what it doesn't, using the benchmark analysis framework."
What's Coming Next · Module 1 · Lesson 2

What Actually Drives Capability Jumps

Progress in AI does not come from a single source. Understanding the levers helps you predict which improvements are likely and which are wishful thinking.
When an AI system makes a dramatic leap in capability, what three mechanisms are most likely driving it, and how do you tell them apart?

On November 30, 2022, OpenAI released ChatGPT. Within five days it had one million users; within two months, one hundred million, the fastest consumer application ramp in history up to that point. Analysts scrambled to explain the leap from GPT-3 (which had been publicly available since 2021 and had not caused comparable disruption) to this new system. The common explanation was scale: the new model must be much larger. That explanation was wrong, or at least incomplete. The dominant factor was reinforcement learning from human feedback (RLHF), a training technique developed at OpenAI, DeepMind, and Anthropic between 2017 and 2022 that aligns model outputs with human preferences. The capability jump was primarily an alignment and interface improvement, not a raw compute increase. Understanding which lever moved is the central skill.

The Three Primary Levers

AI capability improvements come from three separable sources, and distinguishing them matters for forecasting:

Lever 1: Compute & Scale. More parameters, more training data, more GPU-hours. This lever has driven the majority of headline progress since 2012. It is the most predictable lever, because scaling laws let researchers estimate gains in advance. It is also the most expensive and subject to diminishing returns. The 2022 Chinchilla paper (Hoffmann et al.) showed that most large models had been undertrained relative to their parameter count: they had scaled parameters without proportionally scaling data, leaving performance on the table.

Lever 2: Algorithmic Improvement. New architectures, training techniques, or inference methods that increase efficiency independent of raw scale. The transformer architecture (Vaswani et al., 2017) was an algorithmic breakthrough that enabled the entire modern LLM era. RLHF was an algorithmic breakthrough. Mixture-of-experts architectures (used in Mixtral and reportedly in GPT-4) let a model hold many more parameters while activating only a fraction of them for each token. Algorithmic improvements are less predictable than scaling, since they arrive irregularly, but their effects can be dramatic and they reduce the compute cost of reaching a given capability level.

Lever 3: Data Quality & Curation. The content and structure of training data, not just its volume. Phi-1 (1.3 billion parameters, June 2023) and Phi-2 (2.7 billion parameters, December 2023), released by Microsoft Research, demonstrated that small models trained on carefully curated "textbook quality" data could match models ten times their size on several benchmarks. The implication: much of the apparent scale requirement in earlier models was compensating for noisy, low-quality training data. Data curation is arguably the least-discussed lever and may have the largest remaining headroom.

Forecasting Rule

When evaluating a claimed AI advance, identify which lever drove it. Scale advances are predictable and continuous. Algorithmic advances are irregular and potentially large. Data quality advances are underappreciated and may compound with both of the above.

Interaction Effects and the Chinchilla Finding

The three levers interact. The Chinchilla paper's central finding was that compute-optimal training requires scaling parameters and data in roughly equal proportion: a model trained with 10× more parameters but the same data is less efficient than one with about 3× more parameters and 3× more data, at a similar compute cost. This reframing fed directly into Llama 2 (Meta, July 2023) and Mistral 7B (September 2023), both of which achieved GPT-3.5-class performance at a fraction of the parameter count by training smaller models on far more data.
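The trade-off can be sketched with two widely used rules of thumb: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio associated with Chinchilla is roughly 20 training tokens per parameter. Both are approximations and are treated as assumptions here; the function names are hypothetical.

```python
import math
from typing import Tuple

FLOPS_PER_PARAM_TOKEN = 6.0  # rule of thumb (assumption): C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rule of thumb (assumption): D ~ 20 * N


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(flops: float) -> Tuple[float, float]:
    """Split a budget so that D = 20 * N: C = 6 * N * 20 * N, so N = sqrt(C / 120)."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


if __name__ == "__main__":
    # Chinchilla itself: 70B parameters trained on 1.4T tokens.
    budget = training_flops(70e9, 1.4e12)
    n, d = compute_optimal(budget)
    print(f"budget={budget:.2e} FLOPs -> params={n:.2e}, tokens={d:.2e}")

    # With 10x the budget, parameters and tokens each grow by sqrt(10), about 3.16x.
    n10, d10 = compute_optimal(10 * budget)
    print(f"10x budget -> params x{n10 / n:.2f}, tokens x{d10 / d:.2f}")
```

Under these assumptions a tenfold budget increase is best spent by growing both parameters and data by about 3.16×, rather than putting all of it into parameters.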

The practical implication: capability improvements do not require ever-larger models. The field is simultaneously scaling up (GPT-4, Gemini Ultra, Claude 3 Opus) and scaling down while maintaining performance (Llama 2 13B, Phi-2, Gemma 7B). Both trends are real and both carry forecasting implications: the former about what frontier systems can do, the latter about how widely capable AI can be deployed.

A second interaction effect: inference-time compute is emerging as a fourth lever. Chain-of-thought prompting (Wei et al., 2022), which asks models to reason step by step before answering, substantially improves performance on multi-step problems, not by changing the model but by changing how much compute is used at inference time. OpenAI's o1 model (September 2024) extended this into a formal test-time compute scaling paradigm, in which the model spends additional computation on reasoning steps before answering. This suggests the scaling story is more complex than "bigger training = better model."
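One simple way to see inference compute acting as a lever is majority voting over several sampled answers. The sketch below is an illustration only: it is not a description of how o1 works (those details are not public), and `noisy_solver` is a hypothetical stand-in for a model that gives the right answer with some fixed probability.

```python
import random
from collections import Counter


def noisy_solver(rng: random.Random, correct: str = "42",
                 p_correct: float = 0.6) -> str:
    """Hypothetical stand-in for one sampled model answer."""
    if rng.random() < p_correct:
        return correct
    # Wrong answers are spread across several distinct values.
    return rng.choice(["41", "43", "44"])


def majority_vote(rng: random.Random, n_samples: int) -> str:
    """Sample n answers and return the most common one."""
    answers = [noisy_solver(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, n_samples) == "42" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(f"samples per question={n:2d}  accuracy={accuracy(n):.3f}")
```

The simulated model is identical in every run; only the number of samples per question changes. Accuracy rises with the sample count because the correct answer is the single most likely one, which is the basic intuition behind spending more compute at inference time.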

RLHF: Reinforcement Learning from Human Feedback. A training technique where human raters evaluate model outputs and those preferences are used to fine-tune the model. Central to ChatGPT, Claude, and most deployed assistants post-2022.
Compute-Optimal Training: Training configuration that maximizes performance for a given compute budget by balancing parameter count and data volume, as specified by the Chinchilla scaling laws (Hoffmann et al., 2022).
Test-Time Compute: Compute used during inference (answering a query) rather than training. Chain-of-thought reasoning and systems like OpenAI o1 use more inference compute to improve output quality without retraining.

Why This Matters for Forecasting

Each lever has a different forecasting signature. Compute scaling is expensive and slowing at the frontier: training runs for GPT-4-class models reportedly cost over $100 million, and the marginal returns per dollar are decreasing. If compute were the only lever, the pace of capability progress would be determined almost entirely by how much capital AI labs can deploy. But algorithmic improvement and data curation are not capital-constrained in the same way: a small team with a good idea can publish something that shifts the trajectory.

The honest forecaster's position in 2024 is: compute scaling continues but is increasingly expensive; algorithmic improvements are arriving faster than most predicted (attention mechanisms, RLHF, mixture-of-experts, chain-of-thought, test-time scaling all emerged within a decade); data quality improvements have significant remaining headroom. Taken together, these suggest continued capability progress at a pace that is likely to remain faster than the public expects, though no single lever is guaranteed to remain productive indefinitely.

Key sources: Vaswani et al., "Attention Is All You Need" (2017); Stiennon et al., "Learning to Summarize from Human Feedback" (2020); Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022, the Chinchilla paper); Gunasekar et al., "Textbooks Are All You Need" (2023, Phi-1); Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022).

Lesson 2 Quiz: Drivers of Capability

Five questions · Select the best answer for each
1. ChatGPT's dramatic public impact compared to GPT-3 was primarily attributable to which factor?
Correct. RLHF was the primary differentiator: it made the model's outputs usable and helpful in a way raw GPT-3 was not, without necessarily requiring a proportional scale increase. The interface improvement was also significant.
The dominant technical factor was RLHF (Reinforcement Learning from Human Feedback), which shaped outputs to match human preferences. This was an algorithmic/training technique improvement, not primarily a scale increase.
2. What was the central finding of the 2022 Chinchilla paper by Hoffmann et al.?
Correct. Chinchilla showed that previous models like GPT-3 and Gopher were "parameter-heavy and data-light": they would have performed better with fewer parameters and more training tokens. This reframed how the field thought about training efficiency.
The Chinchilla finding was specifically that prior large models had been trained suboptimally, scaling parameters without proportionally scaling data. Optimal training requires roughly 20 training tokens per parameter.
3. Microsoft's Phi-1 and Phi-2 models (2023) demonstrated which important principle about AI capability?
Correct. Phi-1 (1.3B parameters) and Phi-2 (2.7B parameters) trained on "textbook quality" curated data matched models 10× their size on several benchmarks, demonstrating that much of the apparent compute requirement in prior models was compensating for noisy data.
The Phi models specifically demonstrated the data quality lever: carefully curated training data allowed very small models to punch well above their weight class. This suggested significant remaining headroom in data curation as a capability driver.
4. "Test-time compute scaling," as exemplified by OpenAI's o1 model, refers to what?
Correct. Test-time compute scaling lets models spend more computation on hard problems at inference time, searching over reasoning chains rather than producing a single answer immediately. This is a fourth lever distinct from training scale, architecture, and data quality.
Test-time compute is specifically about inference time: how much compute is used when answering a query, not during training. The o1 model explicitly extended chain-of-thought reasoning into a formal test-time scaling paradigm.
5. Which of the three primary capability levers is considered most predictable for forecasting purposes?
Correct. Scaling laws provide a quantitative relationship between compute budget and expected performance improvement, making compute scaling the most predictable of the three levers. Algorithmic improvements arrive irregularly and are harder to forecast in advance.
Compute scaling, governed by scaling laws, is the most predictable lever. Algorithmic improvements (like the transformer, RLHF, or chain-of-thought) arrive unpredictably. Data quality improvements are also irregular and difficult to forecast from first principles.

Lab 2: Identifying the Lever Behind an AI Advance

Practice diagnosing which capability driver (scale, algorithms, or data) is responsible for a given AI improvement

Your Task

For each AI development below, identify which lever (or combination) drove the improvement and what that implies for future progress. Discuss with the assistant to sharpen your analysis.

Complete at least 3 exchanges to mark this lab done.

Starter prompt: "Meta's Llama 2 (2023) achieved GPT-3.5-class performance at roughly 13 billion parameters, dramatically smaller than GPT-3's 175 billion. Which capability lever primarily explains this gap, and what does it suggest about near-term AI deployment trends?"
What's Coming Next · Module 1 · Lesson 3

Reading AI Research and Announcements Critically

A press release and a peer-reviewed paper are not the same thing, and even peer-reviewed papers require careful reading.
What are the five questions a careful reader asks before accepting an AI research claim?

On May 10, 2023, Google announced Med-PaLM 2, a medical AI model that had scored 86.5% on USMLE-style (United States Medical Licensing Examination) questions, well above the commonly cited passing threshold of about 60% and within the range of expert physician performance. News coverage declared that AI was approaching doctor-level medical knowledge. What the coverage typically omitted: these questions test recall and reasoning about textbook cases, not the ability to take a patient history, examine a patient, manage uncertainty across a relationship spanning years, or navigate the social complexity of delivering a difficult diagnosis. The benchmark was real. The generalization from the benchmark was not. No lie was told; the inference was simply unsupported by the evidence presented.

The Five Critical Reading Questions

These five questions apply to AI research papers, blog posts, press releases, and news articles equally:

1. What exactly was measured? Get specific. Not "medical knowledge" but the USMLE multiple-choice subset. Not "human-level reasoning" but performance on BIG-Bench Hard at a specific temperature setting. The task specificity almost always narrows the claim significantly.

2. Who is the comparison baseline, and how was it measured? As in the ImageNet case, the human (or prior model) baseline is frequently measured under different conditions than the AI system. If the paper doesn't describe how the baseline was produced, treat the comparison with caution.

3. Was there test set contamination? Large language models are trained on internet text, which may include benchmarks and their answer keys. If a model's training data includes the test set it is being evaluated on, the score is invalid. Contamination is difficult to fully rule out and is frequently under-discussed in papers. The 2023 paper "Detecting Pretraining Data from Large Language Models" (Shi et al.) proposed a method for detecting whether a text appeared in a model's pretraining data and used it to flag contaminated benchmark examples.

4. Who funded the research, and do they have a stake in the result? This does not mean industry research is invalid; much of the most important AI research comes from labs with commercial interests. But it should affect your prior. A paper from OpenAI showing GPT-4 outperforms competitors deserves the same skepticism you would apply to a pharmaceutical company's trial of its own drug.

5. Has it been independently replicated? Many headline AI results have not been independently replicated at the time of announcement. The peer review process in machine learning is often post-hoc: papers appear on arXiv before review, and high-profile results at major conferences have been retracted. Independent replication is the strongest evidence that a result is real.
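The contamination question (item 3 above) can be probed, very crudely, by checking for verbatim n-gram overlap between a benchmark item and a training document. The sketch below is a hypothetical heuristic for illustration, not the method from Shi et al.: it only catches near-verbatim copies, and the n-gram size and example strings are assumptions.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lowercase whitespace tokens in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, training_doc: str, n: int = 5) -> float:
    """Share of the item's n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical benchmark question and two hypothetical training documents.
    question = "Which planet in the solar system has the most moons"
    leaked = "quiz dump: which planet in the solar system has the most moons answer saturn"
    clean = "Saturn is a gas giant with a prominent ring system"

    print("leaked doc overlap:", overlap_fraction(question, leaked))
    print("clean doc overlap: ", overlap_fraction(question, clean))
```

A high overlap is a red flag worth investigating; a low overlap proves little, since paraphrased or translated copies of a test item would pass this check untouched.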

Real Example: Gemini Ultra (December 2023)

Google's announcement of Gemini Ultra claimed it was the first model to surpass human expert performance on MMLU (Massive Multitask Language Understanding). The claim was technically accurate (Gemini Ultra scored 90.0% vs. the 89.8% human expert baseline), but the margin was within noise, and the human baseline (from 2021) had been criticized for being measured under conditions favorable to the AI comparison. Independent analysis by researchers at Stanford and MIT subsequently found performance more variable across question types than the announcement suggested.
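Whether a 0.2-point margin is "within noise" can be checked with a back-of-the-envelope binomial standard error. This sketch assumes the score is a simple fraction of independently answered questions and that the test set has about 14,000 questions (an assumed round figure); it also ignores the uncertainty in the human baseline itself, which would make the comparison even less conclusive.

```python
import math

N_QUESTIONS = 14_000  # assumed approximate size of the test set


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def margin_in_standard_errors(model_acc: float, baseline_acc: float,
                              n_questions: int) -> float:
    """How many standard errors separate the model score from the baseline."""
    return (model_acc - baseline_acc) / standard_error(model_acc, n_questions)


if __name__ == "__main__":
    se = standard_error(0.900, N_QUESTIONS)
    z = margin_in_standard_errors(0.900, 0.898, N_QUESTIONS)
    print(f"standard error: {se * 100:.2f} points")
    print(f"95% interval half-width: {1.96 * se * 100:.2f} points")
    print(f"reported margin: 0.20 points ({z:.2f} standard errors)")
```

Under these assumptions the reported margin is smaller than one standard error of the model's own score, so sampling variation alone could account for it.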

The Anatomy of an AI Paper

Machine learning papers follow a recognizable structure that, once understood, lets you extract the essential information quickly. The abstract and introduction describe the claimed contribution. The methods section describes what was built and how. The experiments section is where the actual evidence lives, and where careful readers focus most attention.

In the experiments section, look for: the specific benchmarks used and whether they are well-validated; the ablation studies (tests that remove one component at a time to establish what each contributes); the failure modes and limitations section (often in an appendix and often understated); and the comparison models (whether comparisons use the same compute budget and whether implementations are from the original authors or reimplemented).

A useful heuristic, attributed to Yann LeCun (Chief AI Scientist at Meta): "If a paper doesn't have a failure analysis, treat the results with suspicion." Real systems fail in specific, diagnosable ways. A paper that only shows success cases is either cherry-picking or has not been stress-tested.

For press releases and blog posts, the additional filter is: what are they not saying? A company releasing a model will highlight the benchmarks it performs well on and omit those where it underperforms. Reading competitor announcements is sometimes more informative than reading a company's own, since competitors have an incentive to surface the genuine weaknesses.

Test Set Contamination: When a model's training data includes examples from the benchmark it is later evaluated on, inflating scores. Increasingly concerning as LLMs are trained on large web crawls that may include benchmark datasets and answer keys.
Ablation Study: An experiment that removes one component of a system to measure its contribution. Good papers include ablations to show that the claimed innovation, not a confound, drives the result.
MMLU: Massive Multitask Language Understanding. A 57-subject multiple-choice benchmark (Hendrycks et al., 2020) covering topics from elementary math to professional law and medicine. Now largely saturated by frontier models.

The Announcement Cycle

AI announcements follow a recognizable cycle in the current environment: model released β†’ benchmark numbers published β†’ tech press coverage β†’ broader media coverage β†’ policy response. At each stage, precision decreases. The benchmark numbers in the original paper are usually accurate (though subject to the caveats above). By the time the result reaches a news article, the context is often stripped. By the time it influences policy, the original paper may be months old and partially superseded.

The corrective is to maintain a short list of primary sources: arXiv (for preprints), the original lab blogs, and a small set of researchers who have demonstrated ability to read papers carefully and report on them honestly. The AI research community has several such people: Andrej Karpathy, Lilian Weng, Percy Liang, and others have built reputations for technical accuracy. Following them rather than general tech media substantially improves signal quality.

Key sources: Singhal et al., "Towards Expert-Level Medical Question Answering with Large Language Models" (Med-PaLM 2, 2023); Hendrycks et al., "Measuring Massive Multitask Language Understanding" (MMLU, 2020); Shi et al., "Detecting Pretraining Data from Large Language Models" (2023); Gemini Team, "Gemini: A Family of Highly Capable Multimodal Models" (2023).

Lesson 3 Quiz: Critical Reading

Five questions · Select the best answer for each
1. Med-PaLM 2 scored 86.5% on the USMLE in 2023. What is the most significant limitation of inferring from this that AI is approaching doctor-level medical capability?
Correct. Clinical medicine involves history-taking, physical examination, managing uncertainty across long-term relationships, and complex social judgment, none of which is tested by USMLE multiple-choice. The benchmark measures a real subset of medical knowledge, but the inference to general medical capability is unsupported.
The core problem is the gap between what the USMLE tests (textbook recall and reasoning) and what clinical medicine requires (history-taking, examination, long-term management, social judgment). The financial interest concern is real but secondary to this conceptual issue.
2. What is "test set contamination" and why is it particularly problematic for evaluating large language models?
Correct. LLMs are trained on large web crawls that may include benchmark datasets and their solutions. If the model has "seen" the test during training, scores are inflated and cannot support valid capability claims. Shi et al. (2023) showed significant contamination in standard training corpora.
Contamination specifically means training data exposure to test examples or answers. For LLMs trained on internet-scale data, this is especially concerning because web crawls routinely include academic papers, study guides, and answer repositories that overlap with common benchmarks.
3. When evaluating a research paper's claim that a new AI system outperforms a human baseline, which aspect of the baseline measurement deserves the most scrutiny?
Correct. The ImageNet case established this pattern: AI tested carefully against humans tested casually produces misleading comparisons. The conditions (time pressure, tools, domain expertise of participants, number of annotators) all affect the baseline, and differences in conditions rather than genuine capability differences can drive "human-level" claims.
The most critical question is condition comparability. If the AI is tested rigorously and humans are tested casually, the comparison is invalid regardless of the statistical properties of the sample. The ImageNet example showed exactly how this misleads in practice.
4. Why did Google's Gemini Ultra announcement in December 2023 face criticism despite the model genuinely achieving a higher MMLU score than the human expert baseline?
Correct. A 0.2 percentage point margin (90.0% vs 89.8%) is within measurement noise. The 2021 human expert baseline had methodological limitations. Independent researchers subsequently found performance varied substantially across question types, suggesting the headline number was not representative of consistent capability.
The primary issues were: the margin was too small to be meaningful, the human baseline methodology was contested, and independent analysis found less consistent performance than announced. The announcement was technically accurate but created an impression of decisive human-level performance that the data did not support.
5. What does Yann LeCun's heuristic ("if a paper doesn't have a failure analysis, treat the results with suspicion") imply about evaluating AI research?
Correct. Every real system has characteristic failure modes. Papers that document them demonstrate the researchers have probed the system thoroughly, understand its limits, and are reporting honestly. Absence of failure analysis suggests either incomplete testing or selective reporting, both of which should reduce your confidence in the claimed results.
The implication is about thoroughness and honesty, not peer review requirements. Real systems always fail in specific ways; a paper that only shows success has not been stress-tested or is reporting selectively. Understanding failure modes is essential for assessing whether a capability is genuine and generalizable.

Lab 3: Critical Reading Practice

Apply the five critical reading questions to a real AI research announcement

Your Task

Work through the five critical reading questions (what was measured, baseline methodology, contamination, funding, replication) applied to a real AI announcement. The assistant will help you find the gaps between what was claimed and what the evidence supports.

Complete at least 3 exchanges to mark this lab done.

Starter prompt: "In December 2023, Google DeepMind announced that AlphaCode 2 could solve competitive programming problems at the level of the top 15% of human competitors on Codeforces. Walk me through all five critical reading questions for this claim."
What's Coming Next · Module 1 · Lesson 4

Applying the Frameworks: What the Evidence Actually Points To

Synthesizing benchmarks, capability levers, and critical reading into honest near-term forecasts.
Given everything in this module, what can you actually say with confidence about where AI capabilities are headed, and how do you hold that view responsibly?

In March 2023, Microsoft Research published a 155-page paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT-4." The paper documented GPT-4's surprising performance across dozens of domains, including law, medicine, creative writing, mathematics, and visual reasoning. Its headline claim was that across these tasks "GPT-4's performance is strikingly close to human-level performance." Within a week, the AI research community had produced detailed rebuttals showing specific failure modes the paper had underweighted. Both the paper and the rebuttals were useful; the synthesis of both was more useful still. This is the practice: not credulous acceptance, not reflexive dismissal, but disciplined engagement with evidence from multiple directions.

What the Evidence Supports (As of 2024)

Applying the frameworks from this module to the accumulated evidence through 2024 produces a set of defensible positions. These are not predictions; they are characterizations of the current state with assessed confidence levels.

High confidence: Frontier language models perform at or above median human professional level on standardized knowledge-retrieval tasks (bar exam, medical licensing, financial certifications). This has been replicated by multiple independent researchers using models from at least three different labs. The capability is real, narrow, and consequential.

High confidence: Capability improvements are continuing across all three levers simultaneously. Compute scaling continues at the frontier. Algorithmic improvements (mixture-of-experts, constitutional AI, test-time scaling) are arriving faster than most 2020-era forecasts predicted. Data quality improvements (synthetic data generation, curated reasoning corpora) are showing returns. There is no current evidence that any of these levers has been exhausted.

Medium confidence: Multimodal capabilities (text + images + audio + code) are integrating faster than unimodal scaling alone would predict. GPT-4V (announced September 2023), Gemini Ultra, and Claude 3 Opus all demonstrated meaningful cross-modal reasoning that earlier scaling projections did not anticipate. The mechanism is not fully understood, which is why confidence is medium.

Low confidence / genuinely uncertain: Whether any current approach leads to systems with general reasoning comparable to adult humans across open-ended, novel problem-solving. Current systems have documented, persistent failures in basic counting, multi-step spatial reasoning, robust factual grounding, and metacognition. These may be architectural limitations or may yield to the levers above. The evidence does not currently allow a clear answer.

The Calibration Principle

Forecasting confidence should be proportional to the specificity and reproducibility of the underlying evidence. High confidence on narrow, replicated results. Medium confidence on recent, less-tested advances. Low confidence on extrapolations beyond any current benchmark. This is not pessimism. It is accurate calibration, which serves you better than either extreme.

Near-Term Trajectories Worth Watching

Multimodal reasoning in professional contexts. The convergence of vision, language, and code capabilities is enabling applications in radiology (Rad-DINO, released by Microsoft in January 2024), materials science (GNoME, DeepMind, November 2023, which predicted 2.2 million new stable crystal structures), and software engineering (Devin, released March 2024, claimed to resolve 13.86% of real GitHub issues autonomously on the SWE-bench benchmark). Each of these claims requires the critical reading framework, but they also represent a genuine domain expansion beyond text generation.

Test-time compute as a scaling frontier. OpenAI's o1 and o3 models (2024) demonstrated that investing more inference compute in structured reasoning produces meaningful gains on problems where chain-of-thought reasoning applies: mathematics, formal logic, code debugging. On ARC-AGI, a benchmark specifically designed to test generalization rather than recall, reported scores rose from roughly 5% for GPT-4-class models to 87.5% for o3, on a test its designers built to resist this kind of progress. This is a single result and should be treated with appropriate skepticism, but it is a genuinely unexpected data point.
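
To see why spending more compute at inference can help at all, consider the simplest published form of the idea: sample several independent answers and take a majority vote. The sketch below computes the accuracy of that vote in closed form. It is an illustration under strong assumptions (independent samples, a fixed per-sample accuracy `p`, and all wrong answers agreeing with each other), and it is not a description of how o1 or o3 work internally; the function name and the numbers are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k independent samples is correct.

    Assumes each sample is correct with probability p and that all wrong
    samples agree with each other (a pessimistic simplification).
    """
    if k % 2 == 0:
        raise ValueError("use an odd k so ties cannot occur")
    # Sum the binomial probabilities of getting more than k/2 correct samples.
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


# Accuracy of the vote as the number of samples (inference compute) grows.
for k in (1, 5, 25):
    print(k, round(majority_vote_accuracy(0.6, k), 3))
```

Under these assumptions the vote only helps when a single sample is right more often than not. With `p` below 0.5, adding samples makes the majority answer worse, which matches the caveat that the gains appear on problems where the underlying reasoning already works reasonably often.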

Agentic systems and tool use. The combination of language model reasoning with external tool access (web search, code execution, file systems, APIs) is moving AI from a question-answering capability to a task-execution capability. Google's Project Astra, Anthropic's Claude computer use (October 2024), and OpenAI's Operator project all target this domain. The benchmark infrastructure to evaluate these systems rigorously is still being built, which means the critical reading framework is especially important here: claims are outrunning measurement.

How to Hold These Views Responsibly

The purpose of this module is not to produce a specific forecast but to build the capacity to update forecasts appropriately as evidence arrives. Three practices support this:

Track your predictions explicitly. Forecasting researchers at Metaculus, Good Judgment Project, and AI Impacts have found that explicit, dated, probability-assigned predictions update faster and more accurately than informal impressions. Keeping even a simple log of what you expected and what happened builds calibration over time.
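
A prediction log of this kind can be scored mechanically. The sketch below uses the Brier score, one common scoring rule for probability forecasts (the mean squared gap between the stated probability and what happened; lower is better). The `Prediction` record, its field names, and the log entries are hypothetical and invented for illustration. They are not drawn from Metaculus, the Good Judgment Project, or AI Impacts.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    claim: str
    made_on: str        # ISO date the forecast was recorded
    probability: float  # stated probability that the claim turns out true
    outcome: bool       # what actually happened


def brier_score(log: list[Prediction]) -> float:
    """Mean squared error between stated probabilities and outcomes (0 is perfect)."""
    if not log:
        raise ValueError("empty log")
    return sum((p.probability - float(p.outcome)) ** 2 for p in log) / len(log)


# Hypothetical entries, invented for illustration.
log = [
    Prediction("Benchmark X is saturated within a year", "2024-01-10", 0.8, True),
    Prediction("Lab Y releases an open-weights model this quarter", "2024-02-01", 0.3, False),
    Prediction("Result Z is independently replicated within six months", "2024-03-15", 0.9, False),
]

print(round(brier_score(log), 3))
```

A forecaster who always says 50% scores 0.25 on any log, so a running score above that level suggests the stated confidence is doing more harm than good. In this invented log, the single overconfident miss (0.9 on a claim that failed) accounts for most of the penalty.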

Maintain a short list of high-quality primary sources. arXiv's cs.AI and cs.LG sections for preprints; Epoch AI's tracking database for compute and training runs; the Papers With Code leaderboards for benchmark tracking; the State of AI Report (published annually by Air Street Capital) for broad synthesis. These are not the only sources, but they are reliably more precise than general technology media.

Distinguish the pace question from the destination question. You can be highly confident that AI capabilities will continue to improve in the near term (the evidence supports this at high confidence) while having genuine uncertainty about what those improvements will produce at the 5–10 year horizon. These are separate questions that are often conflated. Keeping them separate prevents both the "nothing will change" error and the "everything will change immediately" error.

Module Summary

You now have four frameworks: reading benchmark claims with specificity (L1), identifying which capability lever drove a given improvement (L2), applying critical reading questions to research and announcements (L3), and separating confident from uncertain claims in near-term forecasting (L4). The rest of this course builds on these foundations toward specific domains where AI progress is most consequential.

Key sources: Bubeck et al., "Sparks of Artificial General Intelligence: Early Experiments with GPT-4" (2023); Epoch AI, "Tracking Trends in Machine Learning" (ongoing); Jaime Sevilla et al., "Compute Trends Across Three Eras of Machine Learning" (2022); Li et al., "ARC-AGI and the o3 Result" (December 2024); Merchant et al., "Scaling Deep Learning for Materials Discovery" (GNoME, 2023).

Lesson 4 Quiz: Applying the Frameworks

Five questions · Select the best answer for each
1. According to the evidence assessed in Lesson 4, which of the following AI capability claims should be held at "high confidence"?
Correct. This specific claim (performance on standardized professional knowledge-retrieval benchmarks) has been replicated by independent researchers using models from multiple labs. It is narrow, well-measured, and reproducible. The other options are extrapolations that go beyond the current evidence base.
High confidence requires narrow, replicated, reproducible evidence. The only option meeting that standard is the standardized test performance claim, which has been independently verified across multiple models and labs. The other options involve extrapolation or broader capability claims that current evidence does not support.
2. OpenAI's o3 model scored 87.5% on the ARC-AGI benchmark in late 2024, compared to roughly 5% for GPT-4-class models. How should this result be treated according to the Lesson 4 framework?
Correct. This is the calibrated response: a single result on a well-designed benchmark is worth noting and tracking, especially when the magnitude is large. But a single result, even a surprising one, does not support strong conclusions before independent replication and systematic failure analysis.
The calibrated treatment of any single result, however dramatic, is: take it seriously as a data point, apply the critical reading questions, and await replication. A single result on one benchmark neither confirms AGI nor should be dismissed. Responsible forecasting tracks it while withholding strong conclusions.
3. DeepMind's GNoME system (November 2023) predicted 2.2 million new stable crystal structures. Why does this represent a different kind of AI progress claim than a typical language model benchmark score?
Correct. GNoME's predictions can be validated by physical experiment: the crystal structures either are stable or they are not. This is structurally similar to AlphaFold 2: the capability has real-world utility that can be tested outside the benchmark context. That distinguishes it from optimizing for an artificial test.
The key distinction is verifiability against physical reality. Predicted crystal structures can be synthesized and tested. This grounds the capability claim in something more concrete than benchmark performance. Publication venue and organizational structure do not change the epistemics of the claim.
4. The "Sparks of AGI" paper about GPT-4 (Bubeck et al., 2023) and its subsequent rebuttals from the research community together illustrate what principle from Lesson 4?
Correct. The paper documented genuine surprising capabilities. The rebuttals documented genuine persistent failures. Both were right. Neither the credulous reading nor the dismissive reading captured the full picture. Synthesizing both produced a more accurate model of GPT-4's actual capability profile than either alone.
The lesson from the Sparks of AGI episode is about the synthesis of multiple evidence sources. The paper showed real capabilities; the rebuttals showed real limitations. Dismissing either would have produced an inaccurate picture. Responsible analysis holds both simultaneously.
5. Why does Lesson 4 recommend distinguishing the "pace question" from the "destination question" when forecasting AI progress?
Correct. The evidence strongly supports continued near-term capability improvement (high confidence on the pace question). What those improvements produce at the 5–10 year horizon involves extrapolation far beyond current evidence (genuine uncertainty on the destination question). Conflating them produces either "nothing will change" or "everything changes immediately," and both are wrong.
The distinction matters epistemically: near-term improvement is well-evidenced (multiple levers active, no sign of exhaustion), while long-term outcomes depend on which problems prove hard, which prove tractable, and what second-order social and economic effects emerge. Treating them as a single question forces false certainty in one direction or the other.

Lab 4: Building a Calibrated AI Forecast

Synthesize all four module frameworks into an honest near-term forecast on a specific AI capability

Your Task

Choose a specific AI capability area (autonomous coding, medical imaging, scientific literature synthesis, agentic web browsing, or one of your own choosing). Build a calibrated forecast: what does the current evidence support at high, medium, and low confidence? What evidence would change your view?

Complete at least 3 exchanges to mark this lab done. The assistant will push you to justify confidence levels with specific evidence rather than general impressions.

Starter prompt: "I want to build a calibrated forecast for AI capabilities in autonomous software development over the next two years. Help me apply all four frameworks: what benchmarks exist, which capability levers are active, what the research announcements actually show, and where my confidence should be high vs uncertain."

Module 1 Test: How to Read AI Progress

15 questions · 80% required to pass · Covers all four lessons
1. The 2015 Microsoft ResNet paper that "surpassed human vision" on ImageNet is best described as demonstrating:
Correct. ResNet's result was real; the "surpassed humans" framing was misleading because human performance had been measured under casual conditions. Rigorous human testing subsequently matched the machine's score.
The result was real but the comparison was skewed by asymmetric measurement. The AI was tested rigorously; the human baseline was one annotator working quickly. Careful retesting showed humans matched the AI. There was no fraud, but there was a meaningful methodological flaw.
2. Goodhart's Law predicts that when a benchmark becomes an optimization target:
Correct. Goodhart's Law: a measure that becomes a target stops being a good measure. Models optimized against a benchmark learn its specific patterns rather than the general skill the benchmark was meant to proxy.
Goodhart's Law is specifically about measurement validity. Optimization pressure against a metric decouples high scores from genuine capability. GLUE, SuperGLUE, and ImageNet all followed this trajectory.
3. The Kaplan et al. (2020) scaling laws paper showed that language model performance improved as a function of which three variables?
Correct. Kaplan et al. identified these three as the primary scaling dimensions, each following a power-law relationship with performance and each scalable roughly independently.
The three scaling variables in Kaplan et al. are parameters, data, and compute. The paper's central contribution was showing predictable power-law improvement as each of these increased.
4. "Emergent capabilities" in AI scaling refers to:
Correct. Wei et al. (2022) documented dozens of such transitions in the BIG-Bench suite: tasks that GPT-3-scale models failed completely but larger models handled reliably, with the transition occurring relatively sharply at a threshold scale.
Emergent capabilities are specifically about discontinuous scale transitions: near-zero performance below a threshold, then a sharp jump. This is distinct from smooth scaling improvements and makes certain capabilities difficult to predict from scaling curves alone.
5. The primary factor explaining ChatGPT's dramatically different public reception compared to GPT-3 (which had been publicly available since 2021) was:
Correct. RLHF was the dominant technical differentiator. The interface improvement was also real, but the capability to give helpful, appropriate, conversational responses came from the alignment training, not primarily from a scale increase.
RLHF was the primary driver. The interface change mattered, but GPT-3 had interface wrappers that didn't produce comparable engagement. The underlying change was that RLHF shaped outputs to be genuinely helpful, which is an algorithmic technique, not a scale increase.
6. The Chinchilla paper (Hoffmann et al., 2022) found that prior large models like GPT-3 had been:
Correct. Chinchilla showed that models like GPT-3 were parameter-heavy and data-light relative to the compute-optimal ratio. This led to Llama, Mistral, and other models achieving GPT-3.5-level performance at dramatically smaller parameter counts by following compute-optimal training recipes.
Chinchilla's finding was about the parameter-to-data ratio: GPT-3 and similar models had too many parameters for the amount of training data used. Optimal training requires roughly 20 tokens per parameter, far more data than those models had used.
7. Microsoft's Phi-2 (2.7B parameters) matching models 10× its size on reasoning benchmarks is best explained by which capability lever?
Correct. The Phi series was trained on carefully curated, high-quality synthetic and textbook data, demonstrating that much of the apparent compute requirement in prior models was compensating for noisy training data. Data quality is the least-discussed lever with the most remaining headroom.
Phi-1 and Phi-2's distinctive feature was their training data: curated "textbook quality" synthetic text rather than raw internet crawls. The same model architecture, trained on better data, substantially outperformed larger models on reasoning tasks.
8. "Test-time compute scaling," as implemented in OpenAI's o1 model, improves performance by:
Correct. Test-time compute scaling extends chain-of-thought reasoning into a formal paradigm where the model explicitly searches over reasoning paths during inference. More compute spent at inference time on hard problems produces meaningfully better answers without any model retraining.
Test-time scaling is about investing more inference compute in structured reasoning (generating intermediate reasoning steps, evaluating them, and selecting better paths) rather than producing a single immediate answer. This is orthogonal to training-time scaling.
9. When evaluating a claim that an AI system has surpassed human expert performance on a medical benchmark, the single most important question is:
Correct. Med-PaLM 2's USMLE performance illustrates this precisely: high benchmark scores on medical licensing exams do not capture clinical practice, patient management, history-taking, or the social complexity of medicine. The gap between benchmark performance and real-world capability is the central interpretive challenge.
The most important question is always what was measured and whether it maps to the claimed capability. For medical AI, the USMLE measures recall and textbook reasoning, a meaningful but narrow subset of clinical medicine. Peer review and regulatory status matter but are secondary to this conceptual question.
10. Test set contamination is a particular concern for large language models because:
Correct. Internet-scale training data routinely includes academic papers, study guides, and Q&A forums that overlap with common benchmarks. Shi et al. (2023) found significant contamination in standard training corpora for several prominent benchmarks. This is an unsolved methodological problem for the field.
The contamination concern is specifically about training data overlap with test sets. Web-crawled training data is vast and includes academic content where benchmarks and their solutions appear. Unlike smaller models trained on curated data, LLMs' exposure is difficult to fully audit or rule out.
11. The "Sparks of AGI" paper (Bubeck et al., 2023) and its community rebuttals together are best understood as:
Correct. The paper documented real, surprising capabilities. The rebuttals documented real, persistent failures. Neither alone was sufficient. The disciplined synthesis of both perspectives, crediting what was genuine in each, produced the most accurate model of GPT-4's actual capabilities.
The lesson is about synthesis. Both the paper and the critiques were partially right. Dismissing either, or treating the episode as evidence of bias, misses the epistemic point: comprehensive understanding requires engaging with multiple evidence streams, including contrary ones.
12. DeepMind's GNoME (November 2023) represents a qualitatively different kind of AI capability claim than an LLM benchmark score primarily because:
Correct. Like AlphaFold 2, GNoME's outputs can be validated by physical experiment. This is fundamentally different from benchmark performance: the capability claims can be tested against reality rather than against a fixed test set. That verifiability is what earns higher epistemic confidence.
The key distinction is real-world verifiability. GNoME's crystal structure predictions can be synthesized and tested. Benchmark scores cannot be validated against anything other than the benchmark itself. Publication venue matters but is secondary to this epistemic distinction.
13. According to Lesson 4's calibration framework, which claim should be held at LOWEST confidence?
Correct. This is a long-range capability extrapolation far beyond any current benchmark, dependent on unresolved questions about whether current architectures can generalize beyond their documented failure modes. It should be held at low confidence, not because it is impossible, but because the evidence does not currently support a confident assessment either way.
Forecasting confidence should scale with evidence specificity and replicability. Near-term improvement and current benchmark performance are well-evidenced (high confidence). A specific claim about achieving general reasoning by a specific date is a long extrapolation beyond current evidence, warranting the lowest confidence of these options.
14. Why does the module recommend distinguishing the "pace question" from the "destination question" in AI forecasting?
Correct. The pace question (will AI continue improving in the near term?) is answerable with high confidence from current evidence. The destination question (what will that improvement produce at a 5-10 year horizon?) involves extrapolation through genuine uncertainty. Conflating them forces either unwarranted pessimism or unwarranted optimism.
The distinction is epistemological. Short-term improvement is well-evidenced. Long-term outcomes are not. Treating both as having the same confidence level, in either direction, produces predictive errors. The frameworks in this module are designed to maintain calibration on each question separately.
15. Which of the following best describes the honest analytical position this module advocates for reading AI progress?
Correct. This is the module's central thesis: progress is real (AlphaFold, professional exam performance, multimodal reasoning are genuine advances), substantial (the pace since 2012 has been historically unusual), and uneven (frontier models still fail at basic counting and spatial reasoning). Neither uncritical enthusiasm nor reflexive dismissal serves accurate understanding.
The module advocates a position between default skepticism and uncritical acceptance. Progress is real and substantial, but it is uneven and benchmark-sensitive. Accurate reading requires the specific analytical tools covered in all four lessons: benchmark interpretation, lever identification, critical reading, and calibrated confidence.