In September 1878, the New York Sun ran a skeptical editorial about Thomas Edison's claims for electric light. The piece argued that practical incandescent illumination was "sheer nonsense" and that gas companies had nothing to worry about. Edison filed for the patent on his working bulb two months later. The pattern is not confined to electricity: in 1943, IBM chairman Thomas Watson reportedly said the world market for computers was perhaps five machines. In 1981, Bill Gates is widely quoted, perhaps apocryphally, as saying that 640 kilobytes of memory ought to be enough for anyone. Smart people, close to the technology, repeatedly failed to see the trajectory they were standing on.
The same failure mode is operating right now around artificial intelligence. Between 2012 and 2022, AI systems went from barely recognizing cats in photos to writing legal briefs, generating photorealistic images, scoring around the 90th percentile on the bar exam, and folding proteins that had stumped biochemists for fifty years. Each of those milestones was declared impossible or "decades away" by credentialed experts just before it happened. The capability curve has not leveled off. Knowing how to read it (what counts as real progress, what is hype, and what mechanisms actually drive improvement) is the core skill this course builds.
This course, What's Coming Next, will not tell you which specific products will exist in 2030. Nobody knows that. What it will give you is a set of durable frameworks: how to evaluate benchmark claims, how to distinguish scaling gains from architectural breakthroughs, how to spot the difference between a genuine capability jump and a well-funded press release. Four modules, starting right here with how to read progress itself.
If you finish every module, here's who you become:
On December 10, 2015, a paper from Microsoft Research announced that its ResNet model had achieved a 3.57% error rate on the ImageNet Large Scale Visual Recognition Challenge, beating the commonly cited human baseline of 5.1%. Tech headlines declared that AI could now "see better than humans." What the headlines did not mention: the human baseline had been measured on a random sample of 1,500 images by a single annotator working quickly. When researchers ran a more careful human test, trained annotators scored around 3.5%, essentially matching the machine. The AI had not surpassed human vision. It had matched a specific human, on a specific dataset, on a specific task. The distinction matters enormously.
That episode established a template repeated constantly since: a real capability advance gets wrapped in a misleading comparison, the comparison travels faster than the correction, and policy makers, investors, and the public build mental models on flawed foundations. Reading AI progress well starts with learning to disaggregate the claim from the benchmark from the underlying capability.
A benchmark is a standardized test: a fixed dataset with a scoring rule. It measures one thing: performance on that dataset under those rules. It does not measure general capability, robustness, real-world usefulness, or what the system will do on inputs outside the test set. Every AI benchmark has these properties whether researchers acknowledge them or not.
The ImageNet challenge subset (ILSVRC), first run in 2010, contains about 1.2 million training images across 1,000 categories. Achieving low error on it is genuinely impressive, but it tells you almost nothing about how well a system handles medical images, satellite photos, handwritten documents, or any image type underrepresented in the dataset. When GPT-4 scored in the 90th percentile on the Uniform Bar Examination in March 2023, that was a real achievement. It also tells you nothing about whether the model can reliably navigate a real client intake, maintain confidentiality across sessions, or recognize when a question is outside its competence.
Goodhart's Law is the central hazard: once a benchmark becomes the target, it ceases to be a good measure. AI labs optimize their training pipelines against benchmark datasets, sometimes inadvertently, sometimes deliberately. A model that scores 95% on a reading comprehension benchmark may score 60% on questions that test the same skill but are phrased differently. The benchmark has been solved; the underlying capability has not necessarily been acquired.
Every time you see "AI achieves human-level performance," ask: human-level on what specific task, measured by what specific test, compared against which humans doing the task under what conditions? All four answers change the meaning of the claim.
In 2020, OpenAI researchers published "Scaling Laws for Neural Language Models," demonstrating that language model performance improved predictably as researchers increased three variables: the number of model parameters, the volume of training data, and the amount of compute used for training. The relationship followed a power law: each order-of-magnitude increase in compute produced a roughly fixed percentage improvement in loss.
This was a genuinely important finding because it suggested that progress was not dependent on new algorithmic breakthroughs: you could simply build bigger and predict roughly how much better your system would get. GPT-3 (175 billion parameters, released June 2020) and GPT-4 (architecture not fully disclosed, released March 2023) both followed this logic. So did Google's PaLM (540 billion parameters, April 2022) and Anthropic's Claude series.
But scaling laws have limits. The curves measure a specific loss metric (how well the model predicts the next token), which does not map cleanly onto all downstream tasks. A model can improve its training loss while making the same types of factual errors. More troublingly, some capabilities appear to emerge discontinuously: a model at one scale fails completely at a task, then at a larger scale succeeds reliably. These emergent capabilities are difficult to predict from the scaling curves alone, which complicates forecasting.
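A small sketch can make the "fixed percentage per order of magnitude" property concrete. It assumes the simplest power-law form, L(C) = (c_ref / C) ** alpha, and an illustrative exponent of 0.05; both the function names and the exponent are assumptions chosen for illustration, not values taken from this lesson.

```python
def loss(compute: float, alpha: float = 0.05, c_ref: float = 1.0) -> float:
    """Power-law loss curve: L(C) = (c_ref / C) ** alpha.

    alpha and c_ref are illustrative assumptions, not fitted values.
    """
    return (c_ref / compute) ** alpha


def improvement_per_decade(alpha: float) -> float:
    """Fractional loss reduction from any 10x increase in compute."""
    return 1.0 - 10.0 ** (-alpha)


if __name__ == "__main__":
    # Each row uses 10x the compute of the previous one.
    for c in (1e0, 1e1, 1e2, 1e3):
        print(f"compute={c:>6.0f}  loss={loss(c):.4f}")
    # The ratio between consecutive rows is constant, which is the
    # defining property of a power law.
    print(f"reduction per 10x compute: {improvement_per_decade(0.05):.1%}")
```

With this exponent, every tenfold increase in compute trims the loss by about 11%, no matter where on the curve you start. That is why the gains are predictable, and also why they become expensive.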
Two failure modes dominate public discourse about AI progress: uncritical hype and reflexive dismissal. Both produce equally wrong predictions. The hype failure mode latches onto every benchmark record and extrapolates to general intelligence. The dismissal failure mode latches onto every chatbot error and concludes the technology is fundamentally limited.
The evidence-based approach requires holding two things simultaneously: real, documented, substantial progress has occurred across many domains since 2012; and that progress has been uneven, benchmark-sensitive, and repeatedly mischaracterized. AlphaFold 2, released by DeepMind in July 2021, solved the protein structure prediction problem that had resisted biology for fifty years. That is a genuine scientific breakthrough with real consequences for drug discovery. At the same time, large language models routinely fail at simple counting tasks, spatial reasoning, and multi-step arithmetic that a ten-year-old handles without difficulty. Both facts are true.
The frameworks in the rest of this module (how compute translates to capability, how to read research papers about AI, and how to evaluate claims about what's coming) all depend on starting from this honest baseline: progress is real, substantial, and uneven, and reading it well requires specificity rather than either enthusiasm or cynicism.
This module builds four skills: reading benchmark claims critically (L1), understanding what drives capability jumps (L2), evaluating AI research papers and announcements (L3), and applying these frameworks to specific near-term AI developments (L4). Each lesson includes a hands-on lab to practice the skill against real examples.
Key sources for this lesson: Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge" (2015); He et al., "Deep Residual Learning for Image Recognition" (2015); Kaplan et al., "Scaling Laws for Neural Language Models" (2020); Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022); Wei et al., "Emergent Abilities of Large Language Models" (2022).
You will interrogate AI benchmark claims the way a careful analyst would. The AI assistant has been primed with the Lesson 1 frameworks. Use it to work through the specific questions below, or bring your own benchmark claim to examine.
Complete at least 3 exchanges to mark this lab done. Push back, ask follow-ups, and try to find the limits of what a benchmark claim actually tells you.
On November 30, 2022, OpenAI released ChatGPT. Within five days it had one million users; within two months, one hundred million, the fastest consumer application ramp recorded up to that point. Analysts scrambled to explain the leap from GPT-3 (which had been available through OpenAI's API since 2020 and had not caused comparable disruption) to this new system. The common explanation was scale: the new model must be much larger. That explanation was wrong, or at least incomplete. The dominant factor was reinforcement learning from human feedback (RLHF), a training technique developed over several years at OpenAI, DeepMind, and later Anthropic, which aligned model outputs with human preferences. The capability jump was primarily an alignment and interface improvement, not a raw compute increase. Understanding which lever moved is the central skill.
AI capability improvements come from three separable sources, and distinguishing them matters for forecasting:
Lever 1: Compute & Scale. More parameters, more training data, more GPU-hours. This lever has driven the majority of headline progress since 2012. It is the most predictable lever, because scaling laws let researchers estimate gains in advance. It is also the most expensive and subject to diminishing returns. The 2022 Chinchilla paper (Hoffmann et al.) showed that most large models had been undertrained relative to their parameter count: they had scaled parameters without proportionally scaling data, leaving performance on the table.
Lever 2: Algorithmic Improvement. New architectures, training techniques, or inference methods that increase efficiency independent of raw scale. The transformer architecture (Vaswani et al., 2017) was an algorithmic breakthrough that enabled the entire modern LLM era. RLHF was an algorithmic breakthrough. Mixture-of-experts architectures (used in Mixtral and reportedly in GPT-4) let a model hold far more total parameters while activating only a fraction of them for each token, so capacity grows faster than inference cost. Algorithmic improvements are less predictable than scaling because they arrive irregularly, but their effects can be dramatic and they reduce the compute cost of reaching a given capability level.
Lever 3: Data Quality & Curation. The content and structure of training data, not just its volume. Phi-1 (June 2023) and Phi-2 (December 2023), released by Microsoft Research, demonstrated that small models (Phi-2 has 2.7 billion parameters) trained on carefully curated "textbook quality" data could match models roughly ten times their size on several reasoning benchmarks. The implication: much of the apparent scale requirement in earlier models was compensating for noisy, low-quality training data. Data curation is arguably the least-discussed lever and may have the largest remaining headroom.
When evaluating a claimed AI advance, identify which lever drove it. Scale advances are predictable and continuous. Algorithmic advances are irregular and potentially large. Data quality advances are underappreciated and may compound with both of the above.
The three levers interact. The Chinchilla paper's central finding was that compute-optimal training requires scaling parameters and training data in roughly equal proportion: for a similar compute budget, a model with 10× more parameters but the same data is less efficient than one with 3× more parameters and 3× more data. This reframing shaped Llama 2 (Meta, July 2023) and Mistral 7B (September 2023), both of which achieved GPT-3.5-class performance on many benchmarks at a fraction of the parameter count by training comparatively small models on far more data.
The practical implication: capability improvements do not require ever-larger models. The field is simultaneously scaling up (GPT-4, Gemini Ultra, Claude 3 Opus) and scaling down while maintaining performance (Llama 2 13B, Phi-2, Gemma 7B). Both trends are real and both carry forecasting implications: the former about what frontier systems can do, the latter about how widely capable AI can be deployed.
A second interaction effect: inference-time compute is emerging as a fourth lever. Chain-of-thought prompting (Wei et al., 2022), which asks models to reason step by step before answering, substantially improves performance on multi-step problems, not by changing the model, but by changing how much compute is used at inference time. OpenAI's o1 model (September 2024) extended this into a formal test-time compute scaling paradigm, where the model explicitly searches over reasoning steps. This suggests the scaling story is more complex than "bigger training = better model."
Each lever has a different forecasting signature. Compute scaling is expensive and slowing at the frontier: training runs for GPT-4-class models reportedly cost over $100 million, and the marginal returns per dollar are decreasing. If compute were the only lever, the pace of capability progress would be determined almost entirely by how much capital AI labs can deploy. But algorithmic improvement and data curation are not capital-constrained in the same way, since a small team with a good idea can publish something that shifts the trajectory.
The honest forecaster's position in 2024 is: compute scaling continues but is increasingly expensive; algorithmic improvements are arriving faster than most predicted (attention mechanisms, RLHF, mixture-of-experts, chain-of-thought, test-time scaling all emerged within a decade); data quality improvements have significant remaining headroom. Taken together, these suggest continued capability progress at a pace that is likely to remain faster than the public's prior suggests, though no single lever is guaranteed to remain productive indefinitely.
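The parameter-versus-data trade-off can be checked with back-of-envelope arithmetic. The sketch below assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens, and uses a hypothetical baseline of 1 billion parameters trained on 20 billion tokens. The approximation, the baseline, and all names are assumptions for illustration, not figures from the Chinchilla paper.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the rule of thumb C ~ 6 * N * D."""
    return 6.0 * params * tokens


# Hypothetical baseline: 1B parameters, 20B tokens (20 tokens per parameter).
base_params, base_tokens = 1e9, 20e9
base = training_flops(base_params, base_tokens)

# Option A: 10x the parameters, same data.
params_only = training_flops(10 * base_params, base_tokens)
# Option B: 3x the parameters and 3x the data.
balanced = training_flops(3 * base_params, 3 * base_tokens)

if __name__ == "__main__":
    print(f"params-only: {params_only / base:.0f}x compute, "
          f"{base_tokens / (10 * base_params):.0f} tokens per parameter")
    print(f"balanced:    {balanced / base:.0f}x compute, "
          f"{3 * base_tokens / (3 * base_params):.0f} tokens per parameter")
```

Under these assumptions the two options cost nearly the same (10× versus 9× the baseline compute), but the parameter-only option leaves the model with a tenth of the data per parameter. That imbalance is what the Chinchilla analysis identified as wasteful.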
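One simple way to see how extra inference compute can help without touching the model is majority voting over repeated samples (often called self-consistency). This is a different and much simpler technique than the reasoning search attributed to o1 above; it is used here only as an illustration. The sketch assumes a binary right-or-wrong answer, independent samples, and a hypothetical per-sample accuracy.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a majority of k independent samples is correct.

    p is the per-sample accuracy; k must be odd so ties cannot occur.
    Assumes samples are independent, which real model outputs are not.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd")
    # Sum the binomial probabilities of getting more than k/2 correct.
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


if __name__ == "__main__":
    # Hypothetical model that is right 60% of the time on a single sample.
    for k in (1, 5, 25):
        print(f"samples={k:>2}  accuracy={majority_vote_accuracy(0.6, k):.3f}")
```

The model is identical in every row; only the number of samples changes. Because real samples are correlated, actual gains are smaller than this idealized calculation suggests.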
Key sources: Vaswani et al., "Attention Is All You Need" (2017); Stiennon et al., "Learning to Summarize from Human Feedback" (2020); Hoffmann et al., "Training Compute-Optimal Large Language Models / Chinchilla" (2022); Gunasekar et al., "Textbooks Are All You Need / Phi-1" (2023); Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022).
For each AI development below, identify which lever (or combination) drove the improvement and what that implies for future progress. Discuss with the assistant to sharpen your analysis.
Complete at least 3 exchanges to mark this lab done.
On May 10, 2023, Google announced Med-PaLM 2, a medical AI model that had scored 86.5% on USMLE-style questions (the MedQA benchmark, drawn from United States Medical Licensing Examination material), well above the approximate passing threshold of 60% and within the range of expert physician performance. News coverage declared that AI was approaching doctor-level medical knowledge. What the coverage typically omitted: USMLE-style questions test recall and reasoning about textbook cases, not the ability to take a patient history, examine a patient, manage uncertainty across a relationship spanning years, or navigate the social complexity of delivering a difficult diagnosis. The benchmark was real. The generalization from the benchmark was not. No lie was told; the inference was simply unsupported by the evidence presented.
These five questions apply to AI research papers, blog posts, press releases, and news articles equally:
1. What exactly was measured? Get specific. Not "medical knowledge" but the USMLE multiple-choice subset. Not "human-level reasoning" but performance on BIG-Bench Hard at a specific temperature setting. The task specificity almost always narrows the claim significantly.
2. Who is the comparison baseline, and how was it measured? As in the ImageNet case, the human (or prior model) baseline is frequently measured under different conditions than the AI system. If the paper doesn't describe how the baseline was produced, treat the comparison with caution.
3. Was there test set contamination? Large language models are trained on internet text, which may include benchmarks and their answer keys. If a model's training data includes the test set it is being evaluated on, the score no longer measures what it claims to measure. Contamination is difficult to fully rule out and is frequently under-discussed in papers. The 2023 paper "Detecting Pretraining Data from Large Language Models" (Shi et al.) proposed a method for testing whether a given text appeared in a model's pretraining data and applied it to flag benchmark examples that had leaked into training corpora.
4. Who funded the research, and do they have a stake in the result? This does not mean industry research is invalid, since much of the most important AI research comes from labs with commercial interests. But it should affect your prior. A paper from OpenAI showing GPT-4 outperforms competitors deserves the same skepticism you would apply to a pharmaceutical company's trial of its own drug.
5. Has it been independently replicated? Many headline AI results have not been independently replicated at the time of announcement. The peer review process in machine learning is often post-hoc: papers appear on arXiv before review, and high-profile results at major conferences have been retracted. Independent replication is the strongest evidence that a result is real.
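Question 3 is often probed with a word n-gram overlap check between benchmark items and training documents. The sketch below is a minimal version of that idea, not the method of any particular paper. The function names, the whitespace tokenization, and the default window of 8 words are all assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the corpus.

    A value near 1.0 suggests the item may have been seen during training.
    Items shorter than n words have no n-grams and return 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical benchmark question and two hypothetical training documents.
    question = "which organ is primarily responsible for filtering blood in the human body"
    corpus = [
        "quiz dump: which organ is primarily responsible for filtering blood in the human body answer kidney",
        "an unrelated page about gardening and soil preparation in spring",
    ]
    print(f"overlap: {overlap_fraction(question, corpus):.2f}")
```

A check like this only catches verbatim or near-verbatim leakage. Paraphrased or translated copies of a test item pass straight through, which is one reason contamination is hard to rule out.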
Google's announcement of Gemini Ultra (December 2023) claimed it was the first model to surpass human expert performance on MMLU (Massive Multitask Language Understanding). The claim was technically accurate (Gemini Ultra scored 90.0% vs. the 89.8% human expert baseline), but the margin was within noise, and the human baseline (from 2021) had been criticized for being measured under conditions favorable to the AI comparison. Independent analysis by researchers at Stanford and MIT subsequently found performance more variable across question types than the announcement suggested.
Machine learning papers follow a recognizable structure that, once understood, lets you extract the essential information quickly. The abstract and introduction describe the claimed contribution. The methods section describes what was built and how. The experiments section is where the actual evidence lives, and where careful readers focus most attention.
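"Within noise" can be made concrete with a standard error calculation. The sketch below treats each benchmark question as an independent pass/fail trial and assumes a test set of about 14,000 questions. That size is an assumption for illustration, and the calculation ignores any uncertainty in the human baseline itself.

```python
from math import sqrt


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return sqrt(accuracy * (1.0 - accuracy) / n_questions)


N_QUESTIONS = 14_042  # assumed test-set size, for illustration only
model_acc, human_acc = 0.900, 0.898  # figures quoted in the text above

se = accuracy_standard_error(model_acc, N_QUESTIONS)
gap = model_acc - human_acc

if __name__ == "__main__":
    print(f"standard error: {se:.4f}")
    print(f"reported gap:   {gap:.4f}")
    print(f"gap is {gap / se:.2f} standard errors")
```

Under these assumptions the reported 0.2 percentage point gap is smaller than one standard error of the model's own score, so it cannot by itself support a claim of surpassing the baseline.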
In the experiments section, look for: the specific benchmarks used and whether they are well-validated; the ablation studies (tests that remove one component at a time to establish what each contributes); the failure modes and limitations section (often in an appendix and often understated); and the comparison models (whether comparisons use the same compute budget and whether implementations are from the original authors or reimplemented).
A useful heuristic, attributed here to Yann LeCun (Chief AI Scientist at Meta): "If a paper doesn't have a failure analysis, treat the results with suspicion." Real systems fail in specific, diagnosable ways. A paper that only shows success cases is either cherry-picking or has not been stress-tested.
For press releases and blog posts, the additional filter is: what are they not saying? A company releasing a model will highlight the benchmarks it performs well on and omit those where it underperforms. Reading competitor announcements is sometimes more informative than reading a company's own, because competitors have an incentive to surface the genuine weaknesses.
AI announcements follow a recognizable cycle in the current environment: model released → benchmark numbers published → tech press coverage → broader media coverage → policy response. At each stage, precision decreases. The benchmark numbers in the original paper are usually accurate (though subject to the caveats above). By the time the result reaches a news article, the context is often stripped. By the time it influences policy, the original paper may be months old and partially superseded.
The corrective is to maintain a short list of primary sources: arXiv (for preprints), the original lab blogs, and a small set of researchers who have demonstrated ability to read papers carefully and report on them honestly. The AI research community has several such people: Andrej Karpathy, Lilian Weng, Percy Liang, and others have built reputations for technical accuracy. Following them rather than general tech media substantially improves signal quality.
Key sources: Singhal et al., "Towards Expert-Level Medical Question Answering with Large Language Models / Med-PaLM 2" (2023); Hendrycks et al., "Measuring Massive Multitask Language Understanding / MMLU" (2020); Shi et al., "Detecting Pretraining Data from Large Language Models" (2023); Gemini Team, "Gemini: A Family of Highly Capable Multimodal Models" (2023).
Work through the five critical reading questions (what was measured, baseline methodology, contamination, funding, replication) applied to a real AI announcement. The assistant will help you find the gaps between what was claimed and what the evidence supports.
Complete at least 3 exchanges to mark this lab done.
In March 2023, Microsoft published a 155-page paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT-4." The paper documented GPT-4's surprising performance across dozens of domains: law, medicine, creative writing, mathematics, visual reasoning. The summary was measured in tone: "GPT-4's performance is strikingly close to human-level performance." Within a week, the AI research community had produced detailed rebuttals showing specific failure modes the paper had underweighted. Both the paper and the rebuttals were useful; the synthesis of both was more useful still. This is the practice: not credulous acceptance, not reflexive dismissal, but disciplined engagement with evidence from multiple directions.
Applying the frameworks from this module to the accumulated evidence through 2024 produces a set of defensible positions. These are not predictions; they are characterizations of the current state with assessed confidence levels.
High confidence: Frontier language models perform at or above median human professional level on standardized knowledge-retrieval tasks (bar exam, medical licensing, financial certifications). This has been replicated by multiple independent researchers using models from at least three different labs. The capability is real, narrow, and consequential.
High confidence: Capability improvements are continuing across all three levers simultaneously. Compute scaling continues at the frontier. Algorithmic improvements (mixture-of-experts, constitutional AI, test-time scaling) are arriving faster than most 2020-era forecasts predicted. Data quality improvements (synthetic data generation, curated reasoning corpora) are showing returns. There is no current evidence that any of these levers has been exhausted.
Medium confidence: Multimodal capabilities (text + images + audio + code) are integrating faster than unimodal scaling alone would predict. GPT-4V (October 2023), Gemini Ultra, and Claude 3 Opus all demonstrated meaningful cross-modal reasoning that earlier scaling projections did not anticipate. The mechanism is not fully understood, which is why confidence is medium.
Low confidence / genuinely uncertain: Whether any current approach leads to systems with general reasoning comparable to adult humans across open-ended, novel problem-solving. Current systems have documented, persistent failures in basic counting, multi-step spatial reasoning, robust factual grounding, and metacognition. These may be architectural limitations or may yield to the levers above. The evidence does not currently allow a clear answer.
Forecasting confidence should be proportional to the specificity and reproducibility of the underlying evidence. High confidence on narrow, replicated results. Medium confidence on recent, less-tested advances. Low confidence on extrapolations beyond any current benchmark. This is not pessimism. It is accurate calibration, which serves you better than either extreme.
Multimodal reasoning in professional contexts. The convergence of vision, language, and code capabilities is enabling applications in radiology (Rad-DINO, released by Microsoft in January 2024), materials science (GNoME, DeepMind, November 2023, which predicted 2.2 million new crystal structures, about 380,000 of them predicted to be stable), and software engineering (Devin, released by Cognition in March 2024, claimed to resolve 13.86% of real GitHub issues on the SWE-bench benchmark autonomously). Each of these claims requires the critical reading framework, but they also represent a genuine domain expansion beyond text generation.
Test-time compute as a scaling frontier. OpenAI's o1 and o3 models (2024) demonstrated that investing more inference compute in structured reasoning produces meaningful gains on problems where chain-of-thought reasoning applies: mathematics, formal logic, code debugging. Performance on ARC-AGI (a benchmark specifically designed to test generalization rather than recall) rose from roughly 5% for GPT-4-class models to 87.5% for o3 in a high-compute configuration, on a benchmark whose designers built it to be difficult for exactly this kind of system. This is a single result and should be treated with appropriate skepticism, but it is a genuinely unexpected data point.
Agentic systems and tool use. The combination of language model reasoning with external tool access (web search, code execution, file systems, APIs) is moving AI from a question-answering capability to a task-execution capability. Google's Project Astra, Anthropic's Claude computer use (October 2024), and OpenAI's Operator project all target this domain. The benchmark infrastructure to evaluate these systems rigorously is still being built, which means the critical reading framework is especially important here: claims are outrunning measurement.
The purpose of this module is not to produce a specific forecast but to build the capacity to update forecasts appropriately as evidence arrives. Three practices support this:
Track your predictions explicitly. Forecasting researchers at Metaculus, Good Judgment Project, and AI Impacts have found that explicit, dated, probability-assigned predictions update faster and more accurately than informal impressions. Keeping even a simple log of what you expected and what happened builds calibration over time.
Maintain a short list of high-quality primary sources. arXiv's cs.AI and cs.LG sections for preprints; Epoch AI's tracking database for compute and training runs; the Papers With Code leaderboards for benchmark tracking; the State of AI Report (published annually by Air Street Capital) for broad synthesis. These are not the only sources, but they are reliably more precise than general technology media.
Distinguish the pace question from the destination question. You can be highly confident that AI capabilities will continue to improve in the near term (the evidence supports this at high confidence) while having genuine uncertainty about what those improvements will produce at the 5 to 10 year horizon. These are separate questions that are often conflated. Keeping them separate prevents both the "nothing will change" error and the "everything will change immediately" error.
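The first practice, tracking predictions explicitly, needs very little tooling. The sketch below assumes a simple in-memory log and scores resolved predictions with the Brier score (mean squared difference between stated probability and outcome, where lower is better). The class name, fields, and example entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    made_on: str                     # date the prediction was recorded
    claim: str                       # what you expect to happen
    probability: float               # your stated probability, 0.0 to 1.0
    outcome: Optional[bool] = None   # None until the claim resolves


def brier_score(log: list[Prediction]) -> float:
    """Mean squared error between probabilities and resolved outcomes."""
    resolved = [p for p in log if p.outcome is not None]
    if not resolved:
        raise ValueError("no resolved predictions to score")
    return sum((p.probability - float(p.outcome)) ** 2 for p in resolved) / len(resolved)


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    log = [
        Prediction("2024-01-15", "hypothetical claim A", 0.8, outcome=True),
        Prediction("2024-02-01", "hypothetical claim B", 0.3, outcome=False),
        Prediction("2024-03-10", "hypothetical claim C", 0.5),  # unresolved
    ]
    print(f"Brier score: {brier_score(log):.3f}")
```

Always answering 50% yields a Brier score of 0.25, so a running score meaningfully below that is a rough sign that your stated probabilities carry information.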
You now have four frameworks: reading benchmark claims with specificity (L1), identifying which capability lever drove a given improvement (L2), applying critical reading questions to research and announcements (L3), and calibrating confident from uncertain claims in near-term forecasting (L4). The rest of this course builds on these foundations toward specific domains where AI progress is most consequential.
Key sources: Bubeck et al., "Sparks of Artificial General Intelligence: Early Experiments with GPT-4" (2023); Epoch AI, "Tracking Trends in Machine Learning" (ongoing); Sevilla et al., "Compute Trends Across Three Eras of Machine Learning" (2022); Li et al., "ARC-AGI and the o3 Result" (December 2024); Merchant et al., "Scaling Deep Learning for Materials Discovery / GNoME" (2023).
Choose a specific AI capability area (autonomous coding, medical imaging, scientific literature synthesis, agentic web browsing, or one of your own choosing). Build a calibrated forecast: what does the current evidence support at high, medium, and low confidence? What evidence would change your view?
Complete at least 3 exchanges to mark this lab done. The assistant will push you to justify confidence levels with specific evidence rather than general impressions.