The Future of Intelligence · Introduction

Every Age Believes Its Tool Is the Last Tool

A course about reading where machine intelligence is going — and why the trajectory matters more than today's headlines.

In September 1878, Thomas Edison announced that he had solved the problem of electric lighting. He had not — the practical incandescent bulb was still fourteen months away. Yet within weeks, gas-company stocks on the London Stock Exchange dropped sharply. Investors, journalists, and engineers did not wait for proof; they adjusted their expectations to a perceived trajectory. By 1882, Edison's Pearl Street Station in lower Manhattan was supplying current to 85 customers. By 1900, electricity had begun restructuring factory layouts, urban planning, and the nature of night itself — changes that Edison's 1878 announcement had implied but that almost nobody had traced to their full conclusion.

The same compression of expectation and reality is happening now with machine intelligence. In November 2022, OpenAI released ChatGPT to the public. Within five days it had one million users; within two months, one hundred million — the fastest consumer product adoption ever recorded at that point. Goldman Sachs estimated in March 2023 that generative AI could automate tasks equivalent to 300 million full-time jobs globally. Whether that figure proves accurate or wildly off, the structural fact is identical to 1878: a technology has crossed a visibility threshold, and the world is adjusting its expectations before the full consequences are clear.

This course is not a prediction machine. It offers frameworks — ways of reading capability curves, understanding where performance comes from, recognizing historical analogues, and thinking clearly about second-order effects. You will finish with a better vocabulary for uncertainty, not a confident forecast. That is the honest and, we think, the more useful outcome. The four modules move from trajectory-reading, to how these systems actually work, to where they are being deployed, to what governing them might actually require.

If you finish every module, here's who you become:

You'll understand why capability curves and historical analogues matter more than any single headline about AI.
You can assess an AI claim — separating genuine trajectory shifts from noise — without defaulting to hype or dismissal.
You'll know how generative AI systems actually produce outputs, and why that shapes both their power and their limits.
You'll be able to map second-order effects: tracing how an AI deployment ripples into labor markets, institutions, and human purpose.
You're becoming someone who holds uncertainty precisely — with a working vocabulary for what we don't yet know and why that matters.
You'll have thought through what governing transformative AI actually requires, not in theory but in terms of real institutional gaps.
You leave with a considered answer to a question most people avoid: what role you specifically want to play in how this goes.

The Future of Intelligence · Module 1 · Lesson 1

Capability Curves and What They Actually Mean

AI benchmarks have been breaking at an accelerating rate. Learning to read the numbers — and their limits — is the first skill this field requires.

When a system surpasses human performance on a benchmark, what has actually been demonstrated — and what has not?

On December 6, 2023, Google DeepMind published results showing that its Gemini Ultra model had scored 90.0% on the Massive Multitask Language Understanding benchmark (MMLU), a set of 57 academic subjects ranging from elementary mathematics to professional law. Human expert performance on MMLU is estimated at roughly 89.8%. The news cycle interpreted this as a milestone: a general-purpose AI had matched or exceeded broad human expert knowledge. Within a week, researchers at the University of Edinburgh published a counter-analysis. They found that Gemini Ultra's score dropped to approximately 62% when questions were reformatted in ways that changed surface features but not meaning. The benchmark had been passed. The capability it was supposed to measure had not been cleanly demonstrated.

This is not an isolated incident. It is the central interpretive challenge of following AI progress. Scores rise. What scores measure, and how robustly they measure it, is a separate and harder question.

What a Benchmark Actually Is

A benchmark is a fixed test — a curated dataset of questions, tasks, or challenges — against which a model's outputs are scored. Benchmarks exist because comparing systems requires a common reference. The alternative, open-ended human evaluation, is slow, expensive, and hard to replicate. Benchmarks solve a logistics problem. They do not necessarily solve the meaning problem.

Three dynamics recur in AI benchmarking history. First, saturation: a benchmark designed for one generation of models becomes too easy for the next. ImageNet, the image-classification competition that launched the deep learning era when AlexNet won it in 2012 with 84.7% top-5 accuracy, is now routinely solved at above 90% accuracy by models that still struggle with adversarial images any child would identify correctly. Second, dataset contamination: models trained on internet text often encounter benchmark questions in their training data, which inflates scores. Third, Goodhart's Law dynamics: once a benchmark becomes a target, pressure accumulates to optimize for it specifically, and that optimization can diverge from the underlying capability the benchmark was meant to proxy.

None of this means benchmarks are useless. It means they must be read as evidence, not verdicts.
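The second dynamic, dataset contamination, can be illustrated with a toy overlap check. This is a minimal sketch and not how any particular lab audits its training data: the function names, the example texts, and the choice of five-word n-grams are all assumptions made for illustration.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, corpus_text, n=5):
    """Fraction of the question's n-grams that also appear in the corpus text."""
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & word_ngrams(corpus_text, n)
    return len(shared) / len(question_grams)


# Hypothetical benchmark question and two hypothetical training documents.
question = "Which organelle is known as the powerhouse of the cell?"
scraped_page = "Quiz answers: which organelle is known as the powerhouse of the cell? The mitochondrion."
unrelated_page = "The railroad eliminated several occupational categories within three decades."

print(overlap_fraction(question, scraped_page))    # high overlap suggests possible contamination
print(overlap_fraction(question, unrelated_page))  # no shared n-grams
```

A high overlap does not prove the model memorized the answer, and a low overlap does not rule out paraphrased leakage. The check is evidence, not a verdict, which is the same standard the lesson applies to benchmark scores themselves.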

Real Case: MMLU and Its Limits

The MMLU benchmark was introduced by Dan Hendrycks and colleagues at UC Berkeley in 2020, specifically to test whether large language models had absorbed diverse factual and reasoning knowledge. GPT-3 scored 43.9% at launch. By early 2023, GPT-4 scored 86.4%. By late 2023, multiple models claimed scores at or above estimated human expert performance. Simultaneously, researchers demonstrated that models achieving these scores still failed systematically on multi-step reasoning variants of the same underlying questions — suggesting the benchmark was measuring pattern-matching to known answer formats more than deep comprehension.

Reading a Capability Curve

AI capability curves — graphs of benchmark performance over time — show consistent patterns that reward careful reading. Performance on most benchmarks follows an S-curve: slow initial progress, a rapid acceleration phase, then saturation near the ceiling. The acceleration phase is when media coverage spikes and comparisons to human performance become dramatic. The saturation phase is when the benchmark is replaced by a harder one, and the cycle restarts.
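The S-curve shape described above can be sketched with a logistic function. The sketch is illustrative only: the function name, the midpoint year, and the growth rate are hypothetical values chosen to show the shape, not fitted to any real benchmark.

```python
import math


def logistic_score(year, ceiling=100.0, midpoint=2021.0, rate=1.2):
    """Hypothetical benchmark score (percent) following a logistic S-curve."""
    return ceiling / (1.0 + math.exp(-rate * (year - midpoint)))


years = list(range(2017, 2026))
scores = [logistic_score(y) for y in years]

# Year-over-year gains: small early, largest around the midpoint, small near the ceiling.
gains = [later - earlier for earlier, later in zip(scores, scores[1:])]

for year, score in zip(years, scores):
    print(f"{year}: {score:5.1f}%")
```

Reading `gains` alongside `scores` shows why the acceleration phase attracts the headlines: the largest single-year jumps occur in the middle of the curve, while the final points sit close to the ceiling and reveal little about further progress.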

The most instructive recent example is the ARC-AGI benchmark, created by François Chollet and released in 2019. ARC tasks require solving novel visual-pattern puzzles that are trivially easy for most humans but on which, as of 2023, no large language model scored above roughly 30%. In December 2024, OpenAI announced that its o3 model, using a test-time compute scaling approach, had scored 75.7% on the semi-private evaluation set, and that a high-compute configuration had reached 87.5%. Chollet noted that this represented genuine progress but cautioned that o3 was spending compute equivalent to thousands of dollars per puzzle, whereas human solvers spend seconds. The capability was real; the efficiency gap was equally real.

Understanding a capability curve therefore requires four questions: What specific task is being measured? How was human performance established? What are the benchmark's known weaknesses? At what computational cost was the score achieved?
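One way to keep the four questions in view is to record each claim in a small structure that shows which questions are still unanswered. This is only a sketch: the class name, field names, and example claim are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkClaim:
    """One reported result, with a slot for each of the four reading questions."""
    headline: str
    task_measured: Optional[str] = None
    human_baseline_method: Optional[str] = None
    known_weaknesses: Optional[str] = None
    compute_cost: Optional[str] = None

    def open_questions(self):
        """Return the names of the questions that have no answer yet."""
        return [
            f.name
            for f in fields(self)
            if f.name != "headline" and getattr(self, f.name) is None
        ]


# Hypothetical claim with only the first question answered so far.
claim = BenchmarkClaim(
    headline="Model X matches experts on benchmark Y",
    task_measured="multiple-choice questions across academic subjects",
)
print(claim.open_questions())
```

A claim with any open questions is not necessarily wrong. It is simply not yet readable as evidence about capability.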

Scaling Laws: The Engine Behind the Curves

In January 2020, researchers at OpenAI — Jared Kaplan, Sam McCandlish, and colleagues — published a paper documenting what became known as neural scaling laws. They found that the performance of large language models improved in a remarkably predictable, power-law relationship with three variables: the number of model parameters, the size of the training dataset, and the amount of compute used. Double the compute, and performance improves by a predictable increment. The curve was smooth and consistent across orders of magnitude.
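The power-law relationship can be made concrete with a short sketch. It assumes a loss of the form a × compute^(−b); the constants `a` and `b` below are hypothetical placeholders, not values reported by Kaplan et al. Under a pure power law, each doubling of compute multiplies the loss by the same factor no matter where on the curve you start.

```python
def power_law_loss(compute, a=10.0, b=0.05):
    """Hypothetical loss as a power function of training compute (lower is better)."""
    return a * compute ** (-b)


for compute in (1e18, 1e20, 1e22):
    ratio = power_law_loss(2 * compute) / power_law_loss(compute)
    print(
        f"compute={compute:.0e}  loss={power_law_loss(compute):.3f}  "
        f"loss ratio after doubling={ratio:.4f}"
    )
```

The constant ratio across four orders of magnitude is what made the curve useful for planning: a lab could estimate the return on a larger training run before paying for it.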

This was a significant finding because it meant AI progress was, to a degree, engineerable: more resources in, more capability out, on a predictable schedule. It also meant that companies and researchers could plan investments around capability projections rather than relying entirely on algorithmic breakthroughs. The 2022 Chinchilla paper from DeepMind's Jordan Hoffmann and colleagues refined the picture: prior large models had been under-trained relative to their size. Optimal training requires scaling data and parameters together. GPT-4, released in March 2023 with undisclosed but estimated very large training runs, appeared to follow the refined scaling predictions closely.
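What "scaling data and parameters together" means in practice can be sketched with two widely quoted approximations. Neither appears in this lesson, and both are assumptions here: that training compute is roughly 6 × parameters × tokens, and that compute-optimal training uses on the order of 20 tokens per parameter. The function name and budgets are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # assumed rule of thumb, not an exact constant
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ~ 6 * N * D


def compute_optimal_split(compute_flops):
    """Split a compute budget into parameters N and tokens D with D = 20 * N."""
    # C = 6 * N * D = 6 * 20 * N**2, so N = sqrt(C / 120).
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e} FLOPs -> N~{n:.2e} params, D~{d:.2e} tokens")
```

Under these assumptions a hundredfold increase in compute calls for roughly ten times more parameters and ten times more data, rather than putting the whole increase into model size.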

The limits of scaling laws are now a primary research question. Some researchers argue performance on truly novel reasoning tasks will plateau regardless of scale. Others argue we have not yet found the ceiling. The honest answer is that nobody knows — which is itself a fact worth holding.

Scaling Law: The empirical observation that language model performance improves predictably as a power function of compute, parameters, and data, first documented systematically by Kaplan et al. at OpenAI in 2020.

Benchmark Saturation: The point at which a benchmark's ceiling is approached by current systems, rendering it unable to distinguish meaningfully between top performers or track further progress.

Goodhart's Law: When a measure becomes a target, it ceases to be a good measure. In AI, optimizing for a benchmark score can diverge from improving the underlying capability the benchmark was meant to proxy.

Emergent Capabilities: Surprise on the Curve

In 2022, researchers at Google Brain published a widely discussed paper on "emergent abilities" in large language models — capabilities that appeared abruptly at certain model scales rather than improving gradually. Three-digit arithmetic, chain-of-thought reasoning, and certain analogical tasks seemed to switch on near-discontinuously above threshold parameter counts. This implied that simply scaling could produce unexpected qualitative leaps, not just quantitative improvements.

A 2023 follow-up by Rylan Schaeffer and colleagues at Stanford challenged the emergence interpretation. They argued that apparent emergence was often an artifact of the evaluation metric: discontinuous-looking improvements on metrics like exact-match accuracy become smooth progressions when evaluated on continuous metrics. The underlying capability was improving smoothly; the measurement was hiding it.
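The metric-artifact argument can be reproduced with a toy calculation. The sketch assumes that an answer counts as an exact match only if every token is correct and that tokens are independent, a simplification made here for illustration. The per-token accuracies and the answer length are hypothetical.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """Chance that every token of an answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # hypothetical: each answer is 10 tokens long

# Hypothetical per-token accuracies for successively larger models (smooth improvement).
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.98]
exact = [exact_match_accuracy(p, ANSWER_LENGTH) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {em:.3f}")
```

The per-token column rises steadily, while the exact-match column stays near zero for the smaller models and then climbs steeply. The same underlying improvement looks gradual or abrupt depending on which column is plotted.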

Whether or not emergent capabilities are "real" in the strong sense, the episode illustrates the epistemological challenge of the field. Very smart researchers examining the same models and the same data reach substantially different conclusions. Reading the trajectory of AI requires holding multiple interpretations simultaneously and updating as evidence accumulates.

The Core Skill of This Module

Before you can reason about where AI is going, you need a reliable method for reading where it is. That method is not "trust the benchmark scores" and not "distrust all claims." It is: identify what was measured, how it was measured, what the measurement does and does not imply, and what economic or competitive pressures might be distorting the presentation of results. The rest of this module builds that method in four directions: historical analogues, economic drivers, measurement failures, and forecast methodology.

Lesson 1 Quiz

Capability Curves and What They Actually Mean — five questions

1. Google DeepMind's Gemini Ultra scored above human expert level on MMLU in December 2023. Researchers at the University of Edinburgh then found that the score dropped to approximately 62% when they did what?

Correct. The Edinburgh study reformatted MMLU questions while preserving their meaning. The score collapse illustrated that Gemini Ultra was sensitive to surface presentation, suggesting pattern-matching to known answer formats rather than robust comprehension.

Not quite. The Edinburgh counter-analysis reformatted questions at the surface level — changing phrasing and format while keeping the underlying meaning intact — and found the score dropped dramatically. This is the canonical example of benchmark fragility used in this lesson.

2. The 2020 OpenAI scaling laws paper, authored by Kaplan, McCandlish, and colleagues, found that language model performance improved in what type of relationship with compute, parameters, and data?

Correct. The scaling laws paper documented a power-law relationship: performance improved predictably as a function of compute, parameters, and data, across orders of magnitude. This made AI capability, to a degree, engineerable and plannable.

The relationship documented was a power law — smooth and predictable across many orders of magnitude. This was significant because it meant capability gains were engineerable: invest more resources, receive predictable returns. Review the Scaling Laws section of Lesson 1.

3. The 2022 DeepMind Chinchilla paper, by Hoffmann and colleagues, corrected a key mistake in how prior large models had been trained. What was that mistake?

Correct. Chinchilla showed that prior large models — including GPT-3 — had been under-trained relative to their parameter counts. The optimal approach scales training data proportionally with model size, not just parameters alone.

Chinchilla's key finding was under-training: prior models were too large relative to how much data they were trained on. Optimal compute efficiency requires scaling data and parameters together. This refined the original Kaplan scaling laws significantly.

4. François Chollet's ARC-AGI benchmark was notable partly because of the cost at which OpenAI's o3 achieved its high score in 2024. What did Chollet identify as the key concern about o3's performance?

Correct. Chollet acknowledged o3's score as genuine progress but flagged the efficiency gap: the model was spending enormous compute per puzzle while human solvers required seconds. A capability demonstrated at very high cost is a different capability than one demonstrated efficiently.

The core concern Chollet raised was efficiency: o3's high scores required compute costs equivalent to thousands of dollars per puzzle. Human solvers take seconds. This gap between capability and efficiency is an important dimension of reading benchmark results accurately.

5. Rylan Schaeffer and colleagues at Stanford argued in 2023 that "emergent abilities" in large language models were often an artifact of what?

Correct. Schaeffer et al. argued that apparent emergence was a measurement artifact: exact-match accuracy hides smooth underlying improvement, making gradual gains appear as sudden capability switches. The capability was improving continuously; the metric was not capturing it continuously.

The Stanford argument was about evaluation metrics: exact-match accuracy makes smooth improvements look like discontinuous jumps. When evaluated on continuous metrics, the "emergent" capabilities showed gradual improvement, not sudden appearance. Measurement choices shape what we see.

Lab 1 — Reading a Benchmark Report

Practice applying the four benchmark-reading questions to real AI capability claims.

Your Task

You have encountered a news headline: "New AI Model Beats Human Experts on Medical Licensing Exam." Before accepting or rejecting this claim, you need to apply the four benchmark-reading questions from Lesson 1. Use this lab to work through them with an AI guide.

The assistant will ask you to apply each question in turn, give feedback on your reasoning, and push back if your interpretation is too credulous or too dismissive. Complete at least three substantive exchanges to finish the lab.

Start by telling the assistant: which of the four benchmark-reading questions you think is most important to ask first about this medical exam claim — and why.

Lab Assistant

Benchmark Analysis

Welcome to Lab 1. The headline is: "New AI Model Beats Human Experts on Medical Licensing Exam."

From Lesson 1, you have four questions for reading any benchmark claim: (1) What specific task is being measured? (2) How was human performance established? (3) What are the benchmark's known weaknesses? (4) At what computational cost was the score achieved?

Which of these do you think deserves your first attention when you see this particular claim — and what's your reasoning?

The Future of Intelligence · Module 1 · Lesson 2

Historical Analogues: What Prior Disruptions Can and Cannot Tell Us

Every transformative technology arrived with failed predictions on both sides. Understanding why teaches more than the predictions themselves.

Which historical technological transitions most accurately parallel the current AI moment — and where do the analogies break down?

In 1900, the economist David A. Wells published a retrospective analysis of American economic transformation since 1870. He documented that the railroad had, within three decades, eliminated entire occupational categories — stagecoach drivers, canal operators, certain categories of freight-wagon teamsters — while creating new ones that had not existed before: locomotive engineers, telegraph operators coordinating rail traffic, the hotel and restaurant workers serving rail hubs. The net employment effect was positive, but the distributional effect was sharp: specific communities built around pre-railroad transport were economically devastated, while other communities grew rapidly. Wells noted that nobody had predicted the specific geography of winners and losers in advance, even though the general direction of travel had been clear.

This observation — that general trajectories can be readable while specific distributional consequences remain opaque until after the fact — is the central lesson historical analogues offer for thinking about AI.

The Electricity Analogy: Infrastructure Before Application

The most commonly cited historical parallel for AI is electrification. The comparison is instructive in both what it illuminates and what it obscures. The illuminating part: electricity, like AI, was a general-purpose technology — one capable of transforming productivity across nearly every industry rather than being confined to a single application domain. Economist Paul David's influential 1990 paper, "The Dynamo and the Computer," documented that electrification did not produce measurable economy-wide productivity gains for roughly forty years after Edison's Pearl Street Station opened in 1882. Factories had to physically redesign their layouts, workers had to develop new skills, regulatory and safety frameworks had to develop, and secondary industries supplying electrical components had to scale. The productivity gains arrived in the 1920s, driven by manufacturers who had grown up in the electrical era and thought in terms of it from the start.

David's argument addressed what had become known as the "productivity paradox," a term first applied to computing in the 1980s and 1990s, when widespread computerization was not yet showing up in aggregate productivity statistics. Robert Solow's 1987 quip, "You can see the computer age everywhere but in the productivity statistics," articulated the same pattern. The Solow paradox eventually resolved: economy-wide computing productivity gains became visible in US statistics beginning around 1995, roughly thirty years after mainframe computing began spreading through corporate America.

If the electrification and computing analogues hold for AI, the implication is that transformative productivity effects may lag the technology's visible deployment by a decade or more, and will require complementary investments in physical infrastructure, organizational redesign, and workforce reskilling that are not captured in the AI systems themselves.

Real Case: The Solow Paradox Resolved

Erik Brynjolfsson at MIT documented the resolution of the computing productivity paradox in a series of papers beginning in the late 1990s. He found that firms which invested heavily in computing AND undertook organizational changes — flatter hierarchies, worker reskilling, redesigned workflows — captured large productivity gains. Firms that installed computers without complementary organizational change did not. The technology was necessary but not sufficient. Brynjolfsson argues explicitly that the same dynamic will apply to AI: the gains will be captured by organizations that redesign around AI capabilities, not those that layer AI onto existing workflows.

Where the Electricity Analogy Breaks Down

Electricity did not generate electricity. Its outputs could not replace the engineers designing its infrastructure or the economists analyzing its effects. Large language models can, at least partially, assist in writing code for AI systems, generating training data, and analyzing AI research papers. This self-referential quality has no clean analogue in prior general-purpose technology transitions and makes simple application of historical timelines unreliable.

The second disanalogy is speed of deployment. The commercial telegraph took roughly fifteen years to wire the United States after Morse's 1844 Washington-to-Baltimore demonstration. The telephone took decades to reach majority household penetration. ChatGPT reached one million users in five days and one hundred million in two months. The deployment velocity of software-based AI is categorically faster than any prior general-purpose technology because it requires no physical installation at the point of use. This does not necessarily mean economic effects arrive faster — the Brynjolfsson complementarity argument still applies — but it does mean the period of public awareness and competitive response is compressed dramatically.

A third disanalogy involves the nature of what is being automated. Prior general-purpose technologies primarily substituted for physical labor or for specific, well-defined cognitive tasks (arithmetic calculation, data storage and retrieval). Current large language models engage in tasks — writing, legal analysis, medical diagnosis, code generation, scientific literature review — that had historically been considered definitionally cognitive, requiring judgment and interpretation. The historical precedents for automating judgment-intensive work are sparser and less conclusive.

The Printing Press: Displacement of a Knowledge Monopoly

A second historical analogue worth examining is Gutenberg's printing press, introduced in Europe around 1440. Prior to the press, book production was controlled by monastic scriptoria and a small number of secular workshops. Literacy was correlated tightly with this production monopoly: those who produced books largely determined who had access to knowledge. The press did not immediately democratize literacy — that took roughly two centuries, and required the Reformation, the development of vernacular literatures, and significant changes in educational institutions. But it did immediately and dramatically disrupt the economics of knowledge production: within fifty years, Venice alone had over 150 printing establishments, and the price of books fell by an estimated 80%.

The parallel to AI and knowledge work is that a technology which dramatically reduces the marginal cost of producing a specific type of output — in 1450, copied text; in 2024, written, analytical, and coded content — does not immediately restructure society, but it does immediately disrupt the economics of that output's production. The institutions, credentials, and economic arrangements built around the scarcity of that production face pressure that compounds over time.

General-Purpose Technology: A technology capable of transforming productivity across many industries rather than a single application domain. Economists identify electricity, the steam engine, and information technology as canonical examples. AI is widely argued to be the fourth.

Productivity Paradox: The observed lag between a general-purpose technology's deployment and its appearance in aggregate productivity statistics, first documented for computers by Robert Solow in 1987 and explained by Erik Brynjolfsson as a function of required complementary investments.

Complementary Investment: Organizational, human-capital, and infrastructure investments required alongside a new technology for its productivity gains to materialize. Brynjolfsson's research shows these are often larger than the technology investment itself.

Using Analogies Well

The correct use of historical analogies is not to predict AI's future by substituting it into a prior technology's timeline. It is to identify mechanisms — the productivity paradox, complementarity requirements, knowledge-monopoly disruption, distributional unevenness — that have recurred across multiple transitions and ask whether those mechanisms are present now. Where they are present, the analogies increase your confidence. Where they are absent or inverted, the analogues signal caution.

Lesson 2 Quiz

Historical Analogues — five questions

1. Economist Paul David's 1990 paper "The Dynamo and the Computer" documented that electrification did not produce measurable economy-wide productivity gains for roughly how long after Edison's Pearl Street Station opened in 1882?

Correct. David documented a roughly forty-year lag between electrification's deployment and its appearance in aggregate productivity statistics. The gains arrived in the 1920s, driven by manufacturers who had grown up in the electrical era. This became the template for the "productivity paradox" argument.

David documented approximately forty years of lag before electrification showed up in productivity statistics. The gains arrived in the 1920s, about four decades after Pearl Street Station. David used this history to explain the productivity paradox that Solow had pointed to in computing in 1987.

2. Robert Solow's famous 1987 quip about computers — "You can see the computer age everywhere but in the productivity statistics" — described what phenomenon?

Correct. Solow observed that despite computing's visible spread through corporate America, it was not yet showing up in aggregate productivity statistics — the same pattern David identified for electricity. The paradox resolved around 1995 when economy-wide productivity gains became visible.

Solow was describing the productivity paradox: computers were everywhere but not yet showing up in aggregate productivity statistics. This mirrored David's finding about electrification. The paradox resolved in US data around 1995, roughly thirty years after mainframe computing spread through large corporations.

3. Erik Brynjolfsson's research on the computing productivity paradox found that firms which captured large productivity gains shared what specific characteristic beyond simply investing in computers?

Correct. Brynjolfsson found that technology alone was insufficient. Firms that captured gains combined computing investment with organizational redesign — flatter structures, reskilled workers, and redesigned workflows. Firms that installed computers without complementary change did not see the gains. He argues the same will apply to AI.

Brynjolfsson's key finding was complementarity: computers plus organizational change produced gains; computers alone did not. Firms needed to redesign around the technology — flatter hierarchies, reskilled workers, new workflows. He argues AI will require the same complementary investments to produce economy-wide gains.

4. The lesson identifies a key way in which AI differs from electricity as a general-purpose technology analogy. Which of the following is that disanalogy?

Correct. Large language models can assist in writing code for AI systems, generating training data, and analyzing AI research — a self-referential quality with no clean historical analogue. Electricity did not generate electricity or help design power plants. This property makes simple analogical timelines unreliable.

The key disanalogy identified in this lesson is self-referentiality: AI outputs can feed back into AI development in ways electricity could not. This collapses simple historical timeline analogies. Review the "Where the Electricity Analogy Breaks Down" section.

5. According to the lesson, within how many years of the introduction of Gutenberg's press around 1440 did Venice alone have over 150 printing establishments, and by how much had book prices fallen?

Correct. Within fifty years of Gutenberg's press, Venice had over 150 printing establishments and book prices had fallen by an estimated 80%. The lesson uses this to illustrate how dramatically reducing the marginal cost of producing a type of output disrupts the economics of that output — even before society more broadly restructures.

The figures from the lesson are fifty years and 80% price reduction. The printing press example illustrates the economics of marginal cost reduction: once the cost of producing a type of output collapses, the institutions built around its scarcity face immediate economic pressure, even if broader social restructuring takes much longer.

Lab 2 — Applying Historical Analogues

Work through the strengths and limits of a specific historical analogy for AI.

Your Task

You will analyze the analogy between AI and the printing press. The assistant will guide you through identifying which mechanisms from the printing press transition are present in AI's current situation, and which are absent or inverted. This is the "using analogies well" skill from Lesson 2.

Engage with at least three substantive exchanges. The assistant will push back on oversimplified mappings and ask you to identify specific mechanisms.

Start by telling the assistant one specific way the printing press analogy illuminates AI's current situation — and one way it misleads.

Lab Assistant

Historical Analysis

Welcome to Lab 2. We're examining the printing press as a historical analogue for AI. From Lesson 2, you know that the press around 1440 disrupted knowledge-production economics almost immediately — Venice had 150+ printing establishments within fifty years, book prices fell ~80% — but broader social restructuring (literacy, the Reformation, educational institutions) took roughly two centuries.

The lesson also gives you a method: identify mechanisms that have recurred across transitions and ask whether those mechanisms are present now. Where present, analogies are useful. Where absent or inverted, they signal caution.

Give me one specific way the printing press analogy genuinely illuminates AI's situation today, and one way it misleads or breaks down.

The Future of Intelligence · Module 1 · Lesson 3

Economic Drivers: Who Is Paying for This and Why

The trajectory of AI is not only a scientific phenomenon. It is a capital allocation decision made by specific actors with specific incentives.

What economic forces are actually driving AI investment — and what do those forces imply about where the technology will and will not develop?

By the first quarter of 2024, Microsoft had reportedly committed about $13 billion in investment to OpenAI. Google had agreed to invest up to $2 billion in Anthropic, following an earlier investment of roughly $300 million. Amazon had announced up to $4 billion for Anthropic. Meta disclosed it was spending roughly $35 billion in capital expenditure in 2024, primarily on AI infrastructure. These are not research grants. They are strategic investments by companies whose core revenue (advertising, cloud computing, enterprise software, e-commerce) faces potential disruption from AI capabilities, and who are simultaneously positioned to capture value if those capabilities become infrastructure.

Understanding the trajectory of AI requires understanding what these capital flows are and are not optimizing for — which capabilities get funded, which problems get prioritized, and which gaps get left.

The Cloud-AI Investment Cycle

The three largest cloud providers — Amazon Web Services, Microsoft Azure, and Google Cloud — collectively control an estimated 65% of global cloud infrastructure as of 2024. Each of these providers benefits from AI in two distinct and compounding ways. First, AI model training and inference require compute, and they sell compute. Second, AI capabilities integrated into their cloud platforms increase switching costs and create new billable services. The incentive to accelerate AI capability development is therefore extremely strong among these actors independently of any interest in AI itself.

This investment structure shapes the research agenda in observable ways. Applications that can be delivered as cloud services — API-accessible models, enterprise software integrations, code generation tools — receive enormous resources. Applications that require on-device inference or that would reduce dependence on cloud infrastructure receive comparatively less. The AI landscape in 2024 is therefore partly a map of what is profitable to build as a cloud service, not purely a map of what is technically achievable or socially useful.

Microsoft's $13 billion OpenAI investment was accompanied by a deal giving Microsoft exclusive cloud-provision rights for OpenAI's models. The technical and commercial decisions are entangled: the models that exist are partly the models that Azure can most profitably host.

Real Case: The GPU Chokepoint

NVIDIA's H100 GPU, the primary hardware for large-model training as of 2023–2024, was priced at approximately $30,000–$40,000 per unit. A single large training run for a frontier model requires thousands of them. NVIDIA reported in its second-quarter earnings, released in August 2024 (fiscal Q2 2025), that data-center revenue, primarily from AI chips, had grown 154% year-over-year, reaching $26.3 billion in a single quarter. This concentration of critical infrastructure in a single hardware supplier creates a dependency that shapes the entire field. Which organizations can afford to train frontier models, which research directions require prohibitive compute, and which geographies can participate in leading AI development are all partly determined by access to NVIDIA hardware and by US export controls on that hardware.
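To make the scale concrete, the hardware arithmetic can be sketched as follows. The unit prices are the range quoted above; the cluster sizes are hypothetical round numbers chosen for illustration, not reported figures for any particular model. The calculation also ignores networking, power, facilities, and staff, so it is a floor on cost rather than an estimate of it.

```python
# Rough hardware-only cost of a GPU cluster, using the H100 price range quoted above.
# Cluster sizes are hypothetical; real cluster sizes and negotiated prices vary.

H100_PRICE_LOW_USD = 30_000
H100_PRICE_HIGH_USD = 40_000


def cluster_cost_range(gpu_count: int) -> tuple:
    """Return (low, high) hardware cost in USD for a cluster of gpu_count GPUs."""
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    return gpu_count * H100_PRICE_LOW_USD, gpu_count * H100_PRICE_HIGH_USD


if __name__ == "__main__":
    for gpus in (1_000, 10_000, 25_000):  # hypothetical cluster sizes
        low, high = cluster_cost_range(gpus)
        print(f"{gpus:>6,} GPUs: ${low / 1e6:,.0f}M to ${high / 1e6:,.0f}M")
```

At 10,000 GPUs the chips alone come to between $300 million and $400 million, which is why the list of organizations able to train at the frontier is short.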

The Labor-Cost Substitution Argument

A substantial fraction of AI investment is explicitly justified by the expectation of labor-cost reduction. McKinsey's 2023 report on generative AI estimated that the technology could add between $2.6 trillion and $4.4 trillion in annual value to the global economy, largely through productivity gains in existing work activities. Goldman Sachs's March 2023 analysis estimated that the equivalent of 300 million full-time jobs were exposed to automation. These figures drive investment decisions: if AI can substitute for a significant fraction of knowledge-work labor at a fraction of its cost, the return on AI investment is potentially enormous.

The substitution argument has important nuances that get compressed in the headline figures. First, "exposed to automation" is not the same as "will be automated": automation potential must overcome switching costs, regulatory barriers, user-acceptance requirements, and complementarity demands. Second, partial automation — AI handling some tasks within a job while humans handle others — often increases demand for the remaining human tasks rather than eliminating the role. Third, the historical record on automation consistently shows job displacement concentrated in specific occupations and geographies rather than distributed uniformly, even when aggregate net employment effects are neutral or positive.

What the labor-cost argument does reliably predict is where AI investment will concentrate: on tasks that are high-volume, relatively standardized, performed by expensive knowledge workers, and currently executed without significant regulatory friction. Legal document review, software coding, customer service, medical imaging interpretation, and content generation each satisfy most of these criteria and each attract disproportionate AI investment.

The National Security Dimension

Since 2022, AI investment has increasingly been shaped by explicit national security and geopolitical considerations alongside commercial ones. The US government's October 2022 export controls on advanced semiconductor technology to China — including A100 and H100 GPUs — were explicitly justified as necessary to prevent Chinese development of frontier AI models for military applications. China's response included significant state investment in domestic chip design (Huawei's Ascend series) and AI model development (Baidu's ERNIE, Alibaba's Qwen, and others).

The National Security Commission on Artificial Intelligence, chaired by former Google CEO Eric Schmidt, published a 756-page report in March 2021 recommending that federal spending on non-defense AI research and development rise to $32 billion per year by 2026 to maintain US technological leadership. The CHIPS and Science Act of 2022, signed by President Biden in August of that year, allocated $52 billion for domestic semiconductor manufacturing, explicitly linking chip production capacity to AI capability and national security.

The implication for reading the AI trajectory is that investment levels and research priorities are now partly determined by geopolitical competition, not only commercial incentives. This tends to accelerate development of AI capabilities perceived as strategically important (autonomous systems, surveillance, logistics optimization, cybersecurity), while safety research, interpretability, and applications that are socially beneficial but not competitively strategic risk being underfunded.

General-Purpose Technology Rent: The economic value captured by providers of infrastructure underlying a general-purpose technology. In AI, cloud providers and chip manufacturers capture rents from the entire AI ecosystem regardless of which specific applications succeed.

Export Controls (AI Context): US government restrictions, significantly expanded in October 2022, limiting the export of advanced semiconductors and semiconductor manufacturing equipment to China and other specified countries, explicitly justified by AI and military applications.

What Capital Flows Tell You

Investment concentrations are leading indicators of where capability will develop, but they are lagging indicators of social impact. The gap between "where money is going" and "what happens to workers and institutions" is where most AI analysis goes wrong. Following the capital tells you what will be built. It does not tell you what will be adopted at scale, what will encounter resistance, or what second-order effects will follow adoption.

Lesson 3 Quiz

Economic Drivers — five questions

1. Microsoft committed approximately how much investment to OpenAI, and what additional commercial arrangement accompanied this investment?

Correct. Microsoft's $13 billion investment was paired with exclusive cloud-provision rights, giving Azure the infrastructure contract for OpenAI's models. This illustrates how commercial and technical decisions are entangled: the models that exist are partly the models that Azure can most profitably host.

The figures are $13 billion and exclusive Azure cloud-provision rights. This pairing illustrates a core point from Lesson 3: AI investment is entangled with cloud infrastructure economics. The research and deployment decisions are not purely technical — they reflect what is profitable for specific cloud providers to host.

2. NVIDIA's Q2 2024 earnings showed data-center revenue — primarily AI chips — had grown how much year-over-year, and to what quarterly figure?

Correct. 154% year-over-year growth to $26.3 billion in a single quarter. The lesson uses this to illustrate the GPU chokepoint: critical AI infrastructure concentrated in a single hardware supplier, with access to that hardware partly determining which organizations can train frontier models.

The figures are 154% year-over-year growth and $26.3 billion in a single quarter. This concentration matters because access to NVIDIA hardware partly determines which organizations can train frontier models, which research directions are affordable, and which geographies can participate in leading AI development.

3. McKinsey's 2023 generative AI report estimated that the technology could add how much in annual value to the global economy?

Correct. McKinsey estimated that generative AI could add $2.6 trillion to $4.4 trillion in annual economic value. The lesson cautions that "exposed to automation" is not the same as "will be automated": switching costs, regulatory barriers, user-acceptance requirements, and complementarity demands all intervene.

McKinsey's estimate was $2.6 trillion to $4.4 trillion. The lesson uses this figure to illustrate the labor-cost substitution argument that drives AI investment, while cautioning that "exposed to automation" overstates actual displacement by ignoring switching costs, regulatory barriers, and complementarity demands.

4. The US government's October 2022 export controls restricted which specific semiconductor products' export to China, and what official justification was given?

Correct. The October 2022 controls targeted A100 and H100 GPUs with the explicit justification of preventing Chinese military AI development. This illustrates the national security dimension of AI investment: geopolitical competition is now partly determining which capabilities get prioritized and who can access frontier AI infrastructure.

The controls specifically targeted A100 and H100 GPUs, justified by preventing Chinese military AI development. This introduced a national security dimension to AI investment that accelerates some capabilities (autonomous systems, surveillance) while potentially underinvesting in safety and socially beneficial applications.

5. The lesson argues that investment concentrations are leading indicators of where capability will develop but lagging indicators of social impact. What does this mean in practice?

Correct. Investment concentrations predict what gets built. They do not predict adoption rates, user acceptance, regulatory outcomes, or second-order social effects — which is where most AI analysis goes wrong. The capital flow and the social trajectory are related but distinct signals.

The distinction the lesson makes is between "what will be built" (which investment predicts reasonably well) and "what will happen when it's deployed" — adoption at scale, resistance, and second-order effects. Capital flows are a leading indicator for capability development but a lagging or misleading indicator for social impact.

Lab 3 — Mapping Investment to Capability Gaps

Trace the logic from economic incentive to likely AI development direction.

Your Task

Using what you know about cloud-provider incentives, labor-cost substitution pressures, and national security drivers from Lesson 3, you will identify a domain of AI capability that is likely underfunded relative to its social importance — and explain the economic logic for why it is underfunded.

The assistant will ask you to be specific about which economic incentive structures are absent, and will push back if your analysis is too vague. Complete at least three substantive exchanges.

Begin by naming one domain of AI development you think is underfunded relative to its social value — and identify specifically which of the three economic drivers from Lesson 3 fail to create investment pressure there.

Lab Assistant

Investment Analysis

Welcome to Lab 3. From Lesson 3, you have three economic drivers shaping AI investment: (1) cloud-provider infrastructure incentives — they fund what generates billable compute and reduces switching costs; (2) labor-cost substitution pressures — investment concentrates on high-volume, expensive knowledge-work tasks with low regulatory friction; (3) national security and geopolitical competition — spending accelerates capabilities perceived as strategically important.

Your task: identify a domain of AI development you think is underfunded relative to its social value, and explain specifically which of these three economic drivers are absent or weak for that domain. Be concrete — name the domain and trace the incentive logic.

The Future of Intelligence · Module 1 · Lesson 4

Forecasting Under Uncertainty: What Good Prediction Actually Looks Like

The AI field is full of confident predictions. The useful skill is not making predictions — it is calibrating confidence to evidence and updating as data arrives.

What distinguishes a well-calibrated AI forecast from a confident-sounding guess — and how do you tell the difference when you encounter one?

In 2016, Geoffrey Hinton, one of the three researchers who shared the 2018 Turing Award for foundational contributions to deep learning, said that people should stop training radiologists because, within five years, neural networks would outperform them at reading medical images. In 2022, he revised this view: deep learning had indeed achieved impressive diagnostic accuracy on specific imaging tasks, but radiologists were still very much employed, partly because the task of radiology encompasses far more than pattern recognition in a single image, and partly because deployment in clinical settings involves regulatory, liability, and workflow factors that technology performance alone cannot resolve. Hinton, one of the most knowledgeable people in the world about deep learning, had made a confidently stated prediction that proved substantially wrong, not because the technical capability failed to arrive, but because capability and deployment are different questions with different timelines.

Why AI Forecasts Fail

AI forecasting has a documented record of consistent errors in identifiable directions. The first is scope substitution: a capability is demonstrated in a restricted domain, and the forecast assumes unrestricted generalization. Deep learning's success on curated ImageNet images was forecast to generalize to medical images in uncontrolled clinical conditions; the transition required years of additional work on robustness, distribution shift, and annotation quality. Benchmark performance is measured in controlled conditions; deployment happens in the wild. The gap between them is systematically underestimated.

The second error is ignoring the adoption stack. A technology requires not only technical capability but a full stack of complementary conditions before it reaches scale deployment: regulatory approval (especially in medicine, law, and finance), liability frameworks, user-trust development, workflow integration, training of practitioners, and sometimes physical infrastructure. Each layer has its own timeline, and they must align. GPT-4 achieved impressive medical exam scores in 2023; as of mid-2024, no AI system had received FDA clearance as a general diagnostic tool, because regulatory pathways for AI-based clinical decision support are still being developed.

The third error is single-point thinking: presenting a future state as if it will arrive uniformly, when in practice adoption is geographically, economically, and institutionally uneven. The same technology may transform a well-resourced urban hospital and have no effect on a rural clinic lacking the connectivity, IT infrastructure, and trained staff to implement it.

Real Case: Superforecasting Applied to AI

Philip Tetlock's Good Judgment Project, which has tracked forecaster accuracy across thousands of geopolitical and economic questions since 2011, found that the best forecasters share specific habits: they break questions into components, assign explicit numerical probabilities, update frequently as new evidence arrives, and actively seek disconfirmation of their current views. Tetlock's 2023 AI-focused forecasting tournament, Forecasting AI Progress (FAIP), found that even trained superforecasters made systematically overconfident predictions about near-term AI capabilities, but that the overconfidence fell significantly when forecasters were required to state their model of adoption explicitly, not just their model of capability. Writing down an adoption model forced them to confront the adoption stack problem.

The Structure of a Well-Calibrated AI Forecast

A well-calibrated AI forecast has five components, each of which can be evaluated separately. First, it specifies the capability claim precisely: not "AI will handle legal work" but "AI will perform first-pass contract review for standard commercial agreements with fewer errors than junior associates in firms using current proofreading workflows." Second, it specifies the measurement method: how will we know if this capability has arrived? Third, it specifies the adoption conditions: what regulatory, liability, and workflow prerequisites must be satisfied before the capability translates to deployment at scale? Fourth, it gives an explicit probability and timeframe: not "soon" but "70% probability within five years." Fifth, it identifies the update triggers: what specific evidence would cause the forecast to be revised upward or downward?

Forecasts that lack any of these five components are not predictions in any meaningful sense. They are assertions about the future dressed in prediction language. The distinction matters enormously when deciding whether to act — whether to retrain, whether to invest, whether to regulate — on the basis of the forecast.
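One way to make the five-component test mechanical is to treat a forecast as a record with five required fields and check which are empty. The sketch below is illustrative only: the class and field names are invented for this example, and the adoption conditions, measurement method, and update triggers in the second forecast are hypothetical fill-ins rather than claims from the lesson. It checks presence and basic validity, not quality, so judging whether a capability claim is precise enough still takes a human reader.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Forecast:
    """A forecast split into the five components described above.

    Field names are illustrative, not part of any standard schema.
    """

    capability_claim: str = ""
    measurement_method: str = ""
    adoption_conditions: List[str] = field(default_factory=list)
    probability: Optional[float] = None  # e.g. 0.7 for "70%"
    horizon_years: Optional[int] = None  # e.g. 5 for "within five years"
    update_triggers: List[str] = field(default_factory=list)

    def missing_components(self) -> List[str]:
        """Return the names of components that are absent or malformed."""
        missing = []
        if not self.capability_claim.strip():
            missing.append("capability claim")
        if not self.measurement_method.strip():
            missing.append("measurement method")
        if not self.adoption_conditions:
            missing.append("adoption conditions")
        has_probability = (
            self.probability is not None and 0.0 <= self.probability <= 1.0
        )
        has_horizon = self.horizon_years is not None and self.horizon_years > 0
        if not (has_probability and has_horizon):
            missing.append("probability and timeframe")
        if not self.update_triggers:
            missing.append("update triggers")
        return missing


# A headline-style assertion: only the first field is filled in.
assertion = Forecast(capability_claim="AI will handle legal work")

# A fuller forecast built on the lesson's contract-review example.
# Everything except the claim and the 70%/five-year figures is hypothetical.
structured = Forecast(
    capability_claim=(
        "AI performs first-pass review of standard commercial agreements "
        "with fewer errors than junior associates"
    ),
    measurement_method="Blind error counts on the same set of contracts",
    adoption_conditions=["Malpractice insurers accept AI-assisted review"],
    probability=0.7,
    horizon_years=5,
    update_triggers=["A published head-to-head study reports higher AI error rates"],
)

if __name__ == "__main__":
    print("assertion is missing:", assertion.missing_components())
    print("structured is missing:", structured.missing_components())
```

The first forecast fails on four of the five components; the second passes the structural check, which says only that it is specific enough to be evaluated and later scored, not that it is right.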

Where Forecasters Systematically Agree (and Why That Matters)

Despite the general unreliability of AI forecasts, there are domains where careful forecasters with different methodological approaches converge — and that convergence is itself evidence worth noting. The AI Impacts survey of machine learning researchers, conducted in 2022 with responses from 738 researchers, found median estimates of a 50% probability of "high-level machine intelligence" — defined as AI able to perform almost all cognitive tasks humans can perform — by 2059, with a 10% probability by 2028 and a 90% probability by 2100. These estimates were substantially earlier than a 2016 version of the same survey, reflecting researcher updating based on observed progress.

On nearer-term questions, there is stronger convergence. The 2022 AI Impacts survey found median estimates of roughly 2025–2027 for AI systems competitive with human performance on a wide range of specific task benchmarks, a range that has proved roughly accurate given the performance of GPT-4 in 2023 and of subsequent models. There is also convergence on the proposition that AI will not progress in a smooth linear fashion: most researchers expect a combination of continued scaling gains, periods of plateau, and potentially transformative capability jumps from new architectural approaches, though the timing and nature of those jumps remain contested.

Scope Substitution: A forecasting error in which a capability demonstrated in a restricted domain is assumed to generalize to unrestricted conditions without additional work. Benchmark performance in controlled settings does not automatically translate to deployment performance in the wild.

Adoption Stack: The full set of complementary conditions (regulatory approval, liability frameworks, user trust, workflow integration, practitioner training, and sometimes physical infrastructure) required for a technology to reach scale deployment. Each layer has its own timeline.

Calibration: A property of forecasters and forecasts: a well-calibrated forecaster's 70% probability predictions come true roughly 70% of the time. Calibration is measurable and trainable, and is distinct from accuracy on any single prediction.
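Because calibration is measurable, it can be checked with a few lines of code once forecasts have resolved. The sketch below groups forecasts by stated probability and compares each group with how often the events actually happened, then computes a Brier score (the mean squared gap between stated probability and outcome, where lower is better). The track record is invented for illustration, and rounding to the nearest 10% is one assumed bucketing choice among many.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (stated probability, outcome), where outcome is 1 if the event happened, else 0
Record = Tuple[float, int]


def brier_score(records: List[Record]) -> float:
    """Mean squared difference between stated probability and outcome."""
    if not records:
        raise ValueError("need at least one resolved forecast")
    return sum((p - outcome) ** 2 for p, outcome in records) / len(records)


def calibration_table(records: List[Record]) -> Dict[float, Tuple[int, float]]:
    """Map each probability bucket to (number of forecasts, observed frequency)."""
    buckets = defaultdict(list)
    for p, outcome in records:
        buckets[round(p, 1)].append(outcome)  # bucket to the nearest 10%
    return {
        bucket: (len(outcomes), sum(outcomes) / len(outcomes))
        for bucket, outcomes in sorted(buckets.items())
    }


# Hypothetical track record: ten forecasts at 70%, five forecasts at 90%.
records = (
    [(0.7, 1)] * 7 + [(0.7, 0)] * 3  # 7 of 10 came true
    + [(0.9, 1)] * 3 + [(0.9, 0)] * 2  # only 3 of 5 came true
)

if __name__ == "__main__":
    for bucket, (count, observed) in calibration_table(records).items():
        print(f"stated {bucket:.0%}: {count} forecasts, {observed:.0%} came true")
    print(f"Brier score: {brier_score(records):.3f}")
```

In this invented record the 70% forecasts are well calibrated, while the 90% forecasts come true only 60% of the time, which is the signature of overconfidence. The overall Brier score works out to 0.25, the same score a forecaster would earn by answering 50% on every question, so the confident misses have erased the value of the well-calibrated group.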

The Takeaway for This Module

Reading the trajectory of AI requires four parallel skills: reading benchmark claims accurately (Lesson 1), understanding which historical mechanisms apply and which do not (Lesson 2), tracing investment logic to see what will and will not be built (Lesson 3), and distinguishing well-calibrated forecasts from confident assertions (Lesson 4). None of these skills produces certainty. Together, they produce significantly better judgment than naive optimism, naive pessimism, or the reflex to defer to whoever sounds most confident.

Lesson 4 Quiz

Forecasting Under Uncertainty — five questions

1. Geoffrey Hinton's 2016 prediction about radiologists proved substantially wrong despite correct capability predictions. What was the specific reason his forecast failed, according to this lesson?

Correct. Hinton's error was conflating capability arrival with deployment arrival. Deep learning did achieve impressive imaging accuracy. But radiology encompasses more than single-image pattern recognition, and deployment in clinical settings requires regulatory clearance, liability frameworks, and workflow integration that technical capability cannot bypass.

The capability Hinton predicted did substantially arrive — deep learning imaging performance became impressive. The error was treating capability as equivalent to deployment. The adoption stack (regulatory, liability, workflow) has its own timeline, independent of technical performance. This is the lesson's central example of scope substitution and adoption-stack blindness.

2. The lesson identifies three systematic forecasting errors. Which of the following is NOT one of the three errors described?

Correct. Funding bias is not one of the three errors described in Lesson 4. The three are scope substitution, ignoring the adoption stack, and single-point thinking. Funding level is relevant context from Lesson 3, but it is not the same as the forecasting failure categories in Lesson 4.

Funding bias is not one of the three forecasting errors from Lesson 4. The three are: (1) scope substitution — assuming restricted capability generalizes; (2) ignoring the adoption stack — forgetting regulatory, liability, and workflow requirements; (3) single-point thinking — assuming uniform rather than uneven adoption. Review the "Why AI Forecasts Fail" section.

3. Philip Tetlock's Forecasting AI Progress tournament found that even trained superforecasters showed overconfident near-term AI capability predictions. What reduced this overconfidence significantly?

Correct. The adoption model requirement forced forecasters to confront the adoption stack explicitly — regulatory, liability, workflow, and trust conditions — which consistently produced more realistic timelines. The capability model alone was generating overconfidence because it bypassed the question of how capability becomes deployment.

The intervention that reduced overconfidence was requiring an explicit adoption model. When forecasters had to specify not just "when will the capability exist" but "what conditions are required for it to deploy at scale," they consistently produced more realistic timelines. Capability and adoption are different questions that require separate analysis.

4. The 2022 AI Impacts survey of 738 machine learning researchers found a median estimate for 50% probability of "high-level machine intelligence" by what year?

Correct. The 2022 AI Impacts survey found a median estimate of 2059 for 50% probability of high-level machine intelligence, with 10% probability by 2028 and 90% probability by 2100. Notably, these estimates were substantially earlier than the 2016 version of the same survey, reflecting researcher updating based on observed progress.

The median estimate was 2059 for 50% probability, with a 10% probability by 2028. Importantly, these figures were substantially earlier than the 2016 survey estimates — researchers updated their views based on observed progress, particularly from large language models. The updating itself is an example of good calibration practice.

5. A well-calibrated AI forecast has five specific components according to Lesson 4. Which of the following is one of those five components?

Correct. Update triggers are one of the five components: specifying in advance what evidence would cause the forecast to change. The other four are: precise capability claim, measurement method, adoption conditions, and explicit probability with timeframe. Forecasts lacking these components are assertions, not predictions.

Update triggers are one of the five components of a well-calibrated forecast from Lesson 4. The full five are: (1) precise capability claim, (2) measurement method, (3) adoption conditions, (4) explicit probability and timeframe, (5) update triggers. A forecast lacking any of these is an assertion dressed in prediction language.

Lab 4 — Building a Calibrated Forecast

Practice constructing a forecast with all five components from Lesson 4.

Your Task

You will construct a short AI forecast using all five components: precise capability claim, measurement method, adoption conditions, explicit probability and timeframe, and update triggers. The assistant will evaluate each component and push back where the forecast is under-specified, overconfident, or ignoring the adoption stack.

Choose any specific AI application domain you find interesting — medical, legal, creative, scientific, educational, or other. Aim for a concrete near-term claim (3–7 year horizon). Complete at least three substantive exchanges to build and refine your forecast.

Begin by stating your precise capability claim: what specifically will AI be able to do, in what domain, compared to what current baseline?

Lab Assistant

Forecast Builder

Welcome to Lab 4. You are going to build a well-calibrated AI forecast using the five-component structure from Lesson 4:

1. Precise capability claim — what specifically, in what domain, vs. what baseline?
2. Measurement method — how will we know when/if it has arrived?
3. Adoption conditions — what regulatory, liability, workflow prerequisites must align?
4. Explicit probability and timeframe — a specific number and year range, not "soon"
5. Update triggers — what evidence would revise your estimate up or down?

Start with your capability claim. Pick a specific AI application domain and tell me precisely what you think AI will be able to do, compared to what current systems or human performers can do today.

Module 1 Test

Reading the Trajectory — 15 questions · Pass at 80% (12/15)

1. What did University of Edinburgh researchers find when they reformatted MMLU benchmark questions for Gemini Ultra?

Correct. The score dropped to ~62%, illustrating that high benchmark performance can reflect sensitivity to surface presentation rather than deep comprehension of the underlying questions.

The score dropped to ~62% when questions were reformatted while preserving meaning. This is the canonical example of benchmark fragility from Lesson 1.

2. Scaling laws were first documented systematically in AI by researchers at which organization, and published in which year?

Correct. The Kaplan et al. scaling laws paper was published by OpenAI in January 2020, documenting the power-law relationship between compute, parameters, data, and model performance.

OpenAI, 2020. The Kaplan, McCandlish et al. paper established the foundational scaling law results that shaped subsequent large model development.

3. The Chinchilla paper from DeepMind found that previous large language models had been trained sub-optimally because of what error?

Correct. Chinchilla found that prior large models (including GPT-3) were over-parameterized relative to their training data. The optimal approach requires scaling data and parameters together, not just adding parameters.

Chinchilla's finding was under-training: models had too many parameters relative to training data tokens. Optimal efficiency requires matching parameter scale with data scale.

4. Goodhart's Law, as applied to AI benchmarks, states that:

Correct. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. In AI benchmarking, competitive pressure to maximize scores can produce strategies that improve scores without improving the underlying capability the benchmark was proxying.

Goodhart's Law holds that measures cease to be good measures once they become targets. In AI, this means benchmark optimization can diverge from the underlying capability the benchmark was designed to measure.

5. Paul David's "The Dynamo and the Computer" (1990) argued that the productivity gains from electrification arrived in the 1920s primarily because of what factor?

Correct. David's key argument was generational: manufacturers who designed their businesses around electricity from the beginning captured the gains, rather than those who retrofitted electrical power into layouts and workflows designed for steam. This is the complementarity argument before Brynjolfsson formalized it.

David attributed the 1920s productivity surge to manufacturers who had grown up in the electrical era — they designed factories and workflows around electrical capabilities rather than retrofitting them. This is the essence of the complementarity requirement that Brynjolfsson later formalized for computing.

6. Erik Brynjolfsson's research on computing and productivity found that the key differentiator between firms that captured large gains and those that did not was:

Correct. Brynjolfsson found that technology alone was insufficient. Firms needed complementary organizational change alongside computing investment. He argues this same dynamic will govern AI's productivity impact.

Brynjolfsson's finding was about complementarity: computing plus organizational redesign produced gains; computing alone did not. He explicitly argues the same will apply to AI adoption.

7. The self-referential quality of AI — that AI outputs can assist in developing AI systems — makes it different from which aspect of the electricity analogy specifically?

Correct. The self-referential feedback loop — AI helping build AI — has no clean historical analogue and is one of the primary reasons simple timeline-mapping from electricity to AI can mislead.

The key disanalogy is self-referentiality: electricity could not generate electricity or design power infrastructure, while AI outputs can contribute to AI development. This makes historical timelines unreliable as direct mappings.

8. Within roughly how many years of Gutenberg's press (introduced c. 1440) did Venice have over 150 printing establishments, and by how much had book prices fallen?

Correct. Fifty years and ~80% price reduction. The printing press case illustrates how dramatically reducing the marginal cost of a type of output disrupts the economics of that output's production even before broader social restructuring occurs.

Fifty years and ~80% price reduction. The lesson uses this to show that immediate economic disruption (to the cost of producing a type of output) can precede broader social restructuring by generations.

9. Microsoft's $13 billion investment in OpenAI was accompanied by what commercial arrangement that illustrates the entanglement of technical and commercial decisions?

Correct. The exclusive Azure cloud-provision deal means the models that get built are partly the models Azure can most profitably host — illustrating that AI investment decisions are not purely technical but commercially entangled with cloud infrastructure economics.

Microsoft received exclusive cloud-provision rights through Azure. This means AI model development decisions are partly shaped by what is most profitable for Azure to host — an example of commercial incentives shaping technical trajectories.

10. The US October 2022 export controls on A100 and H100 GPUs to China were explicitly justified by the government on what grounds?

Correct. The explicit justification was military AI — preventing China from using these chips to develop frontier AI systems for defense applications. This introduced geopolitical competition as a direct driver of AI investment priorities and access.

The official justification was preventing Chinese military AI development. This established AI capability as a national security matter, directly entangling geopolitical competition with AI research priorities and investment.

11. The "adoption stack" concept from Lesson 4 refers to:

Correct. The adoption stack encompasses all the non-technical conditions that must align for a technically capable AI system to reach scale deployment. Each layer has its own timeline, and forecasts that ignore the adoption stack systematically overestimate deployment speed.

The adoption stack is the full set of non-technical prerequisites for scale deployment: regulatory, liability, user-trust, workflow, training, and sometimes physical infrastructure conditions. Forecasts that ignore it consistently produce overconfident timelines.

12. Geoffrey Hinton's 2016 prediction about radiology illustrates which of the three systematic AI forecasting errors from Lesson 4?

Correct. The primary error was ignoring the adoption stack. The technical capability did arrive approximately as predicted. What Hinton underestimated was the clinical deployment stack: regulatory clearance pathways, liability frameworks, workflow integration with existing diagnostic processes, and physician acceptance requirements.

The primary lesson from Hinton's prediction is the adoption stack error: the technical capability arrived roughly as predicted, but deployment requires far more than technical capability in clinical medicine. The regulatory, liability, and workflow conditions create separate timelines that the capability prediction did not address.

13. Philip Tetlock's Good Judgment Project found that the best forecasters share which specific habits?

Correct. Tetlock's superforecasters decompose questions, assign explicit probabilities, update frequently on new evidence, and seek disconfirmation. These habits are distinct from domain expertise and can be learned and measured through calibration scoring.

Tetlock's Good Judgment Project identified: decomposing questions, explicit numerical probabilities, frequent updating on new evidence, and actively seeking disconfirmation. These habits, not domain expertise, most reliably produced accurate forecasts.
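To make "calibration scoring" concrete, here is a minimal sketch of one common scoring rule, the Brier score: the mean squared difference between each stated probability and what actually happened (1 if the event occurred, 0 if not). Lower is better. Treating the Brier score as the rule the lesson has in mind is an assumption, and the two forecasters and their numbers below are invented for illustration.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: list of (probability, outcome) pairs, where probability is
    in [0, 1] and outcome is 1 if the event happened, 0 if it did not.
    Lower scores are better; always answering 0.5 scores exactly 0.25.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


# Hypothetical data: two forecasters answering the same four questions.
always_fifty_fifty = [(0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]
explicit_and_right = [(0.9, 1), (0.1, 0), (0.8, 1), (0.2, 0)]

print(brier_score(always_fifty_fifty))
print(brier_score(explicit_and_right))
```

The score can only be computed when a forecaster has committed to explicit numerical probabilities, which is one reason that habit appears on Tetlock's list.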

14. The 2022 AI Impacts survey of 738 machine learning researchers found their median estimate for "high-level machine intelligence" had shifted substantially earlier compared to the 2016 version of the same survey. What does this shift most directly illustrate?

Correct. The shift in median estimates represents researchers updating their views based on observed progress — exactly what well-calibrated forecasters do. The lesson presents this as a positive example of the research community responding to evidence rather than anchoring to prior beliefs.

The 2022 estimates are earlier than the 2016 ones because researchers updated on observed progress, particularly from large language models and deep learning advances. This is calibration in practice: beliefs shifting in response to evidence.

15. A forecast that says "AI will transform healthcare within the next decade" is missing which of the five components of a well-calibrated forecast from Lesson 4?

Correct. "AI will transform healthcare within the next decade" is missing at minimum: a precise capability claim (transform how? which parts of healthcare?), a measurement method (how would we know transformation occurred?), adoption conditions, an explicit probability (not just a vague timeframe), and update triggers. It is an assertion, not a forecast.

"AI will transform healthcare within the next decade" lacks all five components from Lesson 4: it has no precise capability claim, no measurement method, no adoption conditions, no explicit probability, and no update triggers. A vague timeframe such as "within the next decade" is not a substitute for an explicit probability attached to a date. This is an assertion, not a forecast.
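One way to apply the five-component check is to write a forecast down as a structured record and see which fields stay empty. The sketch below is illustrative only: the `Forecast` class, its field names, and the example values are hypothetical, chosen to mirror the five components named above rather than taken from Lesson 4.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Forecast:
    """Hypothetical record with one field per component of a calibrated forecast."""
    capability_claim: Optional[str] = None       # precisely what the system will do
    measurement_method: Optional[str] = None     # how we would know it happened
    adoption_conditions: List[str] = field(default_factory=list)
    probability: Optional[float] = None          # explicit number in [0, 1]
    update_triggers: List[str] = field(default_factory=list)


def missing_components(forecast: Forecast) -> List[str]:
    """Return the names of components left unspecified (None or empty)."""
    missing = []
    for f in fields(forecast):
        value = getattr(forecast, f.name)
        if value is None or value == "" or value == []:
            missing.append(f.name)
    return missing


# "AI will transform healthcare within the next decade" fills in nothing.
vague = Forecast()

# An invented, fully specified forecast for contrast (values are placeholders).
specific = Forecast(
    capability_claim="System X reads chest X-rays at specialist-level accuracy",
    measurement_method="Accuracy on a held-out, independently labeled test set",
    adoption_conditions=["regulatory clearance", "liability framework"],
    probability=0.3,
    update_triggers=["a published prospective clinical trial"],
)

print(missing_components(vague))     # all five field names
print(missing_components(specific))  # []
```

A forecast that returns an empty list is not necessarily a good forecast, but one that returns all five names, as the healthcare statement does, is an assertion.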