In September 1878, Thomas Edison announced that he had solved the problem of electric lighting. He had not: the practical incandescent bulb was still fourteen months away. Yet within weeks, gas-company stocks on the London Stock Exchange dropped sharply. Investors, journalists, and engineers did not wait for proof; they adjusted their expectations to a perceived trajectory. By 1882, Edison's Pearl Street Station in lower Manhattan was supplying current to 85 customers. By 1900, electricity had begun restructuring factory layouts, urban planning, and the nature of night itself, changes that Edison's 1878 announcement had implied but that almost nobody had traced to their full conclusion.
The same compression of expectation and reality is happening now with machine intelligence. In November 2022, OpenAI released ChatGPT to the public. Within five days it had one million users; within two months, one hundred million, the fastest consumer product adoption ever recorded at that point. Goldman Sachs estimated in March 2023 that generative AI could automate tasks equivalent to 300 million full-time jobs globally. Whether that figure proves accurate or wildly off, the structural fact is identical to 1878: a technology has crossed a visibility threshold, and the world is adjusting its expectations before the full consequences are clear.
This course is not a prediction machine. It offers frameworks: ways of reading capability curves, understanding where performance comes from, recognizing historical analogues, and thinking clearly about second-order effects. You will finish with a better vocabulary for uncertainty, not a confident forecast. That is the honest and, we think, the more useful outcome. The four modules move from trajectory-reading, to how these systems actually work, to where they are being deployed, to what governing them might actually require.
On December 6, 2023, Google DeepMind published results showing that its Gemini Ultra model had scored 90.0% on the Massive Multitask Language Understanding benchmark (MMLU), a set of 57 academic subjects from elementary mathematics to professional law. Human expert performance on MMLU is estimated at roughly 89.8%. The news cycle interpreted this as a milestone: a general-purpose AI had matched or exceeded broad human expert knowledge. Within a week, researchers at the University of Edinburgh published a counter-analysis. They found that Gemini Ultra's score dropped to approximately 62% when questions were reformatted in ways that changed surface features but not meaning. The benchmark had been passed. The capability it was supposed to measure had not been cleanly demonstrated.
This is not an isolated incident. It is the central interpretive challenge of following AI progress. Scores rise. What scores measure, and how robustly they measure it, is a separate and harder question.
A benchmark is a fixed test, a curated dataset of questions, tasks, or challenges, against which a model's outputs are scored. Benchmarks exist because comparing systems requires a common reference. The alternative, open-ended human evaluation, is slow, expensive, and hard to replicate. Benchmarks solve a logistics problem. They do not necessarily solve the meaning problem.
Three dynamics recur in AI benchmarking history. First, saturation: a benchmark designed for one generation of models becomes too easy for the next. ImageNet, the image-classification competition that launched the deep learning era when AlexNet won it in 2012 with 84.7% top-5 accuracy, is now routinely solved above 90% by models that still struggle with adversarial images any child would identify correctly. Second, dataset contamination: models trained on internet text often encounter benchmark questions in their training data, inflating scores. Third, Goodhart's Law dynamics: once a benchmark becomes a target, pressure accumulates to optimize for it specifically, and the optimized score diverges from the underlying capability it was meant to proxy.
None of this means benchmarks are useless. It means they must be read as evidence, not verdicts.
The MMLU benchmark was introduced by Dan Hendrycks and colleagues at UC Berkeley in 2020, specifically to test whether large language models had absorbed diverse factual and reasoning knowledge. GPT-3 scored 43.9% at launch. By early 2023, GPT-4 scored 86.4%. By late 2023, multiple models claimed scores at or above estimated human expert performance. Simultaneously, researchers demonstrated that models achieving these scores still failed systematically on multi-step reasoning variants of the same underlying questions, suggesting the benchmark was measuring pattern-matching to known answer formats more than deep comprehension.
AI capability curves, graphs of benchmark performance over time, show consistent patterns that reward careful reading. Performance on most benchmarks follows an S-curve: slow initial progress, a rapid acceleration phase, then saturation near the ceiling. The acceleration phase is when media coverage spikes and comparisons to human performance become dramatic. The saturation phase is when the benchmark is replaced by a harder one, and the cycle restarts.
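The S-curve shape described above can be sketched with a logistic function. This is a minimal illustration, not a fit to any real benchmark: the names `logistic_score` and `yearly_gains` and every parameter value (a 25% floor, as for random guessing on four-option questions, a 100% ceiling, the midpoint year, and the growth rate) are assumptions chosen for illustration.

```python
import math


def logistic_score(year, ceiling=100.0, midpoint=2021.0, rate=1.2, floor=25.0):
    """Benchmark score in percent under a logistic S-curve.

    All default parameters are illustrative assumptions, not measured values.
    """
    return floor + (ceiling - floor) / (1.0 + math.exp(-rate * (year - midpoint)))


def yearly_gains(years):
    """Year-over-year score gains for consecutive entries in `years`."""
    scores = [logistic_score(y) for y in years]
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


years = list(range(2017, 2026))
for year, gain in zip(years[1:], yearly_gains(years)):
    # Gains are small early, peak near the midpoint, and shrink near the ceiling.
    print(year, round(logistic_score(year), 1), round(gain, 1))
```

Under these assumed parameters the largest gains cluster around the midpoint year, and the gains shrink toward zero near the ceiling, which is the stage at which a benchmark stops distinguishing between models.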
The most instructive recent example is the ARC-AGI benchmark, created by François Chollet and released in 2019. ARC tasks require solving novel visual-pattern puzzles that are trivially easy for most humans but that, as of 2023, no large language model could solve above roughly 30%. In December 2024, OpenAI announced that its o3 model, using a test-time compute scaling approach, had scored 75.7% on the semi-private evaluation set, and that a high-compute configuration had reached 87.5%. Chollet noted that this represented genuine progress but cautioned that o3 was spending compute equivalent to thousands of dollars per puzzle, whereas human solvers spend seconds. The capability was real; the efficiency gap was equally real.
Understanding a capability curve therefore requires four questions: What specific task is being measured? How was human performance established? What are the benchmark's known weaknesses? At what computational cost was the score achieved?
In January 2020, researchers at OpenAI, including Jared Kaplan and Sam McCandlish, published a paper documenting what became known as neural scaling laws. They found that the performance of large language models improved in a remarkably predictable, power-law relationship with three variables: the number of model parameters, the size of the training dataset, and the amount of compute used. Double the compute, and the model's test loss falls by a predictable factor. The curve was smooth and consistent across orders of magnitude.
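A power-law relationship means that each doubling of compute multiplies the loss by the same constant factor. The sketch below illustrates that property only. The functional form L(C) = (c_ref / C)^alpha, the function names, and the exponent value 0.05 are assumptions for illustration, not figures quoted from the paper.

```python
def power_law_loss(compute, c_ref=1.0, alpha=0.05):
    """Loss under an assumed power law L(C) = (c_ref / C) ** alpha.

    `c_ref` and `alpha` are illustrative placeholders, not fitted constants.
    """
    if compute <= 0:
        raise ValueError("compute must be positive")
    return (c_ref / compute) ** alpha


def loss_ratio_per_doubling(alpha=0.05):
    """Factor by which loss is multiplied each time compute doubles."""
    return 2.0 ** (-alpha)


for compute in (1.0, 2.0, 4.0, 8.0):
    # Each step doubles compute; the ratio between successive losses is constant.
    print(compute, round(power_law_loss(compute), 4))
print("ratio per doubling:", round(loss_ratio_per_doubling(), 4))
```

With a small exponent like the one assumed here, each doubling buys only a few percent of loss reduction, which is why large capability gains under this model require order-of-magnitude increases in compute.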
This was a significant finding because it meant AI progress was, to a degree, engineerable: more resources in, more capability out, on a predictable schedule. It also meant that companies and researchers could plan investments around capability projections rather than relying entirely on algorithmic breakthroughs. The 2022 Chinchilla paper from DeepMind's Jordan Hoffmann and colleagues refined the picture: prior large models had been under-trained relative to their size. Optimal training requires scaling data and parameters together. GPT-4, released in March 2023 with its training details undisclosed but its training run widely estimated to be very large, appeared to follow the refined scaling predictions closely.
The limits of scaling laws are now a primary research question. Some researchers argue performance on truly novel reasoning tasks will plateau regardless of scale. Others argue we have not yet found the ceiling. The honest answer is that nobody knows, which is itself a fact worth holding.
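Scaling data and parameters together can be made concrete with two widely used rules of thumb, neither of which appears in the text above and both of which are treated here as assumptions: training compute is approximately 6 FLOPs per parameter per token, and a compute-optimal model sees roughly 20 training tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into (parameters, training tokens).

    Assumes C ~= 6 * N * D and D ~= 20 * N, two common approximations;
    the real optimum depends on architecture and data.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Under these assumptions, quadrupling compute doubles both parameters and tokens.
for budget in (1e21, 4e21, 1.6e22):
    params, tokens = compute_optimal_split(budget)
    print(f"{budget:.1e} FLOPs -> {params:.2e} params, {tokens:.2e} tokens")
```

The design point is that neither quantity is scaled alone: under these assumptions both grow with the square root of the compute budget, so a model that is large but trained on too few tokens is using its budget inefficiently.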
In 2022, researchers at Google Brain published a widely discussed paper on "emergent abilities" in large language models: capabilities that appeared abruptly at certain model scales rather than improving gradually. Three-digit arithmetic, chain-of-thought reasoning, and certain analogical tasks seemed to switch on near-discontinuously above threshold parameter counts. This implied that simply scaling could produce unexpected qualitative leaps, not just quantitative improvements.
A 2023 follow-up by Rylan Schaeffer and colleagues at Stanford challenged the emergence interpretation. They argued that apparent emergence was often an artifact of the evaluation metric: discontinuous-looking improvements on metrics like exact-match accuracy become smooth progressions when evaluated on continuous metrics. The underlying capability was improving smoothly; the measurement was hiding it.
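The metric argument can be illustrated with a toy calculation. Suppose a model's per-token accuracy improves steadily, and an answer counts as an exact match only if every one of its tokens is correct. The sketch below assumes token errors are independent, which is a simplification, and the accuracy values and answer length are made up for illustration.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """Probability that all tokens are right, assuming independent token errors."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be between 0 and 1")
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for successively larger models.
per_token_by_scale = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
answer_length = 10  # assumed number of tokens in the answer

for p in per_token_by_scale:
    # The continuous metric (p) rises steadily; exact match stays near zero,
    # then climbs steeply once p is high.
    print(p, round(exact_match_accuracy(p, answer_length), 4))
```

The same underlying improvement looks gradual on the per-token metric and abrupt on exact match, which is the pattern the Stanford authors argued could be mistaken for emergence.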
Whether or not emergent capabilities are "real" in the strong sense, the episode illustrates the epistemological challenge of the field. Very smart researchers examining the same models and the same data reach substantially different conclusions. Reading the trajectory of AI requires holding multiple interpretations simultaneously and updating as evidence accumulates.
Before you can reason about where AI is going, you need a reliable method for reading where it is. That method is not "trust the benchmark scores" and not "distrust all claims." It is: identify what was measured, how it was measured, what the measurement does and does not imply, and what economic or competitive pressures might be distorting the presentation of results. The rest of this module builds that method in four directions: historical analogues, economic drivers, measurement failures, and forecast methodology.
You have encountered a news headline: "New AI Model Beats Human Experts on Medical Licensing Exam." Before accepting or rejecting this claim, you need to apply the four benchmark-reading questions from Lesson 1. Use this lab to work through them with an AI guide.
The assistant will ask you to apply each question in turn, give feedback on your reasoning, and push back if your interpretation is too credulous or too dismissive. Complete at least three substantive exchanges to finish the lab.
In 1900, the economist David A. Wells published a retrospective analysis of American economic transformation since 1870. He documented that the railroad had, within three decades, eliminated entire occupational categories (stagecoach drivers, canal operators, certain categories of freight-wagon teamsters) while creating new ones that had not existed before: locomotive engineers, telegraph operators coordinating rail traffic, the hotel and restaurant workers serving rail hubs. The net employment effect was positive, but the distributional effect was sharp: specific communities built around pre-railroad transport were economically devastated, while other communities grew rapidly. Wells noted that nobody had predicted the specific geography of winners and losers in advance, even though the general direction of travel had been clear.
This observation, that general trajectories can be readable while specific distributional consequences remain opaque until after the fact, is the central lesson historical analogues offer for thinking about AI.
The most commonly cited historical parallel for AI is electrification. The comparison is instructive in both what it illuminates and what it obscures. The illuminating part: electricity, like AI, was a general-purpose technology, one capable of transforming productivity across nearly every industry rather than being confined to a single application domain. Economist Paul David's influential 1990 paper, "The Dynamo and the Computer," documented that electrification did not produce measurable economy-wide productivity gains for roughly forty years after Edison's Pearl Street Station opened in 1882. Factories had to physically redesign their layouts, workers had to develop new skills, regulatory and safety frameworks had to develop, and secondary industries supplying electrical components had to scale. The productivity gains arrived in the 1920s, driven by manufacturers who had grown up in the electrical era and thought in terms of it from the start.
David offered this lag as an explanation for the "productivity paradox," a term applied to computing in the 1980s and 1990s, when widespread computerization was not yet showing up in aggregate productivity statistics. Robert Solow's 1987 quip, "You can see the computer age everywhere but in the productivity statistics," articulated the same pattern. The Solow paradox eventually resolved: economy-wide computing productivity gains became visible in US statistics beginning around 1995, roughly thirty years after mainframe computing began spreading through corporate America.
If the electrification and computing analogues hold for AI, the implication is that transformative productivity effects may lag the technology's visible deployment by a decade or more, and will require complementary investments in physical infrastructure, organizational redesign, and workforce reskilling that are not captured in the AI systems themselves.
Erik Brynjolfsson at MIT documented the resolution of the computing productivity paradox in a series of papers beginning in the late 1990s. He found that firms which invested heavily in computing and also undertook organizational changes (flatter hierarchies, worker reskilling, redesigned workflows) captured large productivity gains. Firms that installed computers without complementary organizational change did not. The technology was necessary but not sufficient. Brynjolfsson argues explicitly that the same dynamic will apply to AI: the gains will be captured by organizations that redesign around AI capabilities, not those that layer AI onto existing workflows.
The electrification analogy also obscures several things. The first disanalogy is self-reference. Electricity did not generate electricity. Its outputs could not replace the engineers designing its infrastructure or the economists analyzing its effects. Large language models can, at least partially, assist in writing code for AI systems, generating training data, and analyzing AI research papers. This self-referential quality has no clean analogue in prior general-purpose technology transitions and makes simple application of historical timelines unreliable.
The second disanalogy is speed of deployment. The commercial telegraph took roughly fifteen years to wire the United States after Morse's 1844 Washington-to-Baltimore demonstration. The telephone took decades to reach majority household penetration. ChatGPT reached one million users in five days and one hundred million in two months. The deployment velocity of software-based AI is categorically faster than any prior general-purpose technology because it requires no physical installation at the point of use. This does not necessarily mean economic effects arrive faster (the Brynjolfsson complementarity argument still applies), but it does mean the period of public awareness and competitive response is compressed dramatically.
A third disanalogy involves the nature of what is being automated. Prior general-purpose technologies primarily substituted for physical labor or for specific, well-defined cognitive tasks (arithmetic calculation, data storage and retrieval). Current large language models engage in tasks such as writing, legal analysis, medical diagnosis, code generation, and scientific literature review, which had historically been considered definitionally cognitive, requiring judgment and interpretation. The historical precedents for automating judgment-intensive work are sparser and less conclusive.
A second historical analogue worth examining is Gutenberg's printing press, introduced in Europe around 1440. Prior to the press, book production was controlled by monastic scriptoria and a small number of secular workshops. Literacy was correlated tightly with this production monopoly: those who produced books largely determined who had access to knowledge. The press did not immediately democratize literacy; that took roughly two centuries and required the Reformation, the development of vernacular literatures, and significant changes in educational institutions. But it did immediately and dramatically disrupt the economics of knowledge production: within fifty years, Venice alone had over 150 printing establishments, and the price of books fell by an estimated 80%.
The parallel to AI and knowledge work is that a technology which dramatically reduces the marginal cost of producing a specific type of output (in 1450, copied text; in 2024, written, analytical, and coded content) does not immediately restructure society, but it does immediately disrupt the economics of that output's production. The institutions, credentials, and economic arrangements built around the scarcity of that production face pressure that compounds over time.
The correct use of historical analogies is not to predict AI's future by substituting it into a prior technology's timeline. It is to identify mechanisms (the productivity paradox, complementarity requirements, knowledge-monopoly disruption, distributional unevenness) that have recurred across multiple transitions and ask whether those mechanisms are present now. Where they are present, the analogies increase your confidence. Where they are absent or inverted, the analogies signal caution.
You will analyze the analogy between AI and the printing press. The assistant will guide you through identifying which mechanisms from the printing press transition are present in AI's current situation, and which are absent or inverted. This is the "using analogies well" skill from Lesson 2.
Engage with at least three substantive exchanges. The assistant will push back on oversimplified mappings and ask you to identify specific mechanisms.
By the first quarter of 2024, Microsoft had committed a reported $13 billion in investment to OpenAI. Google had agreed to invest up to $2 billion in Anthropic, on top of an earlier stake of roughly $300 million. Amazon had announced up to $4 billion for Anthropic. Meta disclosed that it expected to spend roughly $35 billion in capital expenditure in 2024, primarily on AI infrastructure. These are not research grants. They are strategic investments by companies whose core revenue (advertising, cloud computing, enterprise software, e-commerce) faces potential disruption from AI capabilities, and who are simultaneously positioned to capture value if those capabilities become infrastructure.
Understanding the trajectory of AI requires understanding what these capital flows are and are not optimizing for: which capabilities get funded, which problems get prioritized, and which gaps get left.
The three largest cloud providers β Amazon Web Services, Microsoft Azure, and Google Cloud β collectively control an estimated 65% of global cloud infrastructure as of 2024. Each of these providers benefits from AI in two distinct and compounding ways. First, AI model training and inference require compute, and they sell compute. Second, AI capabilities integrated into their cloud platforms increase switching costs and create new billable services. The incentive to accelerate AI capability development is therefore extremely strong among these actors independently of any interest in AI itself.
This investment structure shapes the research agenda in observable ways. Applications that can be delivered as cloud services (API-accessible models, enterprise software integrations, code generation tools) receive enormous resources. Applications that require on-device inference or that would reduce dependence on cloud infrastructure receive comparatively less. The AI landscape in 2024 is therefore partly a map of what is profitable to build as a cloud service, not purely a map of what is technically achievable or socially useful.
Microsoft's $13 billion OpenAI investment was accompanied by a deal giving Microsoft exclusive cloud-provision rights for OpenAI's models. The technical and commercial decisions are entangled: the models that exist are partly the models that Azure can most profitably host.
NVIDIA's H100 GPU, the primary hardware for large-model training as of 2023–2024, was priced at approximately $30,000–$40,000 per unit. A single large training run for a frontier model requires thousands of them. NVIDIA reported for the second quarter of its fiscal year 2025 (ended July 2024) that data-center revenue, primarily AI chips, had grown 154% year-over-year, reaching $26.3 billion in a single quarter. This concentration of critical infrastructure in a single hardware supplier creates a dependency that shapes the entire field: which organizations can afford to train frontier models, which research directions require prohibitive compute, and which geographies can participate in leading AI development are all partly determined by access to NVIDIA hardware and US export controls on that hardware.
A substantial fraction of AI investment is explicitly justified by the expectation of labor-cost reduction. McKinsey's 2023 report on generative AI estimated that it could add between $2.6 trillion and $4.4 trillion in annual economic value across the use cases it analyzed, largely through productivity gains in work that people now perform. Goldman Sachs's March 2023 analysis estimated 300 million full-time equivalent jobs exposed to automation. These figures drive investment decisions: if AI can substitute for a significant fraction of knowledge-work labor at a fraction of its cost, the return on AI investment is potentially enormous.
The substitution argument has important nuances that get compressed in the headline figures. First, "exposed to automation" is not the same as "will be automated": automation potential must overcome switching costs, regulatory barriers, user-acceptance requirements, and complementarity demands. Second, partial automation, in which AI handles some tasks within a job while humans handle others, often increases demand for the remaining human tasks rather than eliminating the role. Third, the historical record on automation consistently shows job displacement concentrated in specific occupations and geographies rather than distributed uniformly, even when aggregate net employment effects are neutral or positive.
What the labor-cost argument does reliably predict is where AI investment will concentrate: on tasks that are high-volume, relatively standardized, performed by expensive knowledge workers, and currently executed without significant regulatory friction. Legal document review, software coding, customer service, medical imaging interpretation, and content generation each satisfy most of these criteria and each attract disproportionate AI investment.
Since 2022, AI investment has increasingly been shaped by explicit national security and geopolitical considerations alongside commercial ones. The US government's October 2022 export controls on advanced semiconductor technology to China, including A100 and H100 GPUs, were explicitly justified as necessary to prevent Chinese development of frontier AI models for military applications. China's response included significant state investment in domestic chip design (Huawei's Ascend series) and AI model development (Baidu's ERNIE, Alibaba's Qwen, and others).
The National Security Commission on Artificial Intelligence, chaired by former Google CEO Eric Schmidt, published a 756-page report in March 2021 recommending that federal non-defense AI research and development funding grow to $32 billion per year by 2026 to maintain US technological leadership. The CHIPS and Science Act of 2022, signed by President Biden in August of that year, allocated roughly $52 billion for domestic semiconductor manufacturing and research, explicitly linking chip production capacity to AI capability and national security.
The implication for reading the AI trajectory is that investment levels and research priorities are now partly determined by geopolitical competition, not only commercial incentives. This tends to accelerate development of AI capabilities perceived as strategically important (autonomous systems, surveillance, logistics optimization, cybersecurity) while potentially underinvesting in safety research, interpretability, and applications that are socially beneficial but not competitively strategic.
Investment concentrations are leading indicators of where capability will develop, but they are lagging indicators of social impact. The gap between "where money is going" and "what happens to workers and institutions" is where most AI analysis goes wrong. Following the capital tells you what will be built. It does not tell you what will be adopted at scale, what will encounter resistance, or what second-order effects will follow adoption.
Using what you know about cloud-provider incentives, labor-cost substitution pressures, and national security drivers from Lesson 3, you will identify a domain of AI capability that is likely underfunded relative to its social importance β and explain the economic logic for why it is underfunded.
The assistant will ask you to be specific about which economic incentive structures are absent, and will push back if your analysis is too vague. Complete at least three substantive exchanges.
In 2016, Geoffrey Hinton, one of the three researchers who shared the 2018 Turing Award for foundational contributions to deep learning, said that radiologists should stop training because, within five years, neural networks would definitively outperform them at reading medical images. In 2022, he revised this view: deep learning had indeed achieved impressive diagnostic accuracy on specific imaging tasks, but radiologists were still very much employed, partly because the task of radiology encompasses far more than pattern recognition in a single image, and partly because deployment in clinical settings involves regulatory, liability, and workflow factors that technology performance alone cannot resolve. Hinton, one of the most knowledgeable people in the world about deep learning, had made a confidently stated prediction that proved substantially wrong, not because the technical capability failed to arrive, but because capability and deployment are different questions with different timelines.
AI forecasting has a documented record of consistent errors in identifiable directions. The first is scope substitution: a capability is demonstrated in a restricted domain, and the forecast assumes unrestricted generalization. Deep learning's success on curated ImageNet images was forecast to generalize to medical images in uncontrolled clinical conditions; the transition required years of additional work on robustness, distribution shift, and annotation quality. Benchmark performance is measured in controlled conditions; deployment happens in the wild. The gap between them is systematically underestimated.
The second error is ignoring the adoption stack. A technology requires not only technical capability but a full stack of complementary conditions before it reaches scale deployment: regulatory approval (especially in medicine, law, and finance), liability frameworks, user-trust development, workflow integration, training of practitioners, and sometimes physical infrastructure. Each layer has its own timeline, and they must align. GPT-4 achieved impressive medical exam scores in 2023; as of mid-2024, no AI system had received FDA clearance as a general diagnostic tool, because regulatory pathways for AI-based clinical decision support are still being developed.
The third error is single-point thinking: presenting a future state as if it will arrive uniformly, when in practice adoption is geographically, economically, and institutionally uneven. The same technology may transform a well-resourced urban hospital and have no effect on a rural clinic lacking the connectivity, IT infrastructure, and trained staff to implement it.
Philip Tetlock's Good Judgment Project, which has tracked forecaster accuracy across thousands of geopolitical and economic questions since 2011, found that the best forecasters share specific habits: they break questions into components, assign explicit numerical probabilities, update frequently as new evidence arrives, and actively seek disconfirmation of their current views. Tetlock's 2023 AI-focused forecasting tournament, Forecasting AI Progress (FAIP), found that even trained superforecasters showed systematically overconfident predictions about near-term AI capabilities, but that the overconfidence was significantly reduced when forecasters were required to explicitly state their model of adoption, not just capability. The adoption model forced them to confront the adoption stack problem.
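Explicit numerical probabilities matter because they can be scored after the fact. The text above does not name a scoring rule; the sketch below uses the Brier score in its common binary form (mean squared difference between the stated probability and the 0-or-1 outcome, lower is better) as one standard choice. The two sets of forecasts are invented for illustration.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    `forecasts` is a list of (probability, happened) pairs, where `happened`
    is True or False. 0.0 is perfect; always answering 0.5 scores 0.25.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    total = 0.0
    for probability, happened in forecasts:
        outcome = 1.0 if happened else 0.0
        total += (probability - outcome) ** 2
    return total / len(forecasts)


# Hypothetical track records on the same four questions, two of which came true.
overconfident = [(0.95, True), (0.95, False), (0.95, False), (0.95, True)]
hedged = [(0.60, True), (0.60, False), (0.60, False), (0.60, True)]

print("overconfident:", round(brier_score(overconfident), 4))  # about 0.45
print("hedged:", round(brier_score(hedged), 4))                # about 0.26
```

In this made-up example the overconfident forecaster is penalized heavily for the two confident misses, which is the kind of feedback that makes overconfidence visible and correctable.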
A well-calibrated AI forecast has five components, each of which can be evaluated separately. First, it specifies the capability claim precisely: not "AI will handle legal work" but "AI will perform first-pass contract review for standard commercial agreements with fewer errors than junior associates in firms using current proofreading workflows." Second, it specifies the measurement method: how will we know if this capability has arrived? Third, it specifies the adoption conditions: what regulatory, liability, and workflow prerequisites must be satisfied before the capability translates to deployment at scale? Fourth, it gives an explicit probability and timeframe: not "soon" but "70% probability within five years." Fifth, it identifies the update triggers: what specific evidence would cause the forecast to be revised upward or downward?
Forecasts that lack any of these five components are not predictions in any meaningful sense. They are assertions about the future dressed in prediction language. The distinction matters enormously when deciding whether to act (whether to retrain, whether to invest, whether to regulate) on the basis of the forecast.
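The five components can be written down as a simple record with a completeness check. This is a minimal sketch: the class name `Forecast`, its field names, and the `missing_components` method are hypothetical, and the check only tests whether each component is present, not whether it is precise or well reasoned. The example reuses the contract-review claim and the 70%-in-five-years figure from the text; the measurement method, adoption conditions, and update triggers are invented placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Forecast:
    capability_claim: str
    measurement_method: str
    adoption_conditions: List[str]
    probability: float       # strictly between 0 and 1
    horizon_years: float     # must be positive
    update_triggers: List[str]

    def missing_components(self) -> List[str]:
        """Names of the five components that are absent or unusable."""
        missing = []
        if not self.capability_claim.strip():
            missing.append("capability claim")
        if not self.measurement_method.strip():
            missing.append("measurement method")
        if not self.adoption_conditions:
            missing.append("adoption conditions")
        if not (0.0 < self.probability < 1.0 and self.horizon_years > 0):
            missing.append("probability and timeframe")
        if not self.update_triggers:
            missing.append("update triggers")
        return missing


contract_review = Forecast(
    capability_claim=("AI performs first-pass contract review for standard "
                      "commercial agreements with fewer errors than junior associates"),
    measurement_method="Blind comparison of error rates on held-out agreements",
    adoption_conditions=["Liability rules for AI-assisted review",
                         "Integration into firm workflows"],
    probability=0.70,
    horizon_years=5,
    update_triggers=["A published error-rate comparison",
                     "A regulator's ruling on AI-assisted review"],
)
print(contract_review.missing_components())  # an empty list: all five are present
```

A statement such as "AI will handle legal work soon" would fail this check on four of the five components, which is one way to see how little it actually commits to.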
Despite the general unreliability of AI forecasts, there are domains where careful forecasters with different methodological approaches converge, and that convergence is itself evidence worth noting. The AI Impacts survey of machine learning researchers, conducted in 2022 with responses from 738 researchers, found median estimates of a 50% probability of "high-level machine intelligence" (defined in the survey as unaided machines being able to accomplish every task better and more cheaply than human workers) by 2059, with a 10% probability by 2028 and a 90% probability by 2100. The median was only slightly earlier than in the 2016 version of the same survey (2061), but a 2023 follow-up moved it to 2047, reflecting researcher updating based on observed progress.
On nearer-term questions, there is stronger convergence. The 2022 AI Impacts survey found median estimates of roughly 2025–2027 for AI systems competitive with human performance on a wide range of specific task benchmarks, a range that has proved roughly accurate given GPT-4's 2023 performance and subsequent models. There is also convergence on the proposition that AI will not progress in a smooth linear fashion: most researchers expect a combination of continued scaling gains, periods of plateau, and potentially transformative capability jumps from new architectural approaches, though the timing and nature of those jumps remain contested.
Reading the trajectory of AI requires four parallel skills: reading benchmark claims accurately (Lesson 1), understanding which historical mechanisms apply and which do not (Lesson 2), tracing investment logic to see what will and will not be built (Lesson 3), and distinguishing well-calibrated forecasts from confident assertions (Lesson 4). None of these skills produces certainty. Together, they produce significantly better judgment than naive optimism, naive pessimism, or the reflex to defer to whoever sounds most confident.
You will construct a short AI forecast using all five components: precise capability claim, measurement method, adoption conditions, explicit probability and timeframe, and update triggers. The assistant will evaluate each component and push back where the forecast is under-specified, overconfident, or ignoring the adoption stack.
Choose any specific AI application domain you find interesting: medical, legal, creative, scientific, educational, or other. Aim for a concrete near-term claim (3–7 year horizon). Complete at least three substantive exchanges to build and refine your forecast.