Lesson 1 · AI in Science — Module 4

General Circulation Models & the AI Acceleration

How decades of numerical weather prediction gave way to neural networks trained on reanalysis data

Why did it take 70 years to build a physics-based global climate model — and can AI now shortcut that process?

On 5 April 1950, the ENIAC computer at Aberdeen Proving Ground ran the first numerical weather forecast in history. A team led by Jule Charney at Princeton's Institute for Advanced Study produced a 24-hour pressure-field map for North America. The computation took 24 hours — real-time parity at best. Charney called it a proof of concept; it would take three more decades before operational numerical weather prediction outperformed experienced human forecasters.

General Circulation Models (GCMs) grew from that lineage. By the 1970s, Syukuro Manabe and colleagues at GFDL had coupled an atmospheric model to a simple ocean, producing the first projections of CO₂-forced warming. The physics — fluid dynamics, thermodynamics, radiation transfer — was well understood. The bottleneck was always computation and resolution.

What Is a General Circulation Model?

A GCM divides the atmosphere, ocean, land surface, and sea ice into a three-dimensional grid of cells, typically 50–100 km across horizontally with dozens of vertical layers. Within each cell, the model solves discretised approximations of the Navier–Stokes equations for fluid motion (in practice, the hydrostatic primitive equations), the radiative transfer equation for energy transport, and parameterisation schemes that approximate sub-grid processes such as convective clouds, ocean eddies, and soil moisture.

The resulting systems contain tens of millions of coupled equations. A single 100-year simulation at 25 km resolution on NCAR's Cheyenne supercomputer consumed roughly 35 million core-hours of compute time. This computational cost is the central constraint of climate science: you cannot run enough ensemble members to fully sample uncertainty, and you cannot afford to run at the resolution needed to resolve convective systems explicitly.
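The link between resolution and cost can be made concrete with a back-of-envelope sketch. It assumes square grid columns, an Earth surface area of about 5.1 × 10⁸ km², and the common rule of thumb that cost grows with the cube of the inverse grid spacing (two powers from the horizontal cell count, one from the shorter time step a finer grid needs). The function names are illustrative and do not come from any real model.

```python
EARTH_SURFACE_KM2 = 5.1e8  # approximate surface area of the Earth (assumption)


def grid_cells(dx_km: float, n_levels: int) -> int:
    """Approximate number of 3-D cells for square columns of side dx_km."""
    columns = EARTH_SURFACE_KM2 / (dx_km * dx_km)
    return round(columns * n_levels)


def relative_cost(dx_km: float, ref_dx_km: float = 100.0) -> float:
    """Cost relative to a reference grid, assuming cost ~ (1 / dx)^3.

    Vertical levels are held fixed; this is a rule of thumb, not a benchmark.
    """
    return (ref_dx_km / dx_km) ** 3


if __name__ == "__main__":
    for dx in (100.0, 50.0, 25.0):
        print(f"{dx:>5.0f} km: {grid_cells(dx, 50):>12,d} cells, "
              f"{relative_cost(dx):.0f}x the 100 km cost")
```

Under these assumptions, moving from 100 km to 25 km spacing multiplies the cell count by 16 and the cost by roughly 64, which is why kilometre-scale, convection-resolving climate runs remain out of reach for long ensembles.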

The ERA5 reanalysis, released by ECMWF in 2017–2019, changed the data landscape. ERA5 provides hourly, globally gridded atmospheric, land-surface, and sea-state data from 1940 to present at 31 km resolution: in effect, a physically coherent reconstruction of global weather over more than eight decades. This ~5 petabyte dataset became the training corpus for the first generation of AI weather and climate models.

Key Milestone

ECMWF's ERA5 reanalysis (2017–2019) provides 80+ years of globally coherent atmospheric data at 31 km resolution. Its ~5 petabyte dataset became the foundational training corpus for the major AI weather models developed between 2022 and 2025.

The Physics vs. Data-Driven Divide

Traditional GCMs are process-based: their core equations encode physical laws. This gives them strong interpretability and the ability to extrapolate to climate states outside the historical record, which is critical when projecting 2100 conditions that have no historical analogue. But they are slow, expensive, and heavily dependent on parameterisations whose tuning introduces systematic biases.

Data-driven or machine-learning weather models (MLWMs) take a different approach. They learn a mapping from one atmospheric state to the next, trained entirely on reanalysis data. They have no explicit fluid equations; instead, the weights of a neural network encode the statistical relationships in ERA5. This makes them orders of magnitude faster at inference time — and, as of 2023–2024, competitive or superior on standard skill metrics for medium-range forecasting.
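The "one state to the next" idea can be sketched as an autoregressive rollout: a one-step model is applied repeatedly, with each output fed back in as the next input. In the sketch below, `toy_step` is a hypothetical placeholder for a trained network and is not taken from any real MLWM; only the rollout loop is the point.

```python
from typing import Callable, List

State = List[float]  # toy stand-in for a flattened atmospheric state


def rollout(step: Callable[[State], State], x0: State, n_steps: int) -> List[State]:
    """Apply a one-step model repeatedly, feeding each output back as input."""
    states = [x0]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


def toy_step(x: State) -> State:
    """Hypothetical placeholder for a trained network: halves every value."""
    return [0.5 * v for v in x]


if __name__ == "__main__":
    trajectory = rollout(toy_step, [8.0, 4.0], n_steps=3)
    print(trajectory[-1])
```

Because every forecast step consumes the previous prediction rather than an observed state, any systematic error in the learned step is carried forward through the whole forecast.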

The critical question for climate projection (as opposed to weather forecasting) is whether these statistical relationships remain valid decades hence, under greenhouse-gas forcing that has no parallel in the ERA5 training window. This is the unresolved frontier the field is actively debating.

GCM: General Circulation Model. A physics-based numerical model that discretises the atmosphere and/or ocean on a global grid to simulate climate dynamics.

Reanalysis: A retrospective reconstruction of historical atmospheric states produced by combining observational data with a short-range forecast model. ERA5 is the primary example used in AI training.

Parameterisation: Mathematical approximations for sub-grid physical processes (e.g., convection, cloud microphysics) that cannot be resolved explicitly at GCM resolution.

MLWM: Machine-Learning Weather Model. A neural network trained on reanalysis to predict future atmospheric states, without explicit physical equations.

The 2022–2024 Revolution

In November 2022, Huawei's research team published Pangu-Weather, a 3D Earth Attention Network trained on ERA5 that matched or exceeded ECMWF's operational deterministic forecast on multiple standard metrics up to 7 days lead time. Shortly after, in December 2022, DeepMind released GraphCast as a preprint (published in Science in November 2023). Trained on ERA5 1979–2017, it outperformed ECMWF's HRES model on about 90% of the 1,380 standard verification targets. Both models produce a global forecast in about a minute or less on a single accelerator (GPU or TPU), compared with hours on ECMWF's HPC clusters.

Google's NeuralGCM (2024) attempted to bridge the divide: a hybrid model that couples a physics-based dynamical core, which solves the large-scale equations of motion, with neural networks that learn flexible sub-grid parameterisations from data, so that physical conservation laws are preserved. NeuralGCM runs 100-member ensemble simulations 1000× faster than comparable physics-only models.

These systems are not replacing GCMs for century-scale climate projection. But they are fundamentally transforming what is computationally feasible for seasonal forecasting, ensemble uncertainty quantification, and rapid impact assessment.

Documented Benchmark

GraphCast (DeepMind, 2023) outperformed ECMWF's operational HRES forecast on 90.3% of 1,380 verification targets across all lead times from 6 hours to 10 days. Its inference time for a global 10-day forecast: approximately 60 seconds on a single TPU.

Lesson 1 Quiz

GCMs and the AI Weather Model Revolution

1. What dataset became the primary training corpus for AI weather models like GraphCast and Pangu-Weather?

Correct. ERA5, released 2017–2019, provides ~5 petabytes of hourly global data from 1940 to present and is the foundational training set for Pangu-Weather, GraphCast, FourCastNet, and NeuralGCM.

Not quite. ERA5 (ECMWF) is the dominant training corpus. Its hourly, 31 km global coverage from 1940 onward made it uniquely suited as a supervised learning dataset for atmospheric prediction.

2. What fundamental limitation makes data-driven models (MLWMs) problematic for century-scale climate projection?

Exactly right. MLWMs learn statistical patterns from historical ERA5 data. Future climate states under sustained high-CO₂ forcing have no historical analogue, so there is no guarantee the learned mappings remain valid — a critical limitation for multi-decadal projection.

The opposite is true for speed — MLWMs are dramatically faster. The key limitation is generalisation: they are trained on 1940–2023 climate states and may not reliably extrapolate to a 2100 climate regime that falls outside that distribution.

3. DeepMind's GraphCast outperformed ECMWF's HRES model on what fraction of standard verification targets?

Correct — 90.3% of 1,380 verification targets across all lead times from 6 hours to 10 days, as reported in the 2023 Science paper by Lam et al.

GraphCast outperformed ECMWF HRES on 90.3% of 1,380 standard verification targets — a striking result that forced operational meteorological agencies to reassess their modelling roadmaps.

4. What distinguishes Google's NeuralGCM from pure data-driven models like GraphCast?

Correct. NeuralGCM (2024) is a hybrid: a learned neural network replaces the expensive parameterisation schemes while a physics-based dynamical core enforces conservation of mass, energy, and momentum.

NeuralGCM's key innovation is hybridisation: it pairs learned neural-network parameterisations with a physics-based dynamical core. This preserves physical conservation laws while gaining the flexibility and speed of machine learning.

Lab 1 — GCM Architecture Explorer

Discuss the structure and trade-offs of physics-based vs. AI weather models with your lab assistant

Lab Objective

In this lab you will interrogate the design choices behind General Circulation Models and the first generation of AI weather models. Ask about grid resolution trade-offs, parameterisation schemes, the ERA5 training data, or the specific benchmarks that GraphCast, Pangu-Weather, and NeuralGCM achieved.

Suggested starting questions: "Why does resolution matter so much in GCMs?" · "What physical laws does NeuralGCM still enforce?" · "Could an MLWM trained on ERA5 safely project 2100 climate?"

Lesson 2 · AI in Science — Module 4

Downscaling, Bias Correction, and High-Resolution Climate

Using deep learning to translate coarse global projections into actionable regional detail

If the best global climate models still run at 25–100 km grids, how do we get the city-scale information that infrastructure planners actually need?

When the atmospheric river events of January 2023 struck California, dumping nearly a year's worth of precipitation in three weeks, emergency managers needed flood-inundation maps at neighbourhood scale. The CMIP6 global models that inform California's long-range water planning run at roughly 100 km resolution — a grid cell larger than the entire San Fernando Valley. Translating that global output into actionable local projections is the task of downscaling, and it has become one of the primary applications of deep learning in climate science.

Statistical vs. Dynamical Downscaling

Traditional downscaling divides into two approaches. Dynamical downscaling nests a high-resolution regional model (e.g., WRF at 4 km) inside a GCM, explicitly simulating local topographic and land-surface effects. It is physically consistent but computationally prohibitive — running a single 30-year regional simulation can take months of supercomputer time.

Statistical downscaling learns a transfer function from coarse GCM output to observed local climate variables, using historical station records as the target. Traditional statistical approaches (BCSD, delta-mapping) are fast but assume that the coarse-to-fine relationship is stationary over time — an assumption that breaks down in a changing climate.
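The delta method mentioned above can be sketched in a few lines. This is a minimal additive version that assumes a single variable such as temperature; the function name and the example numbers are hypothetical. The stationarity assumption is visible directly in the code: the local observations keep their historical shape and are only shifted by the coarse model's mean change.

```python
from statistics import mean
from typing import List, Sequence


def delta_downscale(
    obs_local: Sequence[float],
    gcm_hist: Sequence[float],
    gcm_future: Sequence[float],
) -> List[float]:
    """Additive delta method: shift local observations by the coarse change signal."""
    delta = mean(gcm_future) - mean(gcm_hist)  # coarse-grid climate change signal
    return [v + delta for v in obs_local]


if __name__ == "__main__":
    # Hypothetical values in degrees Celsius, for illustration only.
    station = [10.0, 12.0]
    print(delta_downscale(station, gcm_hist=[14.0, 16.0], gcm_future=[16.0, 18.0]))
```

Any change in local variability, or in how the local site responds to the large-scale state, is invisible to this method, which is the limitation the paragraph above describes.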

Deep-learning downscaling occupies a middle ground: it can learn highly nonlinear spatial relationships from large training datasets while running at inference speeds that enable probabilistic ensembles.

CMIP6: Coupled Model Intercomparison Project Phase 6. The coordinated multi-model ensemble used to produce the IPCC AR6 climate projections, with models running at 25–250 km resolution.

Super-Resolution: Deep-learning technique adapted from computer vision that learns to reconstruct high-resolution spatial detail from low-resolution input fields.

BCSD: Bias Correction and Spatial Disaggregation. A classical statistical downscaling method that adjusts GCM output distributions to match observed climatology before spatial interpolation.

Deep Learning Approaches to Downscaling

The most widely studied deep-learning downscaling architecture is the Convolutional Neural Network Super-Resolution (CNN-SR) approach, directly adapted from image super-resolution in computer vision. The network learns a mapping from low-resolution GCM fields to high-resolution observational gridded products (e.g., PRISM in the US, E-OBS in Europe). Applications include precipitation, temperature extremes, wind speed, and solar irradiance.
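To make the low-resolution to high-resolution mapping concrete, the sketch below builds a training pair by block-averaging a fine grid and then applies the simplest possible "downscaler", nearest-neighbour upsampling, as the baseline a super-resolution network would need to beat. Constructing pairs by coarsening the high-resolution target is an assumption made here for illustration; studies may instead pair real GCM output with observations. No neural network is included.

```python
from typing import List

Grid = List[List[float]]


def coarsen(field: Grid, factor: int) -> Grid:
    """Block-average a 2-D field; both dimensions must be divisible by factor."""
    rows, cols = len(field), len(field[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    out: Grid = []
    for i in range(0, rows, factor):
        row = []
        for j in range(0, cols, factor):
            block = [field[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def upsample_nearest(field: Grid, factor: int) -> Grid:
    """Repeat every cell factor times in each direction (no new detail added)."""
    return [[v for v in row for _ in range(factor)]
            for row in field for _ in range(factor)]


if __name__ == "__main__":
    fine = [[1.0, 3.0], [5.0, 7.0]]   # hypothetical high-resolution target
    coarse = coarsen(fine, 2)          # low-resolution network input
    print(coarse, upsample_nearest(coarse, 2))
```

The difference between the upsampled field and the original fine grid is exactly the sub-grid detail that a super-resolution model is trained to reconstruct.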

A landmark study by Sachindra et al. (2020) in the Journal of Hydrology demonstrated that LSTM-based downscaling of rainfall over Australia significantly outperformed conventional statistical methods on both mean climatology and extreme event frequency. A 2022 study led by William D. Collins's group at NCAR applied a U-Net architecture to downscale CESM2 precipitation over the Western US from 1° to 1/8° resolution, achieving skill scores comparable to dynamical downscaling at 1/100th the computational cost.

The DeepSD framework (Vandal et al., 2017) was one of the first systematic demonstrations that convolutional super-resolution could match or exceed BCSD for precipitation downscaling over CONUS, using PRISM data as the high-resolution training target.

Real Application — EU Copernicus

The EU's Copernicus Climate Change Service (C3S) has funded operational AI-based downscaling pipelines that translate seasonal forecast model output (ECMWF SEAS5) into country-level temperature and precipitation anomalies used directly by European agricultural and energy sector planners.

Bias Correction: The Persistent Challenge

Even state-of-the-art GCMs exhibit systematic biases — their simulated precipitation distributions, surface temperature trends, or sea-surface temperatures may deviate from observations by amounts that would swamp the forced climate signal. Bias correction is the post-processing step that aligns model output distributions with observed climatology.

Classical bias correction (quantile mapping, delta method) is purely statistical and cannot adjust for physical process errors — if a model misrepresents ENSO teleconnections, no statistical correction will fix the downstream precipitation bias it produces. Deep-learning bias correction, by contrast, can learn spatially coherent corrections that account for large-scale atmospheric patterns. The ISIMIP3BASD method (Lange 2019) established a sophisticated quantile-delta mapping benchmark; subsequent neural-network approaches have shown skill improvements particularly for heavy precipitation tails that are most relevant for flood risk.
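Empirical quantile mapping, the classical method named above, can be sketched as follows. This is a minimal nearest-rank version written for illustration; it is not the ISIMIP3BASD algorithm, and the function name is hypothetical. A model value is converted to its non-exceedance probability in the model's historical distribution and then replaced by the observed value at the same probability.

```python
import math
from bisect import bisect_right
from typing import Sequence


def quantile_map(x: float, model_hist: Sequence[float], obs_hist: Sequence[float]) -> float:
    """Map a model value onto the observed distribution at the same quantile."""
    model_sorted = sorted(model_hist)
    obs_sorted = sorted(obs_hist)
    # Non-exceedance probability of x under the model's historical distribution.
    p = bisect_right(model_sorted, x) / len(model_sorted)
    # Nearest-rank quantile of the observations at that probability,
    # clamped to the range of the observed sample.
    rank = math.ceil(p * len(obs_sorted))
    idx = min(max(rank - 1, 0), len(obs_sorted) - 1)
    return obs_sorted[idx]


if __name__ == "__main__":
    model = [4.0, 1.0, 3.0, 2.0]        # hypothetical biased model values
    obs = [10.0, 20.0, 30.0, 40.0]      # hypothetical observations
    print(quantile_map(3.0, model, obs))
```

Note that the correction acts on one value at a time and knows nothing about where or why the error arose, and that values beyond the historical range are simply clamped to the most extreme observation. Both properties illustrate why purely distributional methods struggle with process errors and heavy tails.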

Microsoft Research's Aurora model (2024) demonstrated that a foundation model pre-trained on multiple atmospheric datasets and then fine-tuned on specific regional observational records could simultaneously perform bias correction and downscaling, outperforming single-purpose models on several regional benchmarks.

Infrastructure Implication

The US Army Corps of Engineers' Hydrologic Engineering Center has begun integrating AI-downscaled climate projections into its HEC-HMS flood modelling workflows, allowing planners to assess flood frequency changes at individual watershed scale using CMIP6 ensemble output — a task previously requiring months of dynamical downscaling runs.

Lesson 2 Quiz

Downscaling and Bias Correction with Deep Learning

1. What is the primary computational advantage of deep-learning downscaling over dynamical downscaling?

Correct. A 2022 NCAR study found a U-Net downscaling model achieved skill comparable to WRF dynamical downscaling at approximately 1/100th the computational cost, enabling ensemble runs previously impossible.

The key advantage is computational efficiency. Deep-learning downscaling can match dynamical model skill at roughly 1/100th the cost, which is transformative because it enables probabilistic ensembles across many GCM members.

2. What was DeepSD (Vandal et al., 2017) designed to demonstrate?

Correct. DeepSD was one of the first systematic demonstrations that CNN-based super-resolution applied to climate downscaling could achieve skill comparable to or better than the established BCSD statistical benchmark, using PRISM as the target.

DeepSD (Vandal et al., 2017) specifically compared convolutional super-resolution downscaling against the BCSD benchmark for precipitation over CONUS, using PRISM observations as training targets — an early influential demonstration in deep-learning climate downscaling.

3. Why does classical bias correction (e.g., quantile mapping) fail to correct errors caused by GCM process misrepresentation?

Exactly right. Quantile mapping can align a model's output distribution to match observations statistically, but if the model misrepresents a physical process (e.g., ENSO teleconnections), the spatially coherent errors those processes produce persist after correction.

Classical bias correction is a statistical post-processing step — it can rescale or redistribute output values, but it cannot correct errors arising from misrepresented physical processes like teleconnections, which produce spatially coherent biases that no distributional mapping can fix.

4. What observational dataset is most commonly used as the high-resolution training target for precipitation downscaling over the continental United States?

Correct. PRISM provides spatially continuous 4 km gridded climate data for CONUS by incorporating topographic relationships with station observations, making it the standard high-resolution training target for US precipitation downscaling studies including DeepSD.

PRISM is the standard. It provides 4 km gridded precipitation and temperature data for CONUS by using regression relationships between elevation and station observations — the most commonly used high-resolution target in US downscaling research.

Lab 2 — Downscaling Methods Analyst

Work through downscaling scenarios and evaluate bias correction strategies with your AI assistant

Lab Objective

In this lab you will explore real-world downscaling decisions. Discuss the trade-offs between statistical and dynamical downscaling, evaluate when deep learning adds value, and reason through bias correction choices for specific applications.

Suggested scenarios: "I need city-scale flood projections for Houston — what downscaling approach would you recommend?" · "Why can't we just interpolate CMIP6 output to 1 km?" · "Explain the stationarity assumption and why it matters for climate downscaling."

Lesson 3 · AI in Science — Module 4

Emulators, Surrogates, and the Speed of Discovery

Training neural networks to mimic expensive climate model components — enabling millions of simulations where only hundreds were feasible

What happens to climate science when a simulation that took a week can be re-run in a second?

In 2021, ClimateBench, a standardised benchmark for climate model emulation, was introduced by a team led by Duncan Watson-Parris at the University of Oxford. The benchmark asked: given a time series of emissions inputs (CO₂, CH₄, SO₂, black carbon), can a machine-learning model predict the global patterns of surface temperature, precipitation, and diurnal temperature range that a full GCM would produce? The best-performing emulators outperformed simple pattern-scaling on nearly every metric, and some achieved near-GCM accuracy while running about 10,000 times faster than the full Earth System Model.

What Is a Climate Emulator?

A climate emulator (also called a surrogate model or reduced-complexity model) is a computationally cheap approximation of some component or output of a full Earth System Model. Emulators are trained by running the full model many times across a design-of-experiment input space and then fitting a statistical or machine-learning model to that input–output mapping.

Emulators serve multiple purposes in climate science:

1. Uncertainty quantification: Running a full CMIP6-class model for thousands of parameter combinations is infeasible. An emulator trained on a few hundred runs can stand in for the model across the millions of samples needed for Bayesian parameter estimation or sensitivity analysis.

2. Impact model coupling: Integrated assessment models (IAMs) that link climate projections to economic and social outcomes need rapid climate responses for thousands of socioeconomic scenarios. Emulators enable these coupled runs.

3. Parameterisation replacement: Individual model components — convective parameterisations, cloud microphysics schemes, aerosol–radiation interactions — can be replaced with neural-network emulators trained on high-resolution process-level simulations.
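The train-then-query workflow behind all three uses can be sketched with a toy example. Here `expensive_model` is a hypothetical stand-in for a full model run (the 3 K per CO₂ doubling is an illustrative assumption, not a result), and the emulator is plain piecewise-linear interpolation through a handful of designed runs rather than a neural network.

```python
import math
from bisect import bisect_right
from typing import Callable, List


def expensive_model(co2_ppm: float) -> float:
    """Toy stand-in for a full model run: warming (K) for a CO2 level.

    The 3 K per doubling figure is an illustrative assumption.
    """
    return 3.0 * math.log2(co2_ppm / 280.0)


def fit_emulator(xs: List[float], ys: List[float]) -> Callable[[float], float]:
    """Return a piecewise-linear interpolator through the runs (xs ascending)."""
    def emulate(x: float) -> float:
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("outside the training design; refusing to extrapolate")
        i = min(bisect_right(xs, x), len(xs) - 1)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return emulate


if __name__ == "__main__":
    design = [280.0, 420.0, 560.0, 840.0, 1120.0]  # designed "full model" runs
    emulator = fit_emulator(design, [expensive_model(x) for x in design])
    print(emulator(700.0), expensive_model(700.0))
```

The emulator is only queried inside the range spanned by its training runs. Refusing to extrapolate is a deliberate design choice in this sketch, since an emulator has no physical basis for behaviour outside its design space.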

Surrogate Model: A fast approximation of an expensive computational model, trained on a designed set of full-model runs and used to explore input space where direct simulation is infeasible.

ClimateBench: A 2021 standardised benchmark dataset and evaluation framework for climate model emulators, based on NorESM2 outputs across multiple emissions scenarios.

Parameterisation Emulation: Replacing an explicit physical parameterisation scheme in a GCM with a neural network trained to reproduce that scheme's outputs, enabling faster and potentially more accurate representations of sub-grid processes.

Documented Emulator Implementations

FaIR (Finite Amplitude Impulse Response model): A simple climate model widely used in IPCC reports for probabilistic temperature projections. Its neural-network variants can be constrained against observational uncertainty ranges to produce calibrated 21st-century ensembles far more rapidly than full GCMs.

NCAR's CAM-ML: In 2018, Noah Brenowitz and Christopher Bretherton at the University of Washington published neural-network parameterisations of atmospheric deep convection trained on coarse-grained output from cloud-resolving simulations. When inserted into NCAR's Community Atmosphere Model, the neural networks produced stable multi-year integrations that outperformed the default convective parameterisation on several diagnostics — a landmark result in online coupled ML emulation.

M²LInES (2021–present): A multi-institution collaboration (LDEO, GFDL, MIT, NYU, CNRS) focused on developing ML parameterisations for ocean mesoscale eddies — sub-grid features that fundamentally control heat uptake and carbon sequestration but cannot be resolved at typical GCM ocean resolutions of 1°. Their eddy-flux emulators have been tested online in GFDL's MOM6 ocean model.

ClimSim (2023): A benchmark dataset released by Sungduk Yu et al. containing ~160 TB of simulation data from E3SM-MMF (a multi-scale model with explicit convection) paired with its coarse-resolution input variables — designed specifically for training neural-network emulators of moist physics parameterisations. It became the largest publicly available dataset for parameterisation emulation.

Scale Comparison

Running 10,000 climate sensitivity experiments with a full CMIP6-class model at 1° resolution would require approximately 500 million CPU-hours. A trained neural-network emulator can perform the equivalent mapping in under 10 CPU-hours — enabling uncertainty quantification studies that were previously computationally impossible.

Online vs. Offline Emulation

A critical distinction in parameterisation emulation is between offline and online deployment. In offline evaluation, the neural network receives GCM state inputs from a pre-computed simulation and its outputs are evaluated against the target parameterisation — but the network never influences the model's subsequent state. This is straightforward and widely used for benchmarking.

In online deployment, the neural network is actually coupled into the running GCM, so its outputs drive the next model time step. This is far more challenging because small biases accumulate, and the model can drift into unphysical states. Several early online ML parameterisation experiments collapsed within days of simulation time. The Brenowitz–Bretherton 2018 work was notable precisely because it achieved stable multi-year online integration — a much harder target than offline accuracy.
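The offline/online gap can be illustrated with a one-variable toy system. Both step functions below are hypothetical: the "emulator" reproduces the true dynamics apart from a small constant bias of 0.05 per step. Offline, the emulator is always evaluated from the true state, so its error never exceeds that bias. Online, it runs on its own outputs and the bias accumulates.

```python
def true_step(x: float) -> float:
    """Toy 'truth': relaxes towards a fixed point at x = 10."""
    return 0.9 * x + 1.0


def emulated_step(x: float) -> float:
    """Toy emulator: same dynamics plus a small constant bias of 0.05."""
    return 0.9 * x + 1.05


def offline_error(x0: float, n_steps: int) -> float:
    """Worst one-step error when the emulator always starts from the true state."""
    x, worst = x0, 0.0
    for _ in range(n_steps):
        worst = max(worst, abs(emulated_step(x) - true_step(x)))
        x = true_step(x)
    return worst


def online_error(x0: float, n_steps: int) -> float:
    """Final error when the emulator is fed its own previous output."""
    x_true = x_emul = x0
    for _ in range(n_steps):
        x_true = true_step(x_true)
        x_emul = emulated_step(x_emul)
    return abs(x_emul - x_true)


if __name__ == "__main__":
    print(offline_error(0.0, 50), online_error(0.0, 50))
```

In this damped toy system the online error saturates at roughly ten times the one-step bias. A real coupled model can amplify errors instead of damping them, which is how online runs drift into unphysical states.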

Physical consistency constraints, such as enforcing conservation of energy and moisture, are now recognised as essential for stable online ML parameterisations. Several groups (including Tom Beucler's group at the University of Lausanne) have developed constraint architectures that build conservation laws directly into the network's output layer.
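One simple way to impose a hard conservation constraint is a final correction step that removes any budget residual from the raw outputs. The sketch below applies a uniform shift so that the mass-weighted column sum of the predicted tendencies matches a required total. It is a generic illustration with hypothetical names, not the specific architecture of any published group.

```python
from typing import List, Sequence


def enforce_column_budget(
    raw: Sequence[float],
    weights: Sequence[float],
    target: float = 0.0,
) -> List[float]:
    """Shift raw outputs uniformly so that sum(w_i * y_i) equals target exactly."""
    if len(raw) != len(weights):
        raise ValueError("raw and weights must have the same length")
    total_weight = sum(weights)
    # Budget residual of the unconstrained prediction.
    residual = sum(w * y for w, y in zip(weights, raw)) - target
    shift = residual / total_weight
    return [y - shift for y in raw]


if __name__ == "__main__":
    raw_tendencies = [1.0, 2.0, 3.0]   # hypothetical network outputs per layer
    layer_mass = [1.0, 1.0, 2.0]       # hypothetical layer mass weights
    constrained = enforce_column_budget(raw_tendencies, layer_mass)
    print(constrained, sum(w * y for w, y in zip(layer_mass, constrained)))
```

Because the correction is applied after the learned layers, the constraint holds exactly for every input, including inputs far from the training data, which is the property that matters for online stability.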

ClimSim Dataset — 2023

Released by Sungduk Yu et al. (NeurIPS 2023, Datasets and Benchmarks track), ClimSim contains ~160 TB of paired coarse/fine-resolution atmospheric simulation data from E3SM-MMF. It is the largest public benchmark dataset for training neural-network moist physics emulators and was specifically designed to facilitate reproducible comparisons across research groups.

Lesson 3 Quiz

Climate Emulators and Surrogate Models

1. What was the primary innovation of the Brenowitz & Bretherton (2018) neural-network parameterisation experiment?

Correct. Many neural-network parameterisations perform well offline but collapse within days when coupled into the running model. Brenowitz and Bretherton achieved stable multi-year online integration of a convection emulator in NCAR's CAM — a significantly harder and more scientifically valuable result.

The landmark achievement was online stability. Most early neural-network parameterisations worked offline but collapsed when actually coupled into the GCM. Achieving stable multi-year online integration in NCAR's CAM was the critical advance that demonstrated ML parameterisations could actually work inside running models.

2. What does the M²LInES collaboration primarily aim to emulate?

Correct. M²LInES (2021–present) focuses on developing ML parameterisations for ocean mesoscale eddies — typically 10–100 km features that standard 1° GCM ocean grids cannot resolve but which critically influence heat and carbon sequestration. Their eddy-flux emulators have been tested online in GFDL's MOM6.

M²LInES (a collaboration of LDEO, GFDL, MIT, NYU, CNRS) is focused on ocean mesoscale eddy parameterisations. These sub-grid eddies — unresolvable at standard 1° GCM ocean resolution — fundamentally control how much heat and carbon the ocean absorbs from the atmosphere.

3. Why is "online" emulator deployment considered harder than "offline" evaluation?

Exactly right. In offline mode the emulator never influences the model — any errors are inconsequential. Online, every output drives the next timestep, so even small biases can compound through error accumulation, leading the model into unphysical states and often numerical instability within days.

The key issue is feedback. Offline, the emulator is evaluated against pre-computed GCM states — errors don't matter beyond the evaluation. Online, the emulator's outputs drive subsequent model states, so biases compound through time, often causing the model to drift into unphysical regimes within days.

4. What is the approximate size of the ClimSim benchmark dataset, and what makes it distinctive?

Correct. ClimSim (Yu et al., NeurIPS 2023) is ~160 TB of paired coarse/fine-resolution simulation data from the E3SM Multi-Scale Modeling Framework, designed specifically to benchmark neural-network moist physics emulators with reproducible protocols.

ClimSim is ~160 TB — the largest public benchmark dataset for parameterisation emulation. What makes it distinctive is that it pairs coarse-resolution input variables with fine-resolution E3SM-MMF outputs specifically to train and benchmark neural-network moist physics parameterisations.

Lab 3 — Climate Emulator Design Studio

Design neural-network emulators for climate model components with your AI assistant

Lab Objective

In this lab you will reason through emulator design decisions: what components to emulate, what training data to use, how to ensure online stability, and how to enforce physical conservation constraints. Apply lessons from ClimateBench, ClimSim, and the M²LInES project.

Suggested prompts: "I want to build a neural-network emulator for ocean eddy heat flux — what architecture would you choose?" · "How would you enforce energy conservation in a convection emulator?" · "Walk me through how ClimateBench measures emulator skill."

Lesson 4 · AI in Science — Module 4

Extreme Events, Attribution, and Operational Climate Services

From detecting tipping points to delivering actionable seasonal forecasts — AI at the frontier of applied climate science

When the 2021 Pacific Northwest heat dome killed more than 600 people, how quickly could AI help determine whether it was made more likely by climate change?

The June 2021 Pacific Northwest heat dome set all-time temperature records across British Columbia, Oregon, and Washington. Lytton, BC reached 49.6°C on 29 June, 4.6°C above the previous Canadian national record. Shortly after the event ended, the World Weather Attribution team published a rapid attribution study concluding the event would have been "virtually impossible" without human-caused climate change, which made it at least 150 times more likely than in a pre-industrial climate. This study was completed in roughly two weeks, a timeline that depended critically on ensemble climate model runs and statistical extreme-value analysis that AI methods are now accelerating further.
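A statement such as "150 times more likely" is a probability ratio: the chance of exceeding a threshold in the factual (present-day) climate divided by the chance in a counterfactual (pre-industrial) climate. The sketch below estimates it by simple counting in two ensembles. The numbers are invented for illustration, and real attribution studies fit extreme-value distributions rather than counting raw exceedances.

```python
from typing import Sequence


def exceedance_probability(samples: Sequence[float], threshold: float) -> float:
    """Fraction of ensemble members that exceed the threshold."""
    return sum(1 for v in samples if v > threshold) / len(samples)


def probability_ratio(
    factual: Sequence[float],
    counterfactual: Sequence[float],
    threshold: float,
) -> float:
    """PR = P(exceed | factual climate) / P(exceed | counterfactual climate)."""
    p0 = exceedance_probability(counterfactual, threshold)
    if p0 == 0.0:
        raise ValueError("no counterfactual exceedances; PR is unbounded")
    return exceedance_probability(factual, threshold) / p0


def fraction_attributable_risk(pr: float) -> float:
    """FAR = 1 - 1/PR, the share of the event's probability due to the forcing."""
    return 1.0 - 1.0 / pr


if __name__ == "__main__":
    # Hypothetical peak temperatures (degrees Celsius) from two small ensembles.
    factual = [30.0, 35.0, 41.0, 42.0, 43.0]
    counterfactual = [30.0, 31.0, 32.0, 33.0, 41.0]
    pr = probability_ratio(factual, counterfactual, threshold=40.0)
    print(pr, fraction_attributable_risk(pr))
```

The `ValueError` branch reflects a real difficulty: when an event never occurs in the counterfactual ensemble, counting alone cannot bound the ratio, which is one reason large ensembles and fitted tail distributions matter.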

AI-Assisted Extreme Event Detection

Identifying, classifying, and tracking extreme weather events in large climate model ensembles is a fundamental task in attribution science. Traditional approaches used threshold-based detection algorithms — define a "heat wave" as temperatures exceeding a fixed percentile for a minimum number of consecutive days, then count events. Neural-network approaches can learn more nuanced event definitions directly from reanalysis, identifying atmospheric circulation patterns associated with compound extremes rather than single-variable exceedances.
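The threshold-based approach described above can be sketched as a run-length search: find every stretch of consecutive days above a threshold that lasts at least a minimum number of days. In this sketch the threshold is passed in as a fixed value (in practice it would be derived from a climatological percentile), and the function name and example temperatures are hypothetical.

```python
from typing import List, Sequence, Tuple


def detect_heatwaves(
    temps: Sequence[float],
    threshold: float,
    min_days: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run above threshold of >= min_days."""
    events: List[Tuple[int, int]] = []
    run_start = None
    # A sentinel below any threshold closes a run that reaches the last day.
    for day, t in enumerate(list(temps) + [float("-inf")]):
        if t > threshold:
            if run_start is None:
                run_start = day
        elif run_start is not None:
            if day - run_start >= min_days:
                events.append((run_start, day - run_start))
            run_start = None
    return events


if __name__ == "__main__":
    daily_max = [30, 36, 37, 38, 31, 36, 36, 30, 39, 39, 39, 39]
    print(detect_heatwaves(daily_max, threshold=35, min_days=3))
```

The rule looks at one variable at one location, so a compound event such as moderate heat combined with drought passes through undetected. That gap is what learned detectors aim to close.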

TempestExtremes (Ullrich and Zarzycki, 2017) established an open-source framework for climate model extreme event detection that has been widely used across CMIP6 studies. Building on this foundation, deep-learning classifiers trained on ERA5 have demonstrated superior skill at detecting tropical cyclone precursor patterns, atmospheric river corridors, and heat-dome circulation anomalies compared to threshold-based methods.

A 2023 study published in Nature Climate Change by Gabriel Vecchi's group at Princeton used a CNN trained on ERA5 to detect tropical cyclone tracks across 40 CMIP6 model outputs, enabling a systematic multi-model assessment of how the distribution of Atlantic hurricane intensities shifts under 2°C and 4°C warming scenarios — a scale of analysis previously requiring years of manual processing.

Event Attribution: The scientific process of estimating how much human-caused climate change altered the probability or intensity of a specific observed extreme weather event, typically using ensemble climate model experiments.

Compound Extremes: Events where multiple climate hazards occur simultaneously or sequentially (e.g., simultaneous heat and drought), producing impacts greater than any single hazard alone. They are harder to detect with single-variable thresholds.

World Weather Attribution: An international scientific collaboration (led by Friederike Otto and colleagues) that conducts rapid attribution studies of extreme weather events, typically published within 1–4 weeks of an event.

Tipping Point Detection and Early Warning

Climate tipping points are threshold transitions in Earth system components, such as a collapse of the Atlantic Meridional Overturning Circulation (AMOC), Amazon dieback, or West Antarctic Ice Sheet collapse. They represent some of the most consequential and least understood risks in climate science. Traditional early-warning signals (rising variance, slowing recovery from perturbations) are theoretically grounded but observationally weak and difficult to distinguish from natural variability in short instrumental records.
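The two classical early-warning indicators can be computed with a sliding window: variance, and lag-1 autocorrelation as a proxy for slowing recovery from perturbations. The sketch below is a minimal version with hypothetical function names. It assumes the series has already been detrended, which real analyses must handle explicitly.

```python
from statistics import mean, pvariance
from typing import Callable, List, Sequence


def lag1_autocorrelation(xs: Sequence[float]) -> float:
    """Lag-1 autocorrelation; values rising towards 1 suggest slower recovery."""
    m = mean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    if den == 0:
        raise ValueError("constant series has undefined autocorrelation")
    return num / den


def rolling(
    xs: Sequence[float],
    window: int,
    stat: Callable[[Sequence[float]], float],
) -> List[float]:
    """Apply stat to every full window of the series, in order."""
    return [stat(xs[i:i + window]) for i in range(len(xs) - window + 1)]


if __name__ == "__main__":
    series = [0.0, 0.0, 0.0, 3.0, -3.0]   # hypothetical detrended anomalies
    print(rolling(series, 3, pvariance))
    print(lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0]))
```

A sustained upward trend in both rolling statistics is the classical warning sign. In a short, noisy record the trend is hard to separate from natural variability, which is the weakness noted in the paragraph above.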

In 2023, Peter Ditlevsen and Susanne Ditlevsen published a study in Nature Communications applying statistical early-warning analysis (trends in variance and autocorrelation) to an Atlantic SST-based AMOC proxy. They concluded that the AMOC may be approaching a critical transition, with collapse estimated around mid-century and a 95% confidence range of 2025 to 2095 under current emission trajectories. The conclusion generated significant scientific debate.

Deep-learning early-warning systems have been developed by Alistair Duffey and colleagues (UCL, 2024) using Long Short-Term Memory networks trained on synthetic time series from tipping-point models. These systems can distinguish between genuine critical slowing down and noise amplification earlier than classical statistical tests — though false-alarm rates remain a significant challenge when applied to observational records.

Operational Seasonal Forecasting — ECMWF SEAS5

ECMWF's SEAS5 seasonal forecast system now incorporates post-processing using neural networks trained to correct systematic biases in 2-metre temperature and precipitation anomalies across user-relevant spatial scales. This AI-assisted bias correction is applied operationally for Copernicus Climate Change Service products used by European national meteorological services.

Operational Climate Services: From Science to Decision

The ultimate test of AI in climate science is operational deployment — integration into services that real decision-makers use. Several documented examples illustrate this transition:

Google DeepMind's GenCast (2024): A diffusion-based probabilistic weather model that generates ensemble forecasts capturing uncertainty structure comparable to ECMWF's 50-member ensemble at 1/10th the compute cost. GenCast demonstrated superior skill at forecasting extreme wind events relevant for the renewable energy sector.

NOAA's AI-enhanced hurricane track guidance: Since 2021, NOAA's Environmental Modeling Center has incorporated AI-based track consensus models (GPCE and related systems) as official guidance products in the National Hurricane Center's operational forecast workflow. These models demonstrated skill advantages over traditional track consensus during the 2023 hurricane season.

IBM Environmental Intelligence Suite: Deploys IBM's foundation climate model (trained on 2 petabytes of geospatial and historical weather data) to provide agricultural drought risk assessments, wildfire spread prediction, and energy demand forecasting to commercial and government clients across 150 countries.

Climate TRACE: An AI-driven coalition (Google, WattTime, Rocky Mountain Institute, and others) that uses satellite imagery, remote sensing, and machine learning to produce independent sector-by-sector greenhouse gas emission inventories at facility level — providing accountability verification that complements national government reporting.

Documented Impact — 2023 Turkey–Syria Earthquake

Earthquakes are not a climate hazard, and AI climate models were not a primary response tool, but the response illustrates a broader pattern of technique transfer. Following the February 2023 earthquake, AI-based debris-flow and flood risk models trained on high-resolution topography and soil data were used by UN agencies to rapidly identify secondary hazard zones in the affected region, an application of the same deep-learning spatial analysis techniques developed for climate downscaling.

Open Challenges and the Path Forward

Despite rapid progress, critical open problems remain. Uncertainty quantification in deep-learning climate models is poorly developed compared to the probabilistic ensemble framework of traditional GCMs. Physical interpretability — understanding why a neural network makes a particular prediction — is essential for scientific credibility but remains limited for large models. Non-stationarity is a persistent concern: models trained on historical climate may fail when deployed under future forcing conditions outside their training distribution.

The community consensus emerging from workshops at ECMWF, NCAR, and the World Meteorological Organization is that the most productive path forward is hybrid modelling: AI components embedded within physics-constrained frameworks, where the strengths of data-driven flexibility and process-based interpretability are combined rather than forced to compete.
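One way to picture "AI components embedded within physics-constrained frameworks" is a time step in which a learned correction is added to a physics tendency only after being projected onto a conserving form. The sketch below is a toy under stated assumptions: the single moisture column, the diffusive "physics", the random weight matrix standing in for a trained network, and every function name are hypothetical, and the mean-removal projection is a simple stand-in rather than any published architecture.

```python
import numpy as np

def conserve_column(raw, weights):
    """Subtract the weighted mean so the weighted column sum of the correction is zero."""
    raw = np.asarray(raw, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return raw - np.sum(weights * raw) / np.sum(weights)

def physics_tendency(q, weights, kappa=0.1):
    """Hypothetical physics core: diffusive exchange between adjacent layers.

    Each flux leaves one layer and enters its neighbour, so the
    weighted column total is unchanged.
    """
    flux = kappa * np.diff(q)
    tend = np.zeros_like(q)
    tend[:-1] += flux / weights[:-1]
    tend[1:] -= flux / weights[1:]
    return tend

def learned_correction(q, W):
    """Stand-in for a trained network: one dense layer with fixed weights."""
    return np.tanh(W @ q)

def hybrid_step(q, weights, W, dt):
    """Advance one step: physics tendency plus a conservation-projected ML correction."""
    correction = conserve_column(learned_correction(q, W), weights)
    return q + dt * (physics_tendency(q, weights) + correction)

rng = np.random.default_rng(1)
weights = np.linspace(1.0, 2.0, 10)        # assumed layer masses
W = 0.1 * rng.normal(size=(10, 10))        # assumed (untrained) network weights
q = rng.random(10)                         # initial moisture profile

total_before = np.sum(weights * q)
for _ in range(100):
    q = hybrid_step(q, weights, W, dt=0.1)
total_after = np.sum(weights * q)
print(f"column total before: {total_before:.6f}, after: {total_after:.6f}")
```

Because the projection is applied to whatever the network outputs, the column total is preserved regardless of how well the network was trained. That is the design argument for building constraints into the model structure instead of relying on the training loss to discover them.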

Lesson 4 Quiz

Extreme Events, Attribution, and Operational AI Climate Services

1. What was the World Weather Attribution team's conclusion about the June 2021 Pacific Northwest heat dome?

Correct. The WWA rapid attribution study concluded the Pacific Northwest heat dome was "virtually impossible" without anthropogenic climate change, and was made at least 150 times more likely by human emissions — a striking quantitative attribution for an event that killed 600+ people.

The WWA study's conclusion was stark: the event was "virtually impossible" without human-caused climate change and was at least 150 times more likely than in a pre-industrial climate. This made it one of the clearest-ever attribution results for an extreme heat event.

2. What makes GenCast (DeepMind, 2024) particularly valuable for the renewable energy sector?

Correct. GenCast generates ensemble forecasts comparable to ECMWF's 50-member ensemble but at approximately 1/10th the compute cost. Its demonstrated skill at extreme wind event forecasting is specifically relevant for wind energy producers managing grid stability risk.

GenCast (2024) is a diffusion-based probabilistic forecast model. Its relevance for renewables comes from its superior skill at forecasting extreme wind events — critical for wind farm operators managing production uncertainty — at dramatically lower compute cost than physics-based ensembles.

3. What does Climate TRACE use to produce independent greenhouse gas emission inventories?

Correct. Climate TRACE (a coalition including Google, WattTime, and RMI) uses satellite imagery and machine-learning analysis to produce sector-specific, facility-level GHG inventories that provide independent verification of national government emissions reports — critical accountability infrastructure for climate agreements.

Climate TRACE uses satellite imagery, remote sensing data, and machine learning to independently estimate emissions at facility level — providing a third-party verification layer for national government reporting under the Paris Agreement, without relying on the government data it is meant to check.

4. According to emerging community consensus, what is the most promising path forward for AI in climate modelling?

Correct. The consensus from ECMWF, NCAR, and WMO workshops is that hybrid models — combining learned components with physical conservation constraints — offer the best path forward, leveraging the strengths of both approaches rather than treating them as competitors.

The emerging consensus from major meteorological institutions (ECMWF, NCAR, WMO) is hybrid modelling: AI components embedded within physics-constrained frameworks. Pure data-driven replacement risks extrapolation failures for future climate states; pure physics models are too slow for large ensembles.

Lab 4 — Extreme Event Attribution Analyst

Work through attribution methodology and operational climate service design with your AI assistant

Lab Objective

In this lab you will apply attribution science concepts to real extreme events, evaluate the methodological choices behind rapid attribution studies, and reason about how AI tools are changing operational climate services. Engage with the World Weather Attribution framework, GenCast, and Climate TRACE approaches.

Suggested prompts: "Walk me through how the WWA team attributed the 2021 Pacific Northwest heat dome in two weeks." · "What are the main uncertainty sources in probabilistic event attribution?" · "How would you use Climate TRACE data to verify a country's Paris Agreement commitments?"

AI Lab Assistant

Attribution & Climate Services

Welcome to Lab 4. I specialise in extreme event attribution and operational AI climate services. We can work through attribution methodology, discuss uncertainty in probabilistic attribution, or explore how AI tools like GenCast and Climate TRACE are changing decision-relevant climate science. What would you like to explore?

Module 4 — Final Test

Climate and Earth System Modeling · 15 Questions · Pass at 80%

1. Who led the team that ran the first numerical weather forecast on ENIAC in 1950?

Correct. Jule Charney led the Princeton IAS team that produced the first numerical weather forecast on ENIAC on 5 April 1950.

Jule Charney led the April 1950 ENIAC forecast. Von Neumann provided institutional support; Manabe later pioneered coupled atmosphere-ocean GCMs at GFDL.

2. ERA5 covers atmospheric data from approximately what time period?

Correct. ERA5 extends back to 1940, providing 80+ years of globally coherent atmospheric data — a critical feature for training AI weather models on long historical records.

ERA5 covers 1940 to present. Its predecessor reanalysis, ERA-Interim, covered only 1979 onward, and ERA5 itself was later extended back to 1940, providing a much longer training window for AI weather models.

3. Pangu-Weather was developed by which organisation?

Correct. Pangu-Weather was developed by Huawei's research team and published in Nature in 2023, using a 3D Earth-Specific Transformer (3DEST) trained on ERA5.

Pangu-Weather was developed by Huawei and published in Nature (2023). DeepMind produced GraphCast; ECMWF remains the primary physics-based operational centre.

4. What is the approximate horizontal resolution of CMIP6-class global climate models?

Correct. CMIP6 models range from about 25 km (high-resolution configurations such as those run for HighResMIP) to 250 km (lower-resolution models), with most running near 50–100 km, which is too coarse to resolve individual storms or urban heat islands.

CMIP6 models span roughly 25–250 km resolution. Most standard models run near 50–100 km. This coarseness is why downscaling is needed to produce city- or watershed-level projections.

5. What US observational dataset is the standard high-resolution training target for precipitation downscaling in the continental US?

Correct. PRISM (Parameter-elevation Regressions on Independent Slopes Model) provides 4 km gridded precipitation and temperature data for CONUS and is the standard training target for US downscaling studies.

PRISM is the standard — it provides 4 km spatially continuous climate grids for CONUS using regression relationships between elevation and station observations, making it ideal as a high-resolution training target.

6. The DeepSD downscaling framework (Vandal et al., 2017) is primarily based on what type of neural network architecture?

Correct. DeepSD adapted CNN super-resolution techniques from computer vision, learning to map low-resolution GCM precipitation to high-resolution PRISM observations.

DeepSD used convolutional super-resolution — a computer-vision technique adapted for climate downscaling, learning spatial detail recovery from low-resolution GCM fields to high-resolution PRISM observations.

7. What is the fundamental risk of assuming "stationarity" in statistical climate downscaling?

Exactly right. Stationarity assumes the statistical relationship between large-scale and local climate is constant through time. Under anthropogenic forcing, physical processes can change (e.g., shifted storm tracks, altered moisture transport), invalidating the historical training relationship.

The stationarity assumption says the coarse-to-fine relationship learned from historical data holds in future conditions. This breaks down if climate change alters the physical processes governing local climate — shifted storm tracks, changed atmospheric circulation patterns, and altered moisture regimes all violate stationarity.

8. ClimateBench (2021) is a standardised benchmark for evaluating what specific AI capability?

Correct. ClimateBench measures how well AI emulators can predict the global spatial patterns of temperature, precipitation, and other variables that a full Earth System Model would produce for a given emissions trajectory.

ClimateBench evaluates climate model emulators — specifically, whether an AI model can predict the geographic patterns of climate response (temperature, precipitation, diurnal range) that a full GCM would produce for a given emissions input scenario.

9. What does "online" deployment mean in the context of neural-network climate parameterisations?

Correct. Online deployment means the neural network is integrated into the GCM time loop — its parameterisation outputs directly influence the model's next state, creating feedback that can cause biases to accumulate and simulations to destabilise.

Online means coupled into the running model. The neural network receives the current model state, outputs a parameterised tendency, and that tendency feeds into the next model timestep — creating the feedback loop that makes online stability so much harder to achieve than offline accuracy.

10. The M²LInES collaboration focuses ML parameterisation development specifically on what component of the Earth system?

Correct. M²LInES (involving LDEO, GFDL, MIT, NYU, CNRS) develops ML parameterisations for ocean mesoscale eddies — sub-grid features at standard 1° GCM resolution that are critical determinants of heat uptake and carbon sequestration.

M²LInES targets ocean mesoscale eddies — 10–100 km oceanic features that are unresolvable at standard GCM ocean resolution (typically 1°) but which crucially control how much heat and CO₂ the ocean absorbs from the atmosphere.

11. What temperature record did Lytton, British Columbia set during the June 2021 Pacific Northwest heat dome?

Correct. Lytton reached 49.6°C on June 29, 2021, which is 4.6°C above the previous Canadian record of 45.0°C. The village burned down the following day in a wildfire. It remains Canada's all-time temperature record.

Lytton reached 49.6°C on June 29, 2021, which is 4.6°C above the previous national record of 45.0°C. Canada's all-time temperature record was shattered by a margin that would have seemed physically impossible to most climatologists before the event.

12. Climate TRACE uses which combination of technologies to verify greenhouse gas emissions independently of government reporting?

Correct. Climate TRACE (Google, WattTime, RMI, and partners) uses satellite imagery and ML analysis to produce facility-level GHG inventories independently — providing accountability verification for Paris Agreement commitments.

Climate TRACE uses satellite imagery, remote sensing, and machine learning. This combination allows facility-level emission estimates that do not depend on self-reported government data — a critical independence requirement for international climate agreement verification.

13. What makes GenCast (DeepMind, 2024) a probabilistic rather than deterministic forecast model?

Correct. GenCast is a diffusion-based generative model — it learns the distribution of future atmospheric states and samples ensemble members from that distribution, analogous to how physics-based ensemble prediction systems generate member diversity through initial condition perturbations.

GenCast is a diffusion model — a generative approach that learns the probability distribution of future atmospheric states. Each call to the model samples a different ensemble member from this distribution, enabling probabilistic forecasting without running multiple perturbed copies of a deterministic model.

14. Tom Beucler's group at the University of Lausanne developed what type of architectural constraint for ML parameterisations?

Correct. Beucler's group developed architectures where conservation of energy and moisture is enforced by construction in the output layer — rather than hoped for as an emergent property of training — which is critical for online stability in coupled GCM integrations.

Beucler's group developed architectures that enforce conservation constraints (energy, moisture) directly in the output layer by construction. This is more reliable than hoping conservation emerges from training and is essential for preventing unphysical drift in online GCM integration.

15. What is the primary reason the emerging community consensus favours hybrid AI-physics models over pure data-driven replacement of GCMs for climate projection?

Exactly right. Pure data-driven models cannot reliably extrapolate to climate states outside their ERA5 training window — a fatal limitation for century-scale projection. Pure physics models are too expensive for uncertainty quantification. Hybrid models preserve physical constraints where they matter most while gaining ML flexibility and speed where it is most valuable.

The hybrid consensus reflects a genuine trade-off: pure machine-learning weather models (MLWMs) may fail to extrapolate to future climate states outside their training distribution, while pure physics GCMs are computationally prohibitive for the ensemble sizes needed for uncertainty quantification. Hybrid models combine physical constraints with ML efficiency and flexibility.