In March and April 1950, the ENIAC computer at Aberdeen Proving Ground ran the first numerical weather forecasts in history. A team led by Jule Charney at the Institute for Advanced Study in Princeton produced 24-hour forecasts of the pressure field over North America. Each 24-hour forecast took about 24 hours to compute: real-time parity at best. Charney called it a proof of concept; it would take three more decades before operational numerical weather prediction outperformed experienced human forecasters.
General Circulation Models (GCMs) grew from that lineage. By the 1970s, Syukuro Manabe and colleagues at GFDL had coupled an atmospheric model to a simple ocean, producing the first projections of CO₂-forced warming. The physics (fluid dynamics, thermodynamics, radiative transfer) was well understood. The bottleneck was always computation and resolution.
A GCM divides the atmosphere, ocean, land surface, and sea ice into a three-dimensional grid of cells — typically 50–100 km horizontally and dozens of vertical layers. Within each cell, the model solves discretised versions of the Navier–Stokes equations for fluid motion, the radiative transfer equation for energy transport, and parameterisation schemes that approximate sub-grid processes such as convective clouds, ocean eddies, and soil moisture.
The resulting systems contain tens of millions of coupled equations. A single 100-year simulation at 25 km resolution on NCAR's Cheyenne supercomputer consumed roughly 35 million core-hours of compute time. This computational cost is the central constraint of climate science: you cannot run enough ensemble members to fully sample uncertainty, and you cannot afford to run at the resolution needed to resolve convective systems explicitly.
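To get a feel for why resolution is so expensive, the arithmetic can be sketched in a few lines. This is a back-of-envelope illustration, not a description of any particular model: it assumes uniform square cells, an Earth surface area of about 5.1 × 10⁸ km², an illustrative 50 vertical levels, and the common rule of thumb that halving the grid spacing quadruples the number of columns and also requires roughly halving the time step, so cost grows with the cube of the refinement factor.

```python
EARTH_SURFACE_KM2 = 5.1e8  # approximate surface area of the Earth in km^2


def n_columns(dx_km):
    """Approximate number of grid columns for square cells of side dx_km."""
    return EARTH_SURFACE_KM2 / dx_km ** 2


def n_cells(dx_km, n_levels):
    """Columns times vertical levels (n_levels is an illustrative choice)."""
    return n_columns(dx_km) * n_levels


def relative_cost(dx_km, ref_dx_km=100.0):
    """Rule-of-thumb cost relative to a reference grid spacing.

    Assumes cost scales with the number of columns (two horizontal
    dimensions) times the number of time steps, and that the time step
    shrinks in proportion to the grid spacing.
    """
    return (ref_dx_km / dx_km) ** 3


for dx in (100, 50, 25):
    print(f"{dx:>4} km: {n_cells(dx, 50):.2e} cells, "
          f"~{relative_cost(dx):.0f}x the cost of a 100 km run")
```

Under these assumptions, moving from 100 km to 25 km multiplies the cost by about 64, which is why both ensemble size and resolution are rationed.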
The ERA5 reanalysis, released by ECMWF in 2017–2019, changed the data landscape. ERA5 provides hourly, globally gridded atmospheric, land-surface, and sea-state data from 1940 to present at 31 km resolution: in effect, a physically coherent reconstruction of global weather over more than eight decades. This ~5 petabyte dataset became the training corpus for the first generation of AI weather and climate models.
ECMWF's ERA5 reanalysis (2017–2019) provides 80+ years of globally coherent atmospheric data at 31 km resolution. Its ~5 petabyte dataset became the foundation training corpus for every major AI weather model developed between 2022 and 2025.
Traditional GCMs are process-based: every equation encodes a physical law. This gives them strong interpretability and the ability to extrapolate to climate states outside the training distribution — critical when projecting 2100 conditions that have no historical analogue. But they are slow, expensive, and heavily dependent on parameterisations whose tuning introduces systematic biases.
Data-driven or machine-learning weather models (MLWMs) take a different approach. They learn a mapping from one atmospheric state to the next, trained entirely on reanalysis data. They have no explicit fluid equations; instead, the weights of a neural network encode the statistical relationships in ERA5. This makes them orders of magnitude faster at inference time — and, as of 2023–2024, competitive or superior on standard skill metrics for medium-range forecasting.
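The "one atmospheric state to the next" mapping is typically applied autoregressively: each predicted state becomes the input for the following step. The sketch below shows only that control flow. `toy_step` is a hypothetical stand-in for a trained network (here a damped shift of a four-point periodic field), and nothing in it reflects the internals of any real model.

```python
def rollout(step_fn, initial_state, n_steps):
    """Apply a one-step map repeatedly, feeding each output back in as input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step_fn(states[-1]))
    return states


def toy_step(state):
    """Hypothetical stand-in for a trained network.

    Shifts a 1-D periodic field by one grid point and damps it by 10%.
    """
    n = len(state)
    return [0.9 * state[(i - 1) % n] for i in range(n)]


trajectory = rollout(toy_step, [1.0, 0.0, 0.0, 0.0], n_steps=4)
for lead, state in enumerate(trajectory):
    print(lead, [round(value, 4) for value in state])
```

Because every step consumes the previous prediction, any error made early in the rollout is carried forward, which is one reason skill is always reported as a function of lead time.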
The critical question for climate projection (as opposed to weather forecasting) is whether these statistical relationships remain valid decades hence, under greenhouse-gas forcing that has no parallel in the ERA5 training window. This is the unresolved frontier the field is actively debating.
In November 2022, Huawei's research team released Pangu-Weather, a 3D Earth-specific transformer trained on ERA5 that matched or exceeded ECMWF's operational deterministic forecast on multiple standard metrics up to 7 days lead time. DeepMind's GraphCast followed (preprint December 2022, published in Science in November 2023). Trained on ERA5 from 1979 to 2017, it outperformed ECMWF's HRES model on about 90% of the 1,380 standard verification targets. Both models produce a global forecast in under a minute on a single accelerator (GPU or TPU), compared to hours on ECMWF's HPC clusters.
Google's NeuralGCM (2024) attempted to bridge the divide: a hybrid model that couples a differentiable dynamical core, which solves the large-scale equations of atmospheric motion, to neural networks that learn flexible sub-grid parameterisations from data. The resolved dynamics therefore stay physics-based while the parameterised processes are learned. NeuralGCM runs 100-member ensemble simulations 1000× faster than comparable physics-only models.
These systems are not replacing GCMs for century-scale climate projection. But they are fundamentally transforming what is computationally feasible for seasonal forecasting, ensemble uncertainty quantification, and rapid impact assessment.
GraphCast (DeepMind, 2023) outperformed ECMWF's operational HRES forecast on 90.3% of 1,380 verification targets across all lead times from 6 hours to 10 days. Its inference time for a global 10-day forecast: approximately 60 seconds on a single TPU.
In this lab you will interrogate the design choices behind General Circulation Models and the first generation of AI weather models. Ask about grid resolution trade-offs, parameterisation schemes, the ERA5 training data, or the specific benchmarks that GraphCast, Pangu-Weather, and NeuralGCM achieved.
When the atmospheric river events of January 2023 struck California, dumping nearly a year's worth of precipitation in three weeks, emergency managers needed flood-inundation maps at neighbourhood scale. The CMIP6 global models that inform California's long-range water planning run at roughly 100 km resolution — a grid cell larger than the entire San Fernando Valley. Translating that global output into actionable local projections is the task of downscaling, and it has become one of the primary applications of deep learning in climate science.
Traditional downscaling divides into two approaches. Dynamical downscaling nests a high-resolution regional model (e.g., WRF at 4 km) inside a GCM, explicitly simulating local topographic and land-surface effects. It is physically consistent but computationally prohibitive — running a single 30-year regional simulation can take months of supercomputer time.
Statistical downscaling learns a transfer function from coarse GCM output to observed local climate variables, using historical station records as the target. Traditional statistical approaches such as BCSD (bias correction and spatial disaggregation) and the delta method are fast but assume that the coarse-to-fine relationship is stationary over time, an assumption that breaks down in a changing climate.
Deep-learning downscaling occupies a middle ground: it can learn highly nonlinear spatial relationships from large training datasets while running at inference speeds that enable probabilistic ensembles.
The most widely studied deep-learning downscaling architecture is the Convolutional Neural Network Super-Resolution (CNN-SR) approach, directly adapted from image super-resolution in computer vision. The network learns a mapping from low-resolution GCM fields to high-resolution observational gridded products (e.g., PRISM in the US, E-OBS in Europe). Applications include precipitation, temperature extremes, wind speed, and solar irradiance.
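A super-resolution network needs paired coarse and fine fields, and it needs a naive baseline to beat. The sketch below assumes, purely for illustration, that coarse inputs are made by block-averaging a fine-resolution field and that the baseline is nearest-neighbour upsampling; the 4×4 field and all function names are hypothetical and do not describe any specific study's pipeline.

```python
import math


def coarsen(field, factor):
    """Block-average a 2-D field (list of rows) by an integer factor."""
    n_rows, n_cols = len(field), len(field[0])
    assert n_rows % factor == 0 and n_cols % factor == 0
    coarse = []
    for r in range(0, n_rows, factor):
        row = []
        for c in range(0, n_cols, factor):
            block = [field[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse


def upsample_nearest(field, factor):
    """Repeat every coarse cell factor x factor times (the naive baseline)."""
    return [[value for value in row for _ in range(factor)]
            for row in field for _ in range(factor)]


def rmse(a, b):
    """Root-mean-square difference between two equally shaped 2-D fields."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return math.sqrt(sum(diffs) / len(diffs))


fine = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
coarse = coarsen(fine, 2)
baseline = upsample_nearest(coarse, 2)
print("coarse field:", coarse)
print("baseline RMSE:", round(rmse(baseline, fine), 3))
```

A learned model earns its place only if it recovers fine-scale structure with a lower error than this kind of interpolation baseline on held-out data.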
A landmark study by Sachindra et al. (2020) in the Journal of Hydrology demonstrated that LSTM-based downscaling of rainfall over Australia significantly outperformed conventional statistical methods on both mean climatology and extreme event frequency. A 2022 study led by NCAR's William D. Collins group applied a U-Net architecture to downscale CESM2 precipitation over the Western US from 1° to 1/8° resolution, achieving skill scores comparable to dynamical downscaling at 1/100th the computational cost.
The DeepSD framework (Vandal et al., 2017) was one of the first systematic demonstrations that convolutional super-resolution could match or exceed BCSD for precipitation downscaling over CONUS, using PRISM data as the high-resolution training target.
The EU's Copernicus Climate Change Service (C3S) has funded operational AI-based downscaling pipelines that translate seasonal forecast model output (ECMWF SEAS5) into country-level temperature and precipitation anomalies used directly by European agricultural and energy sector planners.
Even state-of-the-art GCMs exhibit systematic biases — their simulated precipitation distributions, surface temperature trends, or sea-surface temperatures may deviate from observations by amounts that would swamp the forced climate signal. Bias correction is the post-processing step that aligns model output distributions with observed climatology.
Classical bias correction (quantile mapping, delta method) is purely statistical and cannot adjust for physical process errors — if a model misrepresents ENSO teleconnections, no statistical correction will fix the downstream precipitation bias it produces. Deep-learning bias correction, by contrast, can learn spatially coherent corrections that account for large-scale atmospheric patterns. The ISIMIP3BASD method (Lange 2019) established a sophisticated quantile-delta mapping benchmark; subsequent neural-network approaches have shown skill improvements particularly for heavy precipitation tails that are most relevant for flood risk.
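Empirical quantile mapping, the classical baseline named above, can be sketched in plain Python. The idea is to find where a model value sits in the model's historical distribution and return the observed value at the same quantile. The helper names and the tiny samples are illustrative assumptions, and this is not the ISIMIP3BASD algorithm.

```python
from bisect import bisect_right


def fractional_rank(sorted_sample, x):
    """Position of x within the sorted sample, as a fraction in [0, 1]."""
    if x <= sorted_sample[0]:
        return 0.0
    if x >= sorted_sample[-1]:
        return 1.0
    hi = bisect_right(sorted_sample, x)  # first index whose value exceeds x
    lo = hi - 1
    frac = (x - sorted_sample[lo]) / (sorted_sample[hi] - sorted_sample[lo])
    return (lo + frac) / (len(sorted_sample) - 1)


def quantile(sorted_sample, p):
    """Linearly interpolated quantile for p in [0, 1]."""
    pos = p * (len(sorted_sample) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_sample) - 1)
    frac = pos - lo
    return sorted_sample[lo] * (1 - frac) + sorted_sample[hi] * frac


def quantile_map(value, model_hist, obs_hist):
    """Map a model value onto the observed distribution at the same quantile."""
    p = fractional_rank(sorted(model_hist), value)
    return quantile(sorted(obs_hist), p)


model_hist = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up model climatology
obs_hist = [2.0, 4.0, 6.0, 8.0, 10.0]    # made-up observed climatology
print(quantile_map(2.5, model_hist, obs_hist))
```

Note that this sketch clamps anything beyond the historical range to the observed extremes. That is exactly the weakness in the tails, and under a shifting climate, that motivates the more sophisticated methods discussed here.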
Microsoft Research's Aurora model (2024) demonstrated that a foundation model pre-trained on multiple atmospheric datasets and then fine-tuned on specific regional observational records could simultaneously perform bias correction and downscaling, outperforming single-purpose models on several regional benchmarks.
The US Army Corps of Engineers' Hydrologic Engineering Center has begun integrating AI-downscaled climate projections into its HEC-HMS flood modelling workflows, allowing planners to assess flood frequency changes at individual watershed scale using CMIP6 ensemble output — a task previously requiring months of dynamical downscaling runs.
In this lab you will explore real-world downscaling decisions. Discuss the trade-offs between statistical and dynamical downscaling, evaluate when deep learning adds value, and reason through bias correction choices for specific applications.
In 2021, ClimateBench, a standardised benchmark for climate model emulation, was released by a team led by Duncan Watson-Parris at the University of Oxford. The benchmark asked: given a time series of emissions inputs (CO₂, CH₄, SO₂, black carbon), can a machine-learning model predict the global patterns of surface temperature, precipitation, and diurnal temperature range that a full GCM would produce? The best-performing emulators outperformed simple pattern-scaling on nearly every metric, and some achieved near-GCM accuracy at 10,000 times the speed of the full Earth System Model.
A climate emulator (also called a surrogate model or reduced-complexity model) is a computationally cheap approximation of some component or output of a full Earth System Model. Emulators are trained by running the full model many times across a design-of-experiment input space and then fitting a statistical or machine-learning model to that input–output mapping.
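The train-then-query workflow can be illustrated with a deliberately tiny example. Everything here is a toy assumption: `expensive_model` stands in for a full model run (a logarithmic CO₂ response with 3 K of warming per doubling), the design is a handful of evenly spaced concentrations, and the "emulator" is simple piecewise-linear interpolation rather than a neural network.

```python
import math
from bisect import bisect_right


def expensive_model(co2_ppm):
    """Toy stand-in for a full model run: warming in K, 3 K per CO2 doubling."""
    return 3.0 * math.log2(co2_ppm / 280.0)


def fit_emulator(design_inputs, run_model):
    """Run the model at each design point; return a cheap interpolating surrogate."""
    xs = sorted(design_inputs)
    ys = [run_model(x) for x in xs]  # the only expensive calls

    def emulator(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("input lies outside the design space")
        hi = min(bisect_right(xs, x), len(xs) - 1)
        lo = hi - 1
        weight = (x - xs[lo]) / (xs[hi] - xs[lo])
        return ys[lo] * (1 - weight) + ys[hi] * weight

    return emulator


emulator = fit_emulator(range(280, 1121, 140), expensive_model)
unseen = 490.0
print("emulated:", round(emulator(unseen), 3),
      "full model:", round(expensive_model(unseen), 3))
```

The sketch refuses to answer outside the range it was trained on. Real emulators rarely fail so explicitly, which is why the coverage of the design space matters as much as the fitting method.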
Emulators serve multiple purposes in climate science:
1. Uncertainty quantification: Running a full CMIP6-class model for thousands of parameter combinations is infeasible. An emulator trained on a few hundred runs can stand in for the full model across the millions of samples needed for Bayesian parameter estimation or sensitivity analysis.
2. Impact model coupling: Integrated assessment models (IAMs) that link climate projections to economic and social outcomes need rapid climate responses for thousands of socioeconomic scenarios. Emulators enable these coupled runs.
3. Parameterisation replacement: Individual model components — convective parameterisations, cloud microphysics schemes, aerosol–radiation interactions — can be replaced with neural-network emulators trained on high-resolution process-level simulations.
FaIR (Finite Amplitude Impulse Response model): A simple climate model widely used in IPCC reports for probabilistic temperature projections. Its neural-network variants can be constrained against observational uncertainty ranges to produce calibrated 21st-century ensembles far more rapidly than full GCMs.
NCAR's CAM-ML: In 2018, Noah Brenowitz and Christopher Bretherton at the University of Washington published neural-network parameterisations of atmospheric deep convection trained on coarse-grained output from cloud-resolving simulations. When inserted into NCAR's Community Atmosphere Model, the neural networks produced stable multi-year integrations that outperformed the default convective parameterisation on several diagnostics — a landmark result in online coupled ML emulation.
M²LInES (2021–present): A multi-institution collaboration (LDEO, GFDL, MIT, NYU, CNRS) focused on developing ML parameterisations for ocean mesoscale eddies — sub-grid features that fundamentally control heat uptake and carbon sequestration but cannot be resolved at typical GCM ocean resolutions of 1°. Their eddy-flux emulators have been tested online in GFDL's MOM6 ocean model.
ClimSim (2023): A benchmark dataset released by Sungduk Yu et al. containing ~160 TB of simulation data from E3SM-MMF (a multi-scale model with explicit convection) paired with its coarse-resolution input variables — designed specifically for training neural-network emulators of moist physics parameterisations. It became the largest publicly available dataset for parameterisation emulation.
Running 10,000 climate sensitivity experiments with a full CMIP6-class model at 1° resolution would require approximately 500 million CPU-hours. A trained neural-network emulator can perform the equivalent mapping in under 10 CPU-hours — enabling uncertainty quantification studies that were previously computationally impossible.
A critical distinction in parameterisation emulation is between offline and online deployment. In offline evaluation, the neural network receives GCM state inputs from a pre-computed simulation and its outputs are evaluated against the target parameterisation — but the network never influences the model's subsequent state. This is straightforward and widely used for benchmarking.
In online deployment, the neural network is actually coupled into the running GCM, so its outputs drive the next model time step. This is far more challenging because small biases accumulate, and the model can drift into unphysical states. Several early online ML parameterisation experiments collapsed within days of simulation time. The Brenowitz–Bretherton 2018 work was notable precisely because it achieved stable multi-year online integration — a much harder target than offline accuracy.
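The gap between offline and online behaviour shows up even in a scalar toy system. In the sketch below, `true_step` is a hypothetical reference model that relaxes toward 1.0, and `emulated_step` reproduces it with a small constant one-step bias of 0.01; both are illustrative assumptions and not a representation of any GCM or published parameterisation.

```python
def true_step(x):
    """Hypothetical reference dynamics: relax 10% of the way toward 1.0."""
    return x + 0.1 * (1.0 - x)


def emulated_step(x):
    """Hypothetical emulator: the reference step plus a small constant bias."""
    return true_step(x) + 0.01


def offline_errors(n_steps, x0):
    """Emulator sees states from the reference run; its output never feeds back."""
    errors, x = [], x0
    for _ in range(n_steps):
        errors.append(abs(emulated_step(x) - true_step(x)))
        x = true_step(x)
    return errors


def online_errors(n_steps, x0):
    """Emulator drives its own next input, as in a coupled run."""
    errors, x_true, x_emul = [], x0, x0
    for _ in range(n_steps):
        x_true, x_emul = true_step(x_true), emulated_step(x_emul)
        errors.append(abs(x_emul - x_true))
    return errors


offline = offline_errors(100, x0=0.0)
online = online_errors(100, x0=0.0)
print("largest offline error:", round(max(offline), 4))
print("final online error:  ", round(online[-1], 4))
```

Here the online error settles at roughly ten times the offline error because the reference dynamics are damped. In a system with weaker damping or instabilities, the same one-step bias can grow without bound, which is the drift described above.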
Physical consistency constraints — enforcing conservation of energy and moisture — are now recognised as essential for stable online ML parameterisations. Several groups (including Tom Beucler's group at University of Lausanne) have developed constraint architectures that build conservation laws directly into the network's output layer.
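One simple way to picture a hard conservation constraint is a final correction step applied to the network's raw outputs. The sketch below shifts a column of predicted tendencies uniformly so that their mass-weighted sum equals a required total. It is a minimal illustration under assumed names and values, not the architecture of any group mentioned above.

```python
def enforce_column_conservation(raw_tendencies, layer_masses, required_total=0.0):
    """Shift raw outputs so that sum(mass * tendency) equals required_total.

    The same correction is subtracted from every layer; other ways of
    distributing the residual are possible and may be more physical.
    """
    total_mass = sum(layer_masses)
    residual = sum(m * t for m, t in zip(layer_masses, raw_tendencies))
    correction = (residual - required_total) / total_mass
    return [t - correction for t in raw_tendencies]


raw = [1.0, -0.5, 0.25]      # made-up network outputs for three layers
masses = [2.0, 1.0, 1.0]     # made-up layer masses
constrained = enforce_column_conservation(raw, masses)
budget = sum(m * t for m, t in zip(masses, constrained))
print("constrained tendencies:", constrained)
print("column budget:", budget)
```

Because the correction is a differentiable function of the raw outputs, a step like this can sit inside the network during training, so the model learns with the constraint already in force rather than having it imposed afterwards.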
Released by Sungduk Yu et al. (NeurIPS 2023 Datasets and Benchmarks track), ClimSim contains ~160 TB of paired coarse/fine-resolution atmospheric simulation data from E3SM-MMF. It is the largest public benchmark dataset for training neural-network moist physics emulators and was specifically designed to facilitate reproducible comparisons across research groups.
In this lab you will reason through emulator design decisions: what components to emulate, what training data to use, how to ensure online stability, and how to enforce physical conservation constraints. Apply lessons from ClimateBench, ClimSim, and the M²LInES project.
The June 2021 Pacific Northwest heat dome set all-time temperature records across British Columbia, Oregon, and Washington. Lytton, BC reached 49.6°C on June 29, which is 4.6°C above the previous Canadian national record. Within days of the event ending, the World Weather Attribution team published a rapid attribution study concluding the event was "virtually impossible" without human-caused climate change and at least 150 times more likely than in a pre-industrial climate. This study was completed in roughly two weeks, a timeline that depended critically on ensemble climate model runs and statistical extreme-value analysis that AI methods are now accelerating further.
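A statement such as "150 times more likely" is a probability ratio: the chance of exceeding a threshold in the factual climate divided by the chance in a counterfactual climate without human influence. The sketch below computes that ratio from raw ensemble counts using made-up temperatures. Rapid attribution studies generally fit extreme-value distributions rather than counting members, so treat this only as an illustration of the quantity being reported.

```python
import math


def exceedance_probability(sample, threshold):
    """Fraction of ensemble members at or above the threshold."""
    return sum(1 for value in sample if value >= threshold) / len(sample)


def probability_ratio(factual, counterfactual, threshold):
    """How many times more likely the event is in the factual climate."""
    p_factual = exceedance_probability(factual, threshold)
    p_counterfactual = exceedance_probability(counterfactual, threshold)
    if p_counterfactual == 0.0:
        return math.inf  # never exceeded in the counterfactual sample
    return p_factual / p_counterfactual


# Made-up annual-maximum temperatures (deg C) for two small ensembles.
counterfactual = [38.0, 39.5, 40.1, 41.0, 41.8, 42.3, 43.0, 44.2, 45.1, 46.0]
factual = [value + 1.5 for value in counterfactual]  # uniformly warmer world
ratio = probability_ratio(factual, counterfactual, threshold=45.0)
print("probability ratio:", round(ratio, 2))
```

With only ten members the estimate is extremely noisy, and an event never seen in the counterfactual sample returns infinity. Both effects are why large ensembles and fitted tail distributions are central to attribution work.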
Identifying, classifying, and tracking extreme weather events in large climate model ensembles is a fundamental task in attribution science. Traditional approaches used threshold-based detection algorithms — define a "heat wave" as temperatures exceeding a fixed percentile for a minimum number of consecutive days, then count events. Neural-network approaches can learn more nuanced event definitions directly from reanalysis, identifying atmospheric circulation patterns associated with compound extremes rather than single-variable exceedances.
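The threshold-based approach described above is easy to state in code. The sketch below assumes daily maximum temperatures in a plain list, takes the threshold from a percentile of a reference sample, and uses an illustrative minimum duration of three days; the function names and data are hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (q in 0..100) of a sample."""
    ordered = sorted(values)
    pos = q / 100 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def detect_heat_waves(daily_tmax, threshold, min_days=3):
    """Return (start_index, length) for each run of days above the threshold."""
    events, run_start = [], None
    for day, temp in enumerate(daily_tmax):
        if temp > threshold:
            if run_start is None:
                run_start = day
        else:
            if run_start is not None and day - run_start >= min_days:
                events.append((run_start, day - run_start))
            run_start = None
    # Close a run that is still open at the end of the record.
    if run_start is not None and len(daily_tmax) - run_start >= min_days:
        events.append((run_start, len(daily_tmax) - run_start))
    return events


temps = [30, 31, 36, 37, 38, 32, 36, 37, 30, 36, 36, 36, 36]  # made-up data
events = detect_heat_waves(temps, threshold=35, min_days=3)
print(events)
```

The two-day warm spell in the middle of the series is discarded purely because of the duration rule, which shows how sensitive event counts are to definitional choices that a learned detector does not have to hard-code.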
TempestExtremes (Ullrich and Zarzycki, 2017) established an open-source framework for climate model extreme event detection that has been widely used across CMIP6 studies. Building on this foundation, deep-learning classifiers trained on ERA5 have demonstrated superior skill at detecting tropical cyclone precursor patterns, atmospheric river corridors, and heat-dome circulation anomalies compared to threshold-based methods.
A 2023 study published in Nature Climate Change by Gabriel Vecchi's group at Princeton used a CNN trained on ERA5 to detect tropical cyclone tracks across 40 CMIP6 model outputs, enabling a systematic multi-model assessment of how the distribution of Atlantic hurricane intensities shifts under 2°C and 4°C warming scenarios — a scale of analysis previously requiring years of manual processing.
Climate tipping points — threshold transitions in Earth system components like the Atlantic Meridional Overturning Circulation (AMOC), Amazon dieback, or West Antarctic Ice Sheet collapse — represent some of the most consequential and least understood risks in climate science. Traditional early-warning signals (rising variance, slowing recovery from perturbations) are theoretically grounded but observationally weak and difficult to distinguish from natural variability in short instrumental records.
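The two classical indicators can be computed with rolling windows. The sketch below builds a toy AR(1) series whose memory increases over time, a standard caricature of a system losing stability, and tracks rolling variance and lag-1 autocorrelation. The series, its parameters, and the window length are illustrative assumptions and not a model of any Earth system component.

```python
import random


def variance(window):
    """Population variance of a window of values."""
    mean = sum(window) / len(window)
    return sum((x - mean) ** 2 for x in window) / len(window)


def lag1_autocorrelation(window):
    """Correlation between each value and the one that follows it."""
    mean = sum(window) / len(window)
    numerator = sum((window[i] - mean) * (window[i + 1] - mean)
                    for i in range(len(window) - 1))
    denominator = sum((x - mean) ** 2 for x in window)
    return numerator / denominator


def rolling(series, window_length, statistic):
    """Evaluate a statistic over every full-length sliding window."""
    return [statistic(series[i:i + window_length])
            for i in range(len(series) - window_length + 1)]


# Toy series: AR(1) noise whose coefficient rises from 0.1 to 0.95.
rng = random.Random(0)
n_steps, x, series = 1000, 0.0, []
for t in range(n_steps):
    phi = 0.1 + 0.85 * t / (n_steps - 1)
    x = phi * x + rng.gauss(0.0, 1.0)
    series.append(x)

rolling_var = rolling(series, 200, variance)
rolling_ac1 = rolling(series, 200, lag1_autocorrelation)
print("lag-1 autocorrelation, first vs last window:",
      round(rolling_ac1[0], 2), round(rolling_ac1[-1], 2))
```

In this idealised case both indicators rise clearly. With a short, noisy observational record the same trends can be hard to separate from natural variability, which is the weakness noted above.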
In 2023, Peter Ditlevsen and Susanne Ditlevsen published a study in Nature Communications applying statistical fingerprinting methods to Atlantic SST-based AMOC proxies, concluding that AMOC may be approaching a critical transition and could collapse between 2025 and 2095 under current emission trajectories — a conclusion that generated significant scientific debate.
Deep-learning early-warning systems have been developed by Alistair Duffey and colleagues (UCL, 2024) using Long Short-Term Memory networks trained on synthetic time series from tipping-point models. These systems can distinguish between genuine critical slowing down and noise amplification earlier than classical statistical tests — though false-alarm rates remain a significant challenge when applied to observational records.
ECMWF's SEAS5 seasonal forecast system now incorporates post-processing using neural networks trained to correct systematic biases in 2-metre temperature and precipitation anomalies across user-relevant spatial scales. This AI-assisted bias correction is applied operationally for Copernicus Climate Change Service products used by European national meteorological services.
The ultimate test of AI in climate science is operational deployment — integration into services that real decision-makers use. Several documented examples illustrate this transition:
Google DeepMind's GenCast (2024): A diffusion-based probabilistic weather model that generates ensemble forecasts capturing uncertainty structure comparable to ECMWF's 50-member ensemble at 1/10th the compute cost. GenCast demonstrated superior skill at forecasting extreme wind events relevant for the renewable energy sector.
NOAA's AI-enhanced hurricane track guidance: Since 2021, NOAA's Environmental Modeling Center has incorporated AI-based track consensus models (GPCE and related systems) as official guidance products in the National Hurricane Center's operational forecast workflow. These models demonstrated skill advantages over traditional track consensus during the 2023 hurricane season.
IBM Environmental Intelligence Suite: Deploys IBM's foundation climate model (trained on 2 petabytes of geospatial and historical weather data) to provide agricultural drought risk assessments, wildfire spread prediction, and energy demand forecasting to commercial and government clients across 150 countries.
Climate TRACE: An AI-driven coalition (Google, WattTime, Rocky Mountain Institute, and others) that uses satellite imagery, remote sensing, and machine learning to produce independent sector-by-sector greenhouse gas emission inventories at facility level — providing accountability verification that complements national government reporting.
AI climate models are not an earthquake response tool, but the response to the February 2023 Türkiye–Syria earthquake illustrates the broader pattern: AI-based debris-flow and flood risk models trained on high-resolution topography and soil data were used by UN agencies to rapidly identify secondary hazard zones in the affected region, an application of the same deep-learning spatial analysis techniques developed for climate downscaling.
Despite rapid progress, critical open problems remain. Uncertainty quantification in deep-learning climate models is poorly developed compared to the probabilistic ensemble framework of traditional GCMs. Physical interpretability — understanding why a neural network makes a particular prediction — is essential for scientific credibility but remains limited for large models. Non-stationarity is a persistent concern: models trained on historical climate may fail when deployed under future forcing conditions outside their training distribution.
The community consensus emerging from workshops at ECMWF, NCAR, and the World Meteorological Organization is that the most productive path forward is hybrid modelling: AI components embedded within physics-constrained frameworks, where the strengths of data-driven flexibility and process-based interpretability are combined rather than forced to compete.
In this lab you will apply attribution science concepts to real extreme events, evaluate the methodological choices behind rapid attribution studies, and reason about how AI tools are changing operational climate services. Engage with the World Weather Attribution framework, GenCast, and Climate TRACE approaches.