When Google published its 2024 Environmental Report, it disclosed that its total greenhouse gas emissions had risen 48% since 2019, the opposite direction from its net-zero pledge. The primary culprit, the company acknowledged, was the surging electricity demand of its AI data centers. This was not a small company's growing pain. It was one of the world's most sophisticated engineering organizations admitting that the very tool it was deploying to help humanity could not yet account for its own footprint.
Training a large language model requires running billions of matrix multiplications across thousands of specialized chips, continuously, for weeks or months. A landmark 2019 study by Emma Strubell and colleagues at the University of Massachusetts Amherst estimated that training a single large transformer-based NLP model could emit as much CO₂ as five average American cars over their entire lifetimes: roughly 284 tonnes. Later models are far larger.
GPT-3, released by OpenAI in 2020, was estimated to have required approximately 1,287 megawatt-hours of electricity during training. For context, that is roughly the annual consumption of 120 U.S. households. GPT-4's training costs have not been officially disclosed, but independent researchers and leaked estimates place energy use substantially higher. Anthropic, Google DeepMind, and Meta have similarly declined to publish precise training energy figures for their flagship models.
The lack of standardized disclosure is itself a governance problem. Without mandatory reporting, the field cannot accurately calculate its own impact or set meaningful reduction targets.
The 2023 paper "Power Hungry Processing: Watts Driving the Cost of AI Deployment?" by Luccioni, Jernite, and Strubell measured inference energy across 88 open-source models. They found that text generation tasks consumed up to 4,757 times more energy per query than simple classification tasks, demonstrating that model architecture and task type matter enormously for real-world consumption.
Training is a one-time (per model version) cost. Inference (running the model to answer queries) is continuous and accumulates at enormous scale. According to research published by the International Energy Agency in 2024, a single ChatGPT query consumes roughly 10 times the electricity of a standard Google Search. When multiplied across billions of daily queries, inference energy dwarfs training energy over a model's lifetime.
Microsoft, which integrated OpenAI models into Bing and its Office suite in 2023, reported in its 2023 Sustainability Report that data center water consumption (used for cooling) had increased by 34% year-over-year, reaching 6.4 million cubic meters. Water stress is a co-consequence of compute intensity, particularly relevant in arid regions where many large data centers are sited.
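The training-versus-inference comparison can be made concrete with a back-of-the-envelope calculation. The sketch below reuses the 1,287 MWh GPT-3 training estimate quoted earlier; the household consumption figure (10.7 MWh per year), the per-query energy (3 Wh), and the daily query volume (10 million) are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope comparison of one-time training energy with
# accumulating inference energy. Every input except TRAINING_MWH is an assumption.

TRAINING_MWH = 1_287           # GPT-3 training estimate cited in the text
HOUSEHOLD_MWH_PER_YEAR = 10.7  # assumed average U.S. household consumption
WH_PER_QUERY = 3.0             # assumed energy per chatbot query
QUERIES_PER_DAY = 10_000_000   # hypothetical deployment volume


def household_years(energy_mwh, household_mwh=HOUSEHOLD_MWH_PER_YEAR):
    """Express an energy total as years of household electricity use."""
    return energy_mwh / household_mwh


def breakeven_days(training_mwh, wh_per_query, queries_per_day):
    """Days of serving after which cumulative inference energy equals training energy."""
    training_wh = training_mwh * 1_000_000  # 1 MWh = 1,000,000 Wh
    daily_inference_wh = wh_per_query * queries_per_day
    return training_wh / daily_inference_wh


if __name__ == "__main__":
    print(f"Household-years of electricity: {household_years(TRAINING_MWH):.0f}")
    days = breakeven_days(TRAINING_MWH, WH_PER_QUERY, QUERIES_PER_DAY)
    print(f"Inference energy matches training energy after {days:.1f} days")
```

Under these assumed inputs, inference overtakes the entire training run in about six weeks, which is why per-query efficiency tends to dominate a deployed model's lifetime footprint.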
NVIDIA's H100 GPU, released in 2022, delivers roughly 3× the training throughput of the previous-generation A100 for a less-than-proportional increase in power draw, a meaningful gain in performance per watt. Google's custom Tensor Processing Units (TPUs) similarly optimize matrix operations for lower watt-per-FLOP ratios than general-purpose silicon. These hardware improvements are real.
However, economists and climate researchers warn of the Jevons Paradox: as computing becomes more efficient, it also becomes cheaper, which historically has caused total consumption to rise rather than fall. The history of computing supports this concern. CPU efficiency improved dramatically from the 1970s through the 2010s, yet global data center electricity use grew continuously. There is no strong evidence that AI hardware efficiency gains will outpace demand growth.
Sustainable AI requires both supply-side improvements (greener electricity, more efficient hardware) and demand-side discipline (choosing the right-sized model for each task, avoiding unnecessary inference, measuring and disclosing consumption). Neither alone is sufficient.
You are advising a mid-sized tech company that wants to deploy a large language model in its customer service pipeline. Before they commit, they've asked you to assess the energy implications. Use this lab to explore how to estimate, disclose, and reduce AI's energy footprint in a real deployment context.
In 2022, Google announced it had achieved 100% renewable energy matching globally since 2017, meaning that on an annual basis the company purchased as many megawatt-hours of renewable energy certificates as it consumed in electricity. This sounds definitive. But the company simultaneously acknowledged a more demanding target: 24/7 Carbon-Free Energy (CFE), which means matching clean power to consumption in every hour, in every grid region, by 2030. The gap between annual matching and hourly matching is vast, and reveals how much work remains.
The standard corporate mechanism for claiming renewable energy use is the Renewable Energy Certificate (REC). One REC represents one megawatt-hour of electricity generated from a renewable source. Companies buy RECs to "match" their consumption, but a REC purchased in Texas can offset electricity consumed from a coal-heavy grid in Virginia on a dark, windless night. Critics, including environmental nonprofit Rocky Mountain Institute, call this "spreadsheet decarbonization."
Power Purchase Agreements (PPAs) are a stronger commitment: a company contracts directly with a renewable energy developer to purchase output from a specific project over 10–25 years. Microsoft, Google, and Amazon are among the largest corporate PPA signatories globally. According to BloombergNEF's 2023 Corporate Energy Market Outlook, these three companies collectively signed over 20 gigawatts of new renewable PPAs in 2022–2023. PPAs fund the actual construction of new renewable capacity, a meaningful climate contribution, but still don't guarantee that the electrons powering a given server at a given moment are clean.
Google's 24/7 CFE initiative, launched in 2020 in partnership with the UN and other technology companies, sets a more rigorous standard: for every hour of electricity consumption, the company aims to procure an equal amount of carbon-free energy on the same regional grid. This requires dispatchable clean energy (storage, geothermal, hydro, nuclear) to cover hours when solar and wind are unavailable.
Google's 2024 report showed it achieved a global average of 64% CFE in 2023, meaning 36% of its hourly consumption was still met by carbon-emitting grid power. Progress is uneven: its Singapore data centers achieved only 4% CFE due to that grid's heavy reliance on natural gas. Its operations in Denmark and Finland, where grids are heavily renewable, performed far better.
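The difference between annual matching and hourly matching is easiest to see with numbers. The following is a minimal sketch using a hypothetical 24-hour profile: a flat load and a solar-only clean supply. The function names and the data are illustrative, and the hourly score is a simplification of how 24/7 CFE is assessed in practice (it ignores, for example, any carbon-free share already present in the grid mix).

```python
def annual_match(consumption_mwh, clean_mwh):
    """Share of total consumption covered by total clean purchases (capped at 1)."""
    return min(1.0, sum(clean_mwh) / sum(consumption_mwh))


def hourly_cfe(consumption_mwh, clean_mwh):
    """Share of consumption matched by clean energy delivered in the same hour."""
    matched = sum(min(load, clean) for load, clean in zip(consumption_mwh, clean_mwh))
    return matched / sum(consumption_mwh)


# Hypothetical day: a flat 100 MWh load each hour, with clean supply only in daylight.
load = [100] * 24
solar = [0] * 6 + [200] * 12 + [0] * 6

print(annual_match(load, solar))  # 1.0: purchases equal consumption over the period
print(hourly_cfe(load, solar))    # 0.5: surplus at noon cannot cover the night
```

The same purchases yield a 100% claim under volume matching and a 50% score under hourly matching, which is why covering the remaining hours requires storage or dispatchable clean generation.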
In September 2024, Microsoft signed a 20-year agreement with Constellation Energy to purchase power from a planned restart of the Three Mile Island Unit 1 nuclear plant in Pennsylvania, specifically to power its AI data centers. The deal underscored a broader industry recognition that intermittent renewables alone cannot meet 24/7 clean power demands for always-on computing infrastructure.
Data center siting decisions carry large carbon and water consequences. A data center in Iceland, powered almost entirely by geothermal and hydroelectric energy, has a near-zero operational carbon footprint. The same workload run from a data center in Singapore or parts of the U.S. Midwest might have a carbon intensity five to ten times higher, depending on the grid mix.
Water cooling is a related concern. Evaporative cooling towers, the dominant technology in large data centers, consume enormous volumes of freshwater. A 2021 study in Nature Communications estimated that U.S. data centers withdrew roughly 1.7 billion liters of water per day. Microsoft's reported 34% increase in water use from 2021 to 2022 illustrates how AI scaling is accelerating this demand. Data centers in the American West, including major cloud regions in Arizona, Nevada, and Oregon, face increasing scrutiny from water authorities as drought conditions worsen.
The strongest form of renewable energy procurement creates additionality: it funds new clean capacity that would not have existed otherwise. PPAs that finance new solar or wind projects contribute to additionality. Buying existing RECs from projects that were already running does not. Evaluating additionality is now a core criterion in serious corporate sustainability assessments.
Your organization's sustainability team has received three vendor proposals for cloud AI services. Each vendor makes different renewable energy claims. You need to evaluate these claims critically and recommend a procurement approach that reflects genuine climate impact rather than marketing.
In February 2023, Meta released LLaMA, a family of open-source language models ranging from 7 billion to 65 billion parameters. The 13B version, researchers quickly noted, outperformed the 175B-parameter GPT-3 on most benchmarks while requiring a fraction of the compute for inference. This was not magic. It reflected a decade of architectural improvement: better training data curation, more efficient attention mechanisms, and lessons from the scaling laws literature. The release forced a broader conversation: had the AI industry been over-parameterizing models out of competitive pressure rather than necessity?
In 2022, researchers at DeepMind published "Training Compute-Optimal Large Language Models," quickly nicknamed the Chinchilla paper. Their central finding: most large language models up to that point had been significantly undertrained relative to their parameter count. The optimal trade-off, they argued, is to train a smaller model on more data rather than a larger model on less data.
Their Chinchilla model, at 70 billion parameters trained on 1.4 trillion tokens, outperformed the 280-billion-parameter Gopher model on nearly every benchmark while using a comparable training compute budget and substantially less compute for inference. The implication for sustainability is direct: if a 70B model consistently beats a 280B model, deploying the 280B model at scale is not just computationally wasteful; it is environmentally wasteful.
The Chinchilla paper (Hoffmann et al., DeepMind, 2022) demonstrated that, for a given compute budget, the optimal strategy is to scale model size and training tokens in roughly equal proportion. Most prior large models had spent their compute budgets on maximizing parameter count rather than training duration. Rebalancing toward more data yields "compute-optimal" models that are 3–4× smaller than their predecessors but comparably capable.
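A rough calculation shows why the smaller model is the cheaper one to deploy. The sketch below relies on a rule of thumb from the scaling-laws literature that is not stated in this text: training compute is approximately 6 × parameters × tokens floating-point operations. Gopher's 300 billion training tokens are taken from the Hoffmann et al. paper rather than from the figures above, so treat both as assumptions.

```python
def training_flops(params, tokens):
    """Approximate training compute using the common C ~ 6 * N * D rule of thumb."""
    return 6 * params * tokens


GOPHER = {"params": 280e9, "tokens": 300e9}      # token count assumed from the paper
CHINCHILLA = {"params": 70e9, "tokens": 1.4e12}  # figures quoted in the text

gopher_flops = training_flops(**GOPHER)
chinchilla_flops = training_flops(**CHINCHILLA)

# Tokens seen per parameter, and how much larger Gopher is at inference time.
tokens_per_param = CHINCHILLA["tokens"] / CHINCHILLA["params"]
size_ratio = GOPHER["params"] / CHINCHILLA["params"]

print(f"Gopher training compute:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla training compute: {chinchilla_flops:.2e} FLOPs")
print(f"Chinchilla tokens per parameter: {tokens_per_param:.0f}")
print(f"Gopher is {size_ratio:.0f}x larger, so each generated token costs roughly that much more")
```

Under this approximation the two training budgets land in the same range, while the per-token inference cost, which scales with parameter count, differs by about a factor of four. That recurring difference is where the sustainability benefit accumulates.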
Once a model is trained, several techniques can dramatically reduce its inference cost:
Quantization reduces the numerical precision of model weights. A standard model uses 32-bit floating-point numbers (FP32). Quantizing to 8-bit integers (INT8) cuts weight memory to roughly a quarter of the FP32 footprint (half of a 16-bit one) and speeds inference significantly, with minimal accuracy loss on most tasks. 4-bit quantization is increasingly viable. The open-source community developed tools like GPTQ and bitsandbytes that made 4-bit quantization of LLaMA-family models practical on consumer hardware, enabling a model that previously required a server cluster to run on a laptop.
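To make the memory arithmetic and the basic mechanism concrete, here is a minimal sketch in plain Python. It computes weight memory for a hypothetical 7-billion-parameter model at several precisions, then applies simple symmetric per-tensor INT8 quantization to a toy weight list. This is a teaching illustration only; tools such as GPTQ and bitsandbytes use more sophisticated schemes, and the function names here are hypothetical.

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(params, precision):
    """Memory for weights alone (no activations or KV cache), in gigabytes."""
    return params * BITS[precision] / 8 / 1e9


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


for precision in BITS:
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")

weights = [0.5, -1.0, 0.25, 0.0]  # toy example
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"codes={codes}, worst rounding error={worst_error:.5f}")
```

Relative to FP32, INT8 stores each weight in a quarter of the space and INT4 in an eighth; the rounding error per weight is bounded by half the scale factor, which is why accuracy loss is usually small.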
Pruning removes weights or attention heads identified as low-contribution during a structured analysis phase. Structured pruning can reduce model size by 30–50% with modest accuracy degradation for many real-world tasks. Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model, transferring learned behavior into a more efficient architecture. Hugging Face's DistilBERT, published in 2019, achieved 97% of BERT's performance on GLUE benchmarks with 40% fewer parameters and 60% faster inference.
Perhaps the most actionable insight for practitioners is right-sizing: using the smallest model capable of achieving acceptable performance for a given task. A 175-billion-parameter model is not appropriate for classifying whether a customer support ticket is about billing or shipping. A fine-tuned BERT-class model with ~110M parameters can achieve near-identical accuracy on such classification tasks at 1/1,000th the inference cost.
A 2023 study from Hugging Face and Carnegie Mellon University ("Efficiency Benchmarks for NLP") found that for the majority of enterprise NLP tasks (classification, extraction, summarization of short texts), models in the 1–7B parameter range performed comparably to 70B+ models, while consuming 10–50× less energy per inference. The researchers recommended that organizations build explicit model selection criteria based on task complexity rather than defaulting to the largest available model.
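An explicit selection criterion can be as simple as "pick the smallest model that clears the accuracy bar on our own benchmark." The sketch below illustrates that rule; the candidate names, sizes, and accuracy figures are hypothetical placeholders that you would replace with measurements from your own task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_billions: float
    accuracy: float  # measured on your own task benchmark, between 0 and 1


def smallest_adequate(candidates, min_accuracy):
    """Return the smallest candidate meeting the accuracy bar, or None if none does."""
    adequate = [c for c in candidates if c.accuracy >= min_accuracy]
    if not adequate:
        return None
    return min(adequate, key=lambda c: c.params_billions)


# Hypothetical benchmark results for a ticket-classification task.
candidates = [
    Candidate("bert-class-finetuned", 0.11, 0.94),
    Candidate("mid-size-7b", 7.0, 0.95),
    Candidate("frontier-175b", 175.0, 0.96),
]

choice = smallest_adequate(candidates, min_accuracy=0.93)
print(choice.name if choice else "no candidate meets the bar")
```

The design choice worth noting is that the accuracy threshold is set by the task, not by the best available score: a two-point gain rarely justifies a model more than a thousand times larger.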
1. Benchmark task requirements before selecting model size.
2. Apply quantization (INT8 minimum) for production inference.
3. Evaluate distilled alternatives before deploying full-scale models.
4. Monitor inference energy using tools like CodeCarbon or the ML CO₂ Impact calculator.
5. Cache frequent responses to avoid redundant inference computation.
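The caching item in the checklist above can be sketched with the standard library alone. Here `run_model` is a hypothetical stand-in for a real inference call, and the normalization step (lowercasing and collapsing whitespace) is an assumption about which queries should count as identical; a production system would need to decide that carefully.

```python
from functools import lru_cache

CALLS = {"model": 0}  # counts how often the expensive path actually runs


def run_model(prompt):
    """Hypothetical stand-in for an expensive model inference call."""
    CALLS["model"] += 1
    return f"answer to: {prompt}"


@lru_cache(maxsize=1024)
def _cached_answer(normalized_prompt):
    return run_model(normalized_prompt)


def answer(prompt):
    """Normalize the prompt so trivially different phrasings share a cache entry."""
    normalized = " ".join(prompt.lower().split())
    return _cached_answer(normalized)


answer("Where is my order?")
answer("  where IS my   order? ")  # served from the cache, no second model call
print(CALLS["model"])  # 1
```

Every cache hit is an inference call that never runs, so the energy saving scales directly with how repetitive the query stream is.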
A logistics company is deploying AI across four internal workflows: (1) routing optimization, (2) customer email classification, (3) generating contract summaries, and (4) real-time driver safety alerts. They're planning to use GPT-4-class models for all four. Your job is to advise on right-sizing, quantization opportunities, and the energy implications of their current plan.
When the U.S. Securities and Exchange Commission finalized its climate disclosure rules in March 2024, requiring large public companies to report material climate-related risks and Scope 1 and 2 emissions, the rule did not explicitly mention AI energy consumption. But legal analysts immediately noted the implication: for technology companies where AI workloads constitute a significant portion of energy use, AI-driven emissions would be material climate-related risks requiring disclosure. The regulatory pressure, long anticipated, had arrived, even if AI wasn't named directly.
Several practical tools now exist for measuring AI energy consumption at the code level:
CodeCarbon is an open-source Python library developed by Mila (Montréal Institute for Learning Algorithms), Université de Montréal, and partners. It measures the energy consumption of Python code execution and converts it to estimated CO₂ equivalent based on the carbon intensity of the electricity grid where the computation is running. As of 2024, CodeCarbon has been downloaded over 400,000 times and is integrated into several cloud ML platforms.
ML CO₂ Impact Calculator (mlco2.github.io) allows researchers to estimate the emissions of training runs by inputting hardware type, cloud provider, region, and training duration. It draws on the ElectricityMaps API for regional grid intensity data. The tool was used in a 2022 NeurIPS paper that found emissions reporting to be absent from the large majority of published ML papers: fewer than 5% of accepted papers reported training energy.
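The arithmetic behind calculator-style estimates is straightforward: device power times runtime gives energy, a facility overhead factor scales it up, and grid carbon intensity converts it to emissions. The sketch below shows that chain under stated assumptions. It is not the implementation of either tool; the PUE (power usage effectiveness) overhead factor and every input value are hypothetical.

```python
def training_emissions_kg(num_devices, device_watts, hours, pue, grid_g_per_kwh):
    """Estimate CO2-equivalent emissions (kg) for a training run.

    pue is the facility overhead factor (assumed, typically above 1.0);
    grid_g_per_kwh is the regional grid carbon intensity in gCO2e per kWh.
    """
    device_kwh = num_devices * device_watts * hours / 1000  # W*h -> kWh
    facility_kwh = device_kwh * pue                          # add cooling and overhead
    return facility_kwh * grid_g_per_kwh / 1000              # g -> kg


# Hypothetical run: 8 accelerators at 400 W for 72 hours, PUE of 1.2.
run = dict(num_devices=8, device_watts=400, hours=72, pue=1.2)

fossil_heavy = training_emissions_kg(**run, grid_g_per_kwh=400)  # assumed intensity
low_carbon = training_emissions_kg(**run, grid_g_per_kwh=30)     # assumed intensity

print(f"Fossil-heavy grid: {fossil_heavy:.1f} kg CO2e")
print(f"Low-carbon grid:   {low_carbon:.1f} kg CO2e")
```

Running the identical workload on the two assumed grids changes the result by more than a factor of ten, which is why region is one of the inputs these tools ask for.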
Experiment trackers such as Weights & Biases and MLflow have added energy tracking integrations that log kWh alongside loss curves, enabling teams to visualize the energy cost of hyperparameter experiments, often revealing that extensive search is consuming disproportionate energy relative to accuracy gains.
An analysis of 1,700 papers accepted to NeurIPS 2022 found that fewer than 5% reported any training energy figure. The authors called this a "reproducibility crisis for sustainability": without energy disclosure, it is impossible for the field to build cumulative knowledge about which approaches are computationally efficient, or to hold itself accountable for its footprint.
The EU AI Act, formally adopted in 2024, includes provisions requiring high-risk AI systems to document their energy consumption in technical documentation submitted to regulators. While the implementing regulations are still being developed, the Act establishes energy use as a required disclosure element for systems above a defined risk threshold.
The EU Corporate Sustainability Reporting Directive (CSRD), effective for large companies from fiscal year 2024, requires detailed reporting under the European Sustainability Reporting Standards (ESRS). ESRS E1 on climate change explicitly requires disclosure of energy consumption by source, Scope 1–3 emissions, and targets, all of which encompass data center and AI workload energy.
In the United States, the Executive Order on AI directed the Department of Energy to develop methodologies for assessing AI energy and water consumption across federal agencies. The National AI Initiative also published a voluntary framework for AI sustainability reporting, though it lacks enforcement mechanisms.
For organizations using third-party AI APIs (OpenAI, Anthropic, Google, Microsoft), the AI-related emissions fall under Scope 3 β indirect emissions from purchased goods and services. Scope 3 accounting is the most contested and least standardized domain of corporate carbon reporting. The GHG Protocol's Scope 3 Technical Guidance recommends that companies include purchased cloud computing emissions, but few AI providers currently publish the granular data needed to calculate this accurately.
Amazon Web Services, Google Cloud, and Microsoft Azure have all developed carbon footprint reporting tools for enterprise customers, but these tools rely on average fleet carbon intensities rather than workload-specific measurements, which tends to underestimate AI-heavy workloads' actual footprint. Advocates including the Green Software Foundation have called for API-level energy reporting as a standard feature of cloud AI services.
A credible AI sustainability disclosure should include: (1) Total kWh consumed by AI workloads (training + inference separately); (2) Carbon intensity of electricity source (grid average or specific PPA); (3) Water consumption for cooling; (4) Hardware utilization rates; (5) Comparison of model-size choices and alternatives considered; (6) Year-over-year trends. The Green Software Foundation's Software Carbon Intensity (SCI) specification provides a standardized formula: SCI = ((E × I) + M) per R, where E is energy, I is grid carbon intensity, M is embodied hardware emissions, and R is the functional unit (query, user, transaction).
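The SCI formula translates directly into a few lines of code. This is a minimal sketch: the function name, units, and input values are illustrative assumptions chosen to keep the arithmetic easy to follow, not figures from any real deployment.

```python
def sci(energy_kwh, intensity_g_per_kwh, embodied_g, functional_units):
    """Software Carbon Intensity: ((E * I) + M) per R, in gCO2e per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units (R) must be positive")
    operational_g = energy_kwh * intensity_g_per_kwh  # E * I
    return (operational_g + embodied_g) / functional_units


# Hypothetical month for a customer service chatbot:
# E = 120 kWh, I = 400 gCO2e/kWh, M = 2,000 gCO2e amortized hardware share,
# R = 100,000 answered queries.
score = sci(energy_kwh=120, intensity_g_per_kwh=400, embodied_g=2_000,
            functional_units=100_000)
print(f"SCI: {score} gCO2e per query")  # 0.5
```

Because the score is a rate per functional unit, it can be tracked over time and compared across model choices even as query volume changes.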
A publicly traded retail company has just become subject to EU CSRD requirements. Their legal and sustainability teams have asked you, as their AI governance advisor, to help design a methodology for measuring and disclosing the carbon footprint of their three deployed AI systems: a demand forecasting model, a product recommendation engine, and a customer service chatbot. They have no existing measurement infrastructure.