AI Sustainability Research

AI Over-Compute

Pilot design before the headline

We start with 10,000 labeled chat episodes.

The pilot draws 10,000 chat episodes from WildChat-4.8M, a public dataset of real user-chatbot interactions. The unit of analysis is the conversation episode. Each row is counted once: it is first interpreted as an actual user task, then decomposed into the capabilities needed to answer it.

Source: WildChat. Real user-chatbot conversations from the public WildChat-4.8M dataset.
Pilot rows: 10,000. Chat episodes sampled from 2023-04-09 to 2025-07-31.
MECE execution buckets: 7. Every task belongs to one and only one lowest-sufficient execution bucket.
Classification: Gemini. Gemini-only labels with strict exact-ID batch validation.

Data And Labels

Each chat is converted into a task decomposition

For each row, Gemini first identifies the actual task the user is trying to complete. It then decomposes the task into needed capabilities: lookup, reading, writing, calculation, local software, small-model generation, standard LLM synthesis, reasoning, tools, or expert judgment. The final execution bucket is a consequence of that decomposition, not the starting assumption.
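A minimal sketch of that labeling loop, assuming an injected `call_gemini` function that returns one bucket label per row ID; the function name, batch size, and bucket IDs are illustrative, not the pilot's actual code. The validation step enforces the strict exact-ID rule: a batch is rejected unless the returned IDs match the sent IDs exactly and every label is one of the seven MECE buckets.

```python
# Sketch of Gemini-only batch labeling with strict exact-ID validation.
# Bucket IDs mirror this page's seven MECE execution buckets.
MECE_BUCKETS = {
    "local_tool_software", "direct_search_lookup", "search_plus_reading",
    "small_model_sufficient", "standard_llm_sufficient",
    "reasoning_agent_required", "expert_not_comparable",
}

def validate_batch(sent_ids, labels):
    """Reject a batch unless returned IDs exactly match sent IDs and
    every label is one of the seven MECE buckets."""
    if set(labels) != set(sent_ids):
        raise ValueError("returned IDs do not exactly match sent IDs")
    bad = {rid: b for rid, b in labels.items() if b not in MECE_BUCKETS}
    if bad:
        raise ValueError(f"non-MECE labels returned: {bad}")
    return labels

def label_episodes(episodes, call_gemini, batch_size=50):
    """episodes: {row_id: conversation_text}. call_gemini is assumed to
    return {row_id: bucket_id} for exactly the IDs it was sent."""
    labels = {}
    ids = list(episodes)
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        out = call_gemini({rid: episodes[rid] for rid in batch})
        labels.update(validate_batch(batch, out))
    return labels
```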

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Actual task

What the user is really trying to accomplish, not just the words in the prompt.

3 Task decomposition

Break the task into lookup, reading, generation, calculation, software, reasoning, tool, and expert components.

4 Energy replay

Estimate GPT-5-class answering energy, then compare it with the lowest sufficient execution bucket.
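The four steps above imply a per-row record roughly like the following; the field names are illustrative, not the pilot's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PilotRow:
    # Step 1: raw WildChat conversation metadata.
    prompt: str
    answer: str
    model: str           # e.g. "gpt-4o"
    timestamp: str
    language: str
    turns: int
    # Step 2: the actual task the user is trying to complete.
    actual_task: str = ""
    # Step 3: capability decomposition (lookup, reading, calculation, ...).
    capabilities: list[str] = field(default_factory=list)
    # Step 4: energy replay inputs and outputs.
    bucket: str = ""          # lowest sufficient execution bucket
    replay_wh: float = 0.0    # GPT-5-class central replay energy
    matched_wh: float = 0.0   # task-matched execution energy
```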

MECE execution bucket | Count | Share | Underlying labels
Standard LLM sufficient | 8,227 | 82.3% | standard_llm 8,227
Small model sufficient | 807 | 8.1% | small_model_local 527, small_model_cloud 280
Direct search / lookup | 437 | 4.4% | web_search_simple 362, reference_lookup 75
Search + reading | 157 | 1.6% | web_search_research 157
Local tool / software | 46 | 0.5% | local_deterministic_tool 38, native_software_workflow 8
Reasoning / agent required | 306 | 3.1% | reasoning_llm 209, llm_with_tools 91, remote_specialized_tool 6
Expert / not comparable | 20 | 0.2% | human_expert 13, not_comparable 7

Core Message

The 28.8% number is a counterfactual replay result.

It comes from replaying the same 10,000 tasks two ways: GPT-5.5 central answering every task costs 35,453.0 Wh, while task-matched execution costs 25,257.2 Wh. The difference is 10,195.8 Wh, so 10,195.8 / 35,453.0 = 28.8%. "Cutting energy" means changing the execution path after task decomposition, not claiming the GPT-5-class model itself became more efficient.

Counterfactual Accounting

Decompose first, then compare the same tasks two ways

Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.

1. GPT-5.5 central replay 35,453.0 Wh

Counterfactual baseline: every task is answered through the GPT-5.5 central path.

2. Task-matched execution 25,257.2 Wh

After decomposition, each task uses the lowest sufficient execution bucket.

3. Energy cut 28.8%

(35,453.0 - 25,257.2) / 35,453.0 = 10,195.8 / 35,453.0 = 28.8%.
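The three numbers chain together and can be checked directly:

```python
replay_wh = 35_453.0    # GPT-5.5 central replay, all 10,000 tasks
matched_wh = 25_257.2   # task-matched execution, same tasks

saved_wh = replay_wh - matched_wh   # 10,195.8 Wh
cut = saved_wh / replay_wh          # 0.2876 -> 28.8%
print(f"saved {saved_wh:.1f} Wh, cut {cut:.1%}")
```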

Why This Equation

The numerator is avoidable cloud execution, not total AI energy.

The equation isolates the part of GPT-5-class answering energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks one narrow question: for the same task mix, how much does cloud inference energy change if we stop sending every task through the same GPT-5.5 central path?

Task bucket | GPT-5.5 replay | Task-matched | Saved | Share of saved | Why energy changes
Standard LLM sufficient | 29,793.9 Wh | 22,865.4 Wh | 6,928.6 Wh | 68.0% | Still needs an LLM, but not every case needs the full GPT-5.5 central replay path.
Small model sufficient | 1,842.7 Wh | 12.8 Wh | 1,830.0 Wh | 17.9% | Short, common, low-risk language tasks move to small local/cloud models.
Direct search / lookup | 817.0 Wh | 179.1 Wh | 637.9 Wh | 6.3% | Known facts and lists are answered through search/reference lookup.
Reasoning / agent required | 2,439.3 Wh | 1,995.1 Wh | 444.2 Wh | 4.4% | These tasks still need heavier execution, so the saving is smaller.
Search + reading | 400.0 Wh | 140.1 Wh | 259.9 Wh | 2.5% | Search supplies sources; human reading and synthesis remain separate from cloud carbon.
Local tool / software | 95.3 Wh | 0.0 Wh | 95.3 Wh | 0.9% | Calculator, spreadsheet, or local software replaces cloud inference for the answer.
Expert / not comparable | 64.7 Wh | 64.7 Wh | 0.0 Wh | 0.0% | High-stakes or unsafe tasks are not scored as simple model/search substitutions.
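The Saved and Share-of-saved columns follow mechanically from the first two columns; a sketch that reproduces them from the table's own totals:

```python
# Per-bucket totals from the table above, in Wh: (replay, task-matched).
buckets = {
    "standard_llm":    (29_793.9, 22_865.4),
    "small_model":     (1_842.7, 12.8),
    "direct_search":   (817.0, 179.1),
    "reasoning_agent": (2_439.3, 1_995.1),
    "search_reading":  (400.0, 140.1),
    "local_tool":      (95.3, 0.0),
    "expert":          (64.7, 64.7),
}

total_saved = sum(r - m for r, m in buckets.values())  # ~10,195.8 Wh
for name, (replay, matched) in buckets.items():
    saved = replay - matched
    print(f"{name:16s} saved {saved:8.1f} Wh ({saved / total_saved:5.1%})")
```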

Energy Multipliers

Use one additive task-energy model before comparing alternatives

Every pilot row is decomposed into base visible inference, multiple model responses, a reasoning add-on, a search add-on, and a tool add-on. Each actual task is then compared with the lowest sufficient execution bucket: local software, direct search, search plus reading, small model, standard LLM, reasoning/agent, or expert.
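A sketch of the lowest-sufficient-bucket rule, assuming capability needs produced by the decomposition step; the capability names and the precedence ordering below are illustrative assumptions, not the pilot's actual router.

```python
# Assign a task to the cheapest bucket that covers everything it needs.
# Expensive requirements are checked first because they dominate routing;
# the exact precedence here is an illustrative assumption.
def lowest_sufficient_bucket(needs: set[str]) -> str:
    if "expert_judgment" in needs:
        return "expert_not_comparable"
    if needs & {"reasoning", "tools"}:
        return "reasoning_agent_required"
    if needs & {"local_software", "calculation"}:
        return "local_tool_software"
    if "lookup" in needs:
        return "search_plus_reading" if "reading" in needs else "direct_search_lookup"
    if "small_model_generation" in needs:
        return "small_model_sufficient"
    return "standard_llm_sufficient"

print(lowest_sufficient_bucket({"lookup"}))              # direct_search_lookup
print(lowest_sufficient_bucket({"writing", "reasoning"}))  # reasoning_agent_required
```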

Where 28.8% comes from: 10,195.8 Wh

(35,453.0 Wh GPT-5.5 replay - 25,257.2 Wh task-matched execution) / 35,453.0 Wh.

GPT-5.5 central CO2 saved: 4.017 kg

Uses the EPA U.S. average electricity factor of 0.394 kg CO2/kWh.

GPT-5.5 heavy saved: 16,688.1 Wh

Heavy GPT-5.5 active-compute sensitivity.

GPT-5.5 heavy CO2 saved: 6.575 kg

Same task decomposition, heavier GPT-5.5 path.

MECE execution bucket | Share | GPT-5.5 central avg | Route avg | Multiplier | Central saved
Local tool / software | 46 / 10,000 (0.5%) | 2.07 Wh | ~0 cloud Wh | n/a (route ~0 Wh) | 95.3 Wh
Direct search / lookup | 437 / 10,000 (4.4%) | 1.87 Wh | 0.410 Wh | 4.6x | 637.9 Wh
Search + reading | 157 / 10,000 (1.6%) | 2.55 Wh | 0.892 Wh | 2.9x | 259.9 Wh
Small model sufficient | 807 / 10,000 (8.1%) | 2.28 Wh | 0.016 Wh | 144.1x | 1,830.0 Wh
Standard LLM sufficient | 8,227 / 10,000 (82.3%) | 3.62 Wh | 2.779 Wh | 1.3x | 6,928.6 Wh
Reasoning / agent required | 306 / 10,000 (3.1%) | 7.97 Wh | 6.520 Wh | 1.2x | 444.2 Wh
Expert / not comparable | 20 / 10,000 (0.2%) | 3.24 Wh | 3.236 Wh | 1.0x | 0.0 Wh
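Both derived columns come from the per-row averages and counts; checking the direct-search row:

```python
count, central_avg, route_avg = 437, 1.87, 0.410   # direct search / lookup
multiplier = central_avg / route_avg               # ~4.6x
central_saved = (central_avg - route_avg) * count  # ~638 Wh
print(f"{multiplier:.1f}x, {central_saved:.1f} Wh saved")
```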

Interpretation

Search is one route, not the whole story.

Averages differ because the conversations differ in length, turn count, and model route. Under GPT-5.5 central, direct-search cases average 1.87 Wh on the GPT-5.5 path and 0.410 Wh on the search path. Small-model cases show the largest gap: 2.28 Wh on GPT-5.5 versus 0.016 Wh on the assigned small-model execution path.

Scale-Up

Small savings per chat become TWh-scale at global AI volume

The pilot saves 1.02 to 1.67 Wh per row under GPT-5.5 central and heavy assumptions. Scaling that routing intensity shows what the opportunity looks like at platform and global volume.

Human population: 8.3B

Approximate 2026 world population.

ChatGPT weekly users: 800M+

OpenAI public usage anchor.

Saved per pilot row: 1.02-1.67 Wh

GPT-5.5 central to GPT-5.5 heavy range.

Conversion factor: 0.394 kg CO2/kWh

EPA U.S. average electricity factor.

Scale scenario | Conversations/year | Energy saved | CO2 saved | Forest equivalent
800M weekly users × 5 chats/week | 208B | 0.21-0.35 TWh/year | 0.08-0.14 MtCO2/year | 84k-137k acres of U.S. forest/year
1B users × 5 chats/day | 1.83T | 1.86-3.05 TWh/year | 0.73-1.20 MtCO2/year | 733k-1.20M acres of U.S. forest/year
10B AI chats/day globally | 3.65T | 3.72-6.09 TWh/year | 1.47-2.40 MtCO2/year | 1.47M-2.40M acres of U.S. forest/year
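Each scenario row is the pilot's per-row saving multiplied by annual volume, then converted with the EPA grid factor. A sketch for the first scenario; the forest factor of roughly 1 t CO2 per acre per year is an assumption read back from the table above, not an independently sourced coefficient.

```python
SAVED_PER_CHAT_WH = (1.02, 1.67)   # central..heavy range from the pilot
GRID_KG_PER_KWH = 0.394            # EPA U.S. average electricity factor
TCO2_PER_ACRE_YEAR = 1.0           # implied by the table; assumption

chats_per_year = 800e6 * 5 * 52    # 800M weekly users x 5 chats/week = 208B
for wh in SAVED_PER_CHAT_WH:
    twh = chats_per_year * wh / 1e12           # 0.21 .. 0.35 TWh/year
    mtco2 = twh * 1e9 * GRID_KG_PER_KWH / 1e9  # 0.08 .. 0.14 MtCO2/year
    acres = mtco2 * 1e6 / TCO2_PER_ACRE_YEAR   # ~84k .. 137k acres/year
    print(f"{twh:.2f} TWh/yr, {mtco2:.2f} MtCO2/yr, {acres:,.0f} acres/yr")
```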

Research Rule

Identify the task first; decompose it second; replay GPT-5-class energy third.

Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.

Method

The task-energy model is additive

GPT-5.5 central uses a 0.85 Wh base response-equivalent and a 6.5 Wh reasoning total. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.

Base visible inference: E_base × max(responses, tokens / reference_tokens)
Reasoning add-on: max(0, E_reason_total - E_base) × responses
Search add-on: 0.30 Wh × search calls
Cloud carbon: CO2 = (E / 1000) × 0.394 kg/kWh
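A direct transcription of the four formulas above; the reference output length per response (REF_TOKENS) is not stated on this page, so the value below is an assumption.

```python
E_BASE = 0.85          # Wh, GPT-5.5 central base response-equivalent
E_REASON_TOTAL = 6.5   # Wh, total when GPT-5.5 reasoning is invoked
E_SEARCH = 0.30        # Wh per search query
GRID = 0.394           # kg CO2 per kWh, EPA U.S. average
REF_TOKENS = 1_000     # assumed reference output tokens per response

def task_energy_wh(responses, tokens, reasoning, search_calls):
    base = E_BASE * max(responses, tokens / REF_TOKENS)
    # Reasoning add-on applies only when the reasoning route is invoked.
    reason = max(0.0, E_REASON_TOTAL - E_BASE) * responses if reasoning else 0.0
    search = E_SEARCH * search_calls
    return base + reason + search

def cloud_carbon_kg(energy_wh):
    return energy_wh / 1000 * GRID

# Pilot totals: 35,453.0 Wh -> 13.968 kg; 25,257.2 Wh -> 9.951 kg.
print(cloud_carbon_kg(35_453.0), cloud_carbon_kg(25_257.2))
```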
Status quo cloud carbon: 13.968 kg CO2

35,453.0 Wh across 10,000 pilot rows.

Routed cloud carbon: 9.951 kg CO2

25,257.2 Wh after matching each task to its lowest sufficient execution bucket.

Cloud carbon saved: 4.017 kg CO2

Pilot-scale value; platform-scale value is shown in the scale-up section.

Appendix

Data coverage and GPT-5.5 coefficient derivation

The main story uses a GPT-5.5 replay. The details below show where the pilot conversations came from and how the GPT-5.5 energy assumptions are anchored.

GPT-4o anchor: 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

GPT-5.5 central base: 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

GPT-5.5 reasoning total: 6.5 Wh

Central estimate when GPT-5.5 reasoning is invoked.

GPT-5.5 heavy path: 1.5 Wh / 10 Wh

Heavy standard base / reasoning total for Pro-style paths.
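The GPT-5.5 coefficients follow from the GPT-4o anchor and the active-compute multipliers; a quick check:

```python
GPT4O_ANCHOR_WH = 0.34                  # top of the public 0.31-0.34 Wh range
central_base = 2.5 * GPT4O_ANCHOR_WH    # 0.85 Wh central base
heavy_base = 4.4 * GPT4O_ANCHOR_WH      # ~1.5 Wh heavy base
print(central_base, heavy_base)         # 0.85, 1.496
```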

Parameter Proxy

Define GPT-5.5 by active compute.

GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops.

GPT-5.5 assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent route | Interpretation
GPT-5.5 central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main calculation used in the page headline
GPT-5.5 heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy routes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows

Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 3,326 | 2023-2024 | Observed demand | Replay through GPT-5.5-era router
GPT-4 early / turbo / preview | 1,174 | 2023-2024 | Observed demand | Replay with GPT-5.5 central/heavy
GPT-4o | 3,107 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 718 | 2024-2025 | Small-model route evidence | Lower-compute alternative
o1 reasoning | 1,278 | 2024 | Reasoning route evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 397 | 2025 | Observed demand | Replay with GPT-5.5 central/heavy

Sources

Sources and anchors for the calculation