AI Sustainability Research

AI Over-Compute

Pilot design before the headline

We start with 300 labeled chat episodes.

The pilot draws 300 chat episodes from WildChat-4.8M, a public dataset of real user-chatbot interactions. The unit of analysis is the 300-row pilot draw. Each row is counted once, classified into the task it performs, and assigned exactly one lowest-compute route that could satisfy the request.

Source: WildChat. Real user-chatbot conversations from the public WildChat-4.8M dataset.
Pilot rows: 300. Chat episodes sampled from 2023-04-11 to 2025-07-08.
MECE route buckets: 7. Every row belongs to one and only one route bucket.
Classification: 3-way. Gemini, Claude, and Codex labels merged into one route decision.

Data And Labels

Each chat is converted into a routing decision

For each row, the classifiers label the user's intent, observed route, lowest sufficient route, and energy inputs such as visible tokens, model responses, reasoning, search, and tool calls. The route labels below are mutually exclusive and collectively exhaustive across the 300 rows.

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Task label

Factual lookup, writing, coding, calculation, local software, reasoning, tool workflow, or not comparable.

3 MECE route

Local tool, direct search, search plus reading, small model, standard LLM, reasoning/agent, or expert.

4 Cost replay

Replay the same task under GPT-5.5 central/heavy, then compare it with the assigned route directly.
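The per-row record implied by the four steps above can be sketched as a small schema. This is an illustrative sketch only; the field names are assumptions, not the project's actual column names.

```python
from dataclasses import dataclass

# Illustrative per-row record for the pilot; field names are
# assumptions chosen to mirror the four labeling steps above.
@dataclass
class PilotRow:
    prompt: str            # WildChat user prompt
    model: str             # model name observed in the log
    task_label: str        # e.g. "factual_lookup", "writing", "coding"
    observed_route: str    # route the conversation actually took
    mece_route: str        # exactly one of the 7 MECE route buckets
    responses: int         # number of model responses
    visible_tokens: int    # tokens in the visible answer
    search_calls: int      # web-search invocations
    used_reasoning: bool   # whether a reasoning path was invoked
```

Keeping the observed route and the assigned MECE route as separate fields is what makes the later "replay versus routed" comparison possible.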

MECE route bucket | Count | Share | Underlying labels
Standard LLM sufficient | 150 | 50.0% | standard_llm
Small model sufficient | 97 | 32.3% | small_model_cloud 70, small_model_local 27
Direct search / lookup | 23 | 7.7% | web_search_simple 14, reference_lookup 9
Search + reading | 10 | 3.3% | web_search_research
Local tool / software | 8 | 2.7% | local_deterministic_tool 5, native_software_workflow 3
Reasoning / agent required | 8 | 2.7% | reasoning_llm 7, llm_with_tools 1
Expert / not comparable | 4 | 1.3% | human_expert 1, not_comparable 3

Core Message

About half of GPT-5.5 chat energy is a routing choice.

In this 300-row pilot, the MECE route assignment cuts GPT-5.5 central cloud energy from 1,192.4 Wh to 613.0 Wh, a 48.6% reduction before any change to model architecture. The largest opportunity is not Google Search alone; it is routing small, simple, and software-native tasks away from high-compute LLM paths.

Energy Multipliers

Use one additive route model before comparing alternatives

Every pilot row's energy is decomposed into base visible inference, additional model responses, a reasoning add-on, a search add-on, and a tool add-on. Each task is then compared with its assigned MECE route: local software, direct search, search plus reading, small model, standard LLM, reasoning/agent, or expert.

GPT-5.5 central saved 579.3 Wh

1,192.4 Wh GPT-5.5 replay minus 613.0 Wh under direct MECE routing.

GPT-5.5 central CO2 saved 0.228 kg

Uses the EPA U.S. average electricity factor of 0.394 kgCO2/kWh.

GPT-5.5 heavy saved 1000.1 Wh

Heavy GPT-5.5 active-compute sensitivity.

GPT-5.5 heavy CO2 saved 0.394 kg

Same route labels, heavier GPT-5.5 path.

MECE route bucket | Share | GPT-5.5 central avg | Route avg | Multiplier | Central saved
Local tool / software | 8 / 300, 2.7% | 6.51 Wh | ~0 cloud Wh | local / 0 | 52.1 Wh
Direct search / lookup | 23 / 300, 7.7% | 1.57 Wh | 0.46 Wh | 3.4x | 25.5 Wh
Search + reading | 10 / 300, 3.3% | 2.36 Wh | 1.14 Wh | 2.1x | 12.2 Wh
Small model sufficient | 97 / 300, 32.3% | 3.46 Wh | 0.073 Wh | 47.5x | 328.6 Wh
Standard LLM sufficient | 150 / 300, 50.0% | 4.49 Wh | 3.47 Wh | 1.3x | 151.7 Wh
Reasoning / agent required | 8 / 300, 2.7% | 7.27 Wh | 6.12 Wh | 1.2x | 9.2 Wh
Expert / not comparable | 4 / 300, 1.3% | 3.47 Wh | 3.47 Wh | 1.0x | 0.0 Wh
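The savings column follows directly from each bucket's count and per-row averages. A short sketch to reproduce it, using the page's own figures (small deviations, e.g. in the standard-LLM row, come from rounding in the published averages):

```python
# (count, GPT-5.5 central avg Wh, routed avg Wh) per MECE bucket,
# taken from the table above.
buckets = {
    "local_tool":      (8,   6.51, 0.0),
    "direct_search":   (23,  1.57, 0.46),
    "search_reading":  (10,  2.36, 1.14),
    "small_model":     (97,  3.46, 0.073),
    "standard_llm":    (150, 4.49, 3.47),
    "reasoning_agent": (8,   7.27, 6.12),
    "expert":          (4,   3.47, 3.47),
}

total_saved = 0.0
for name, (count, central_avg, route_avg) in buckets.items():
    saved = count * (central_avg - route_avg)
    total_saved += saved
    print(f"{name}: {saved:.1f} Wh saved")

# The total lands within about 2 Wh of the 579.3 Wh headline,
# the gap being rounding in the published per-row averages.
print(f"total: {total_saved:.1f} Wh")
```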

Interpretation

Search is one route, not the whole story.

Averages differ because the conversations differ in length, turn count, and model route. Under GPT-5.5 central, direct-search cases average 1.57 Wh on the LLM path versus 0.46 Wh on the search path. Small-model cases show the largest gap: 3.46 Wh on GPT-5.5 versus 0.073 Wh on the assigned small-model route.

Scale-Up

Small savings per chat become TWh-scale at global AI volume

The pilot saves an average of 1.93 to 3.33 Wh per row under the GPT-5.5 central and heavy assumptions. Scaling that routing intensity shows what the opportunity looks like at platform and global volume.

Human population 8.3B

Approximate 2026 world population.

ChatGPT weekly users 800M+

OpenAI public usage anchor.

Saved per pilot row 1.93-3.33 Wh

GPT-5.5 central to GPT-5.5 heavy range.

Conversion 0.394 kg/kWh

EPA U.S. average electricity factor.

Scale scenario | Conversations/year | Energy saved | CO2 saved | Forest equivalent
800M weekly users × 5 chats/week | 208B | 0.40-0.69 TWh/year | 0.16-0.27 MtCO2/year | 158k-273k acres of U.S. forest/year
1B users × 5 chats/day | 1.83T | 3.52-6.08 TWh/year | 1.39-2.40 MtCO2/year | 1.39M-2.40M acres of U.S. forest/year
10B AI chats/day globally | 3.65T | 7.05-12.17 TWh/year | 2.78-4.79 MtCO2/year | 2.78M-4.79M acres of U.S. forest/year
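The scenario arithmetic is a straight multiplication. A minimal sketch, assuming the page's per-row savings range and EPA grid factor:

```python
# Scale the pilot's per-row savings (1.93-3.33 Wh, central vs heavy)
# to annual volume, converting with the EPA U.S. average grid factor.
EPA_KG_PER_KWH = 0.394
SAVED_WH_LOW, SAVED_WH_HIGH = 1.93, 3.33

def scale(conversations_per_year):
    """Return (TWh/yr low, TWh/yr high, MtCO2/yr low, MtCO2/yr high)."""
    low_twh = conversations_per_year * SAVED_WH_LOW / 1e12
    high_twh = conversations_per_year * SAVED_WH_HIGH / 1e12
    # 1 TWh = 1e9 kWh; 1 Mt = 1e9 kg
    low_mt = low_twh * 1e9 * EPA_KG_PER_KWH / 1e9
    high_mt = high_twh * 1e9 * EPA_KG_PER_KWH / 1e9
    return low_twh, high_twh, low_mt, high_mt

# First scenario: 800M weekly users × 5 chats/week × 52 weeks ≈ 208B/year;
# matches the 0.40-0.69 TWh and 0.16-0.27 MtCO2 row in the table.
low_twh, high_twh, low_mt, high_mt = scale(800e6 * 5 * 52)
print(round(low_twh, 2), round(high_twh, 2), round(low_mt, 2), round(high_mt, 2))
```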

Research Rule

Classify the task first; compare energy second; keep human time as a separate axis.

Human time is not converted into carbon. This page computes cloud energy by direct MECE route assignment; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.

Method

The route model is additive

GPT-5.5 central uses a 0.85 Wh base response-equivalent and a 6.5 Wh reasoning total. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.

Base visible inference: E_base × max(responses, tokens / reference_tokens)
Reasoning add-on: max(0, E_reason_total - E_base) × responses
Search add-on: 0.30 Wh × search calls
Cloud carbon: CO2 = (E / 1000) × 0.394 kg/kWh
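The additive model above can be sketched directly. The function names and the `reference_tokens` normalizer are illustrative assumptions; the coefficients are the page's GPT-5.5 central values.

```python
# Additive per-conversation energy model, GPT-5.5 central coefficients.
E_BASE = 0.85          # Wh, base response-equivalent
E_REASON_TOTAL = 6.5   # Wh, total when a reasoning path is invoked
E_SEARCH = 0.30        # Wh per search query
EPA_KG_PER_KWH = 0.394

def conversation_energy_wh(responses, tokens, reference_tokens,
                           reasoning, search_calls):
    # Base visible inference: scale by responses or by token length,
    # whichever dominates.
    base = E_BASE * max(responses, tokens / reference_tokens)
    # Reasoning add-on: the increment over the base, per response.
    reason = max(0.0, E_REASON_TOTAL - E_BASE) * responses if reasoning else 0.0
    search = E_SEARCH * search_calls
    return base + reason + search

def cloud_carbon_kg(energy_wh):
    return energy_wh / 1000 * EPA_KG_PER_KWH
```

For example, a two-response conversation with reasoning on and one search call (900 visible tokens against an assumed 600-token reference) comes out to 1.7 + 11.3 + 0.3 = 13.3 Wh.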
Status quo cloud carbon 0.470 kgCO2

1,192.4 Wh across 300 pilot conversations.

Routed cloud carbon 0.242 kgCO2

613.0 Wh after assigning each row to its MECE route.

Cloud carbon saved 0.228 kgCO2

Pilot-scale value; platform-scale value is shown in the scale-up section.

Appendix

Data coverage and GPT-5.5 coefficient derivation

The main story uses a GPT-5.5 replay. The details below show where the pilot conversations came from and how the GPT-5.5 energy assumptions are anchored.

GPT-4o anchor 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

GPT-5.5 central base 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

GPT-5.5 reasoning total 6.5 Wh

Central estimate when GPT-5.5 reasoning is invoked.

GPT-5.5 heavy path 1.5 Wh / 10 Wh

Heavy standard base / reasoning total for Pro-style paths.
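The coefficient derivation is a one-line multiplication from the GPT-4o anchor. A minimal check, using the page's stated active-compute multipliers:

```python
# Derive the GPT-5.5 base coefficients from the GPT-4o anchor using
# the page's stated active-compute multipliers.
GPT4O_ANCHOR_WH = 0.34

central_base = GPT4O_ANCHOR_WH * 2.5  # central: 2.5x anchor -> 0.85 Wh
heavy_base = GPT4O_ANCHOR_WH * 4.4    # heavy: 4.4x anchor -> ~1.5 Wh

print(round(central_base, 2), round(heavy_base, 1))
```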

Parameter Proxy

Define GPT-5.5 by active compute.

GPT-5.5 is treated as a larger product surface than GPT-4o: a 1,050,000-token context window, 128,000-token maximum output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model accounts for active compute, token length, reasoning steps, retrieval, and tool loops.

GPT-5.5 assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent route | Interpretation
GPT-5.5 central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main calculation used in the page headline
GPT-5.5 heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy routes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows

Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 79 | 2023-2024 | Observed demand | Replay through GPT-5.5-era router
GPT-4 early / turbo / preview | 30 | 2023-2024 | Observed demand | Replay with GPT-5.5 central/heavy
GPT-4o | 69 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 44 | 2024-2025 | Small-model route evidence | Lower-compute alternative
o1 reasoning | 39 | 2024 | Reasoning route evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 39 | 2025 | Observed demand | Replay with GPT-5.5 central/heavy

Sources

Sources and anchors for the calculation