AI Sustainability Research

AI Over-Compute

Pilot design before the headline

We start with 3,000 labeled chat episodes.

The pilot draws 3,000 chat episodes from WildChat-4.8M, a public dataset of real user-chatbot interactions. The unit of analysis is the 3,000-row pilot draw. Each row is counted once, classified into the task it performs, and assigned exactly one lowest-compute route that could satisfy the request.

Source: WildChat. Real user-chatbot conversations from the public WildChat-4.8M dataset.
Pilot rows: 3,000. Chat episodes sampled from 2023-04-09 to 2025-07-31.
MECE route buckets: 7. Every row belongs to one and only one route bucket.
Classification: Gemini. Single-model labels with strict 25/25 batch validation.

Data And Labels

Each chat is converted into a routing decision

For each row, Gemini labels the user's intent, observed route, lowest sufficient route, and energy inputs such as visible tokens, model responses, reasoning, search, and tool calls. The route labels below are mutually exclusive and collectively exhaustive across the 3,000 rows.

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Task label

Factual lookup, writing, coding, calculation, local software, reasoning, tool workflow, or not comparable.

3 MECE route

Local tool, direct search, search plus reading, small model, standard LLM, reasoning/agent, or expert.

4 Cost replay

Replay the same task under GPT-5.5 central/heavy, then compare it with the assigned route directly.
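The four steps above imply one labeled record per chat. A minimal sketch of that record, where the field names are illustrative assumptions rather than the actual pipeline schema:

```python
from dataclasses import dataclass

# Illustrative per-row record for the routing pipeline; field names
# are assumptions, not the real schema.
@dataclass
class PilotRow:
    prompt: str          # WildChat user prompt
    model: str           # observed model name, e.g. "gpt-4o"
    timestamp: str       # conversation date
    turns: int           # turn count
    task: str            # step 2 task label, e.g. "factual_lookup"
    mece_route: str      # step 3: exactly one lowest sufficient route
    responses: int       # visible model responses
    tokens: int          # visible output tokens
    search_calls: int    # search invocations
    tool_calls: int      # tool invocations
    used_reasoning: bool # whether a reasoning path was invoked

row = PilotRow("What year did the Berlin Wall fall?", "gpt-4o",
               "2024-05-01", 1, "factual_lookup", "web_search_simple",
               1, 40, 0, 0, False)
```

Step 4 then replays the same record under GPT-5.5 central/heavy and under `mece_route`, and compares the two energies.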

MECE route bucket | Count | Share | Underlying labels
Standard LLM sufficient | 1,353 | 45.1% | standard_llm
Small model sufficient | 1,294 | 43.1% | small_model_cloud 1,083, small_model_local 211
Direct search / lookup | 191 | 6.4% | web_search_simple 110, reference_lookup 81
Search + reading | 36 | 1.2% | web_search_research 36
Local tool / software | 36 | 1.2% | local_deterministic_tool 24, native_software_workflow 12
Reasoning / agent required | 70 | 2.3% | reasoning_llm 67, llm_with_tools 3
Expert / not comparable | 20 | 0.7% | human_expert 2, not_comparable 18
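As a consistency check, the bucket counts above partition the 3,000 rows exactly, and the shares follow directly from the counts:

```python
# Route bucket counts from the pilot table; shares are count / 3,000.
buckets = {
    "standard_llm_sufficient": 1353,
    "small_model_sufficient": 1294,
    "direct_search_lookup": 191,
    "search_plus_reading": 36,
    "local_tool_software": 36,
    "reasoning_agent_required": 70,
    "expert_not_comparable": 20,
}
total = sum(buckets.values())
assert total == 3000  # MECE: every row falls in exactly one bucket

shares = {k: round(100 * v / total, 1) for k, v in buckets.items()}
print(shares["small_model_sufficient"])  # 43.1
```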

Core Message

About half of GPT-5.5 chat energy is a routing choice.

In this 3,000-row pilot, the MECE route assignment cuts GPT-5.5 central cloud energy from 11,410.3 Wh to 5,668.3 Wh, a 50.3% reduction before any change to model architecture. The largest opportunity is not Google Search alone; it is routing small, simple, and software-native tasks away from high-compute LLM paths.

Energy Multipliers

Use one additive route model before comparing alternatives

Every pilot row is decomposed into base visible inference, additional model responses, a reasoning add-on, a search add-on, and a tool add-on. Each task is then compared with its assigned MECE route: local software, direct search, search plus reading, small model, standard LLM, reasoning/agent, or expert.

GPT-5.5 central saved 5,742 Wh

11,410.3 Wh GPT-5.5 replay minus 5,668.3 Wh under direct MECE routing.

GPT-5.5 central CO2 2.262 kg

Uses the EPA U.S. average electricity factor of 0.394 kgCO2/kWh.

GPT-5.5 heavy saved 9,853.5 Wh

Heavy GPT-5.5 active-compute sensitivity.

GPT-5.5 heavy CO2 3.882 kg

Same route labels, heavier GPT-5.5 path.

MECE route bucket | Share | GPT-5.5 central avg | Route avg | Multiplier | Central saved
Local tool / software | 36 / 3,000 (1.2%) | 2.24 Wh | ~0 cloud Wh | n/a (local) | 80.6 Wh
Direct search / lookup | 191 / 3,000 (6.4%) | 1.74 Wh | 0.405 Wh | 4.3x | 255.7 Wh
Search + reading | 36 / 3,000 (1.2%) | 2.78 Wh | 0.983 Wh | 2.8x | 64.5 Wh
Small model sufficient | 1,294 / 3,000 (43.1%) | 3.11 Wh | 0.076 Wh | 41.0x | 3,924.6 Wh
Standard LLM sufficient | 1,353 / 3,000 (45.1%) | 4.62 Wh | 3.663 Wh | 1.3x | 1,291.6 Wh
Reasoning / agent required | 70 / 3,000 (2.3%) | 8.10 Wh | 6.314 Wh | 1.3x | 125.0 Wh
Expert / not comparable | 20 / 3,000 (0.7%) | 2.98 Wh | 2.981 Wh | 1.0x | 0.0 Wh
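Each bucket's savings follow from (central average − route average) × count. A sketch using the table's rounded averages; small differences against the published 5,742 Wh total are rounding artifacts:

```python
# (bucket, count, GPT-5.5 central avg Wh, routed avg Wh), from the table.
rows = [
    ("local_tool",      36,   2.24, 0.0),
    ("direct_search",   191,  1.74, 0.405),
    ("search_reading",  36,   2.78, 0.983),
    ("small_model",     1294, 3.11, 0.076),
    ("standard_llm",    1353, 4.62, 3.663),
    ("reasoning_agent", 70,   8.10, 6.314),
    ("expert",          20,   2.98, 2.981),
]
total_saved = 0.0
for name, n, central, routed in rows:
    mult = central / routed if routed > 0 else float("inf")
    total_saved += max(0.0, central - routed) * n

# Rounded averages reproduce the published 5,742 Wh to within a few Wh.
assert abs(total_saved - 5742.0) < 10.0
```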

Interpretation

Search is one route, not the whole story.

Averages differ because the conversations differ in length, turn count, and model route. Under GPT-5.5 central, direct-search cases average 1.74 Wh on the LLM path and 0.405 Wh on the search path. Small-model cases show the largest gap: 3.11 Wh on GPT-5.5 versus 0.076 Wh on the assigned small-model route.

Scale-Up

Small savings per chat become TWh-scale at global AI volume

The pilot saves 1.91 to 3.28 Wh per row under GPT-5.5 central and heavy assumptions. Scaling that routing intensity shows what the opportunity looks like at platform and global volume.

Human population 8.3B

Approximate 2026 world population.

ChatGPT weekly users 800M+

OpenAI public usage anchor.

Saved per pilot row 1.91-3.28 Wh

GPT-5.5 central to GPT-5.5 heavy range.

Conversion 0.394 kg/kWh

EPA U.S. average electricity factor.

Scale scenario | Conversations/year | Energy saved | CO2 saved | Forest equivalent
800M weekly users × 5 chats/week | 208B | 0.40-0.68 TWh/year | 0.16-0.27 MtCO2/year | 157k-269k acres of U.S. forest/year
1B users × 5 chats/day | 1.83T | 3.49-5.99 TWh/year | 1.38-2.36 MtCO2/year | 1.38M-2.36M acres of U.S. forest/year
10B AI chats/day globally | 3.65T | 6.99-11.99 TWh/year | 2.75-4.72 MtCO2/year | 2.75M-4.72M acres of U.S. forest/year
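The scenario rows are simple multiplications of per-row savings by annual conversation volume. A sketch of the first row under the central assumption, using the page's EPA grid factor:

```python
# Per-row savings from the pilot (Wh per conversation).
SAVED_WH_CENTRAL = 5742.0 / 3000    # ~1.91 Wh/row, GPT-5.5 central
SAVED_WH_HEAVY = 9853.5 / 3000      # ~3.28 Wh/row, GPT-5.5 heavy
GRID_KG_PER_KWH = 0.394             # EPA U.S. average electricity factor

def annual_savings(conversations_per_year, saved_wh_per_row):
    """Return (TWh/year, MtCO2/year) for a scale scenario."""
    twh = conversations_per_year * saved_wh_per_row / 1e12  # Wh -> TWh
    mt_co2 = twh * GRID_KG_PER_KWH  # 1 TWh * 0.394 kg/kWh = 0.394 Mt
    return twh, mt_co2

# 800M weekly users x 5 chats/week x 52 weeks ~ 208B conversations/year
twh, mt = annual_savings(800e6 * 5 * 52, SAVED_WH_CENTRAL)
print(round(twh, 2), round(mt, 2))  # 0.4 0.16
```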

Research Rule

Classify the task first; compare energy second; keep human time as a separate axis.

Human time is not converted into carbon. This page computes cloud energy by direct MECE route assignment; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.

Method

The route model is additive

GPT-5.5 central uses a 0.85 Wh base response-equivalent and a 6.5 Wh reasoning total. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.

Base visible inference: E_base × max(responses, tokens / reference)
Reasoning add-on: max(0, E_reason_total − E_base) × responses
Search add-on: 0.30 Wh × search calls
Cloud carbon: CO2 = (E / 1000) × 0.394 kg/kWh
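A minimal sketch of the additive model with the central coefficients (E_base = 0.85 Wh, E_reason_total = 6.5 Wh, 0.30 Wh per search call). The reference token length is left as a free parameter because the page does not state it:

```python
E_BASE = 0.85          # Wh, GPT-5.5 central base response-equivalent
E_REASON_TOTAL = 6.5   # Wh, total when reasoning is invoked
E_SEARCH = 0.30        # Wh per search call
GRID = 0.394           # kgCO2 per kWh, EPA U.S. average

def row_energy_wh(responses, tokens, ref_tokens, searches, reasoning):
    """Additive route model; ref_tokens (reference output length per
    response-equivalent) is an assumption, not stated on the page."""
    base = E_BASE * max(responses, tokens / ref_tokens)
    reason = (max(0.0, E_REASON_TOTAL - E_BASE) * responses
              if reasoning else 0.0)
    return base + reason + E_SEARCH * searches

def carbon_kg(energy_wh):
    return energy_wh / 1000 * GRID

e = row_energy_wh(responses=1, tokens=300, ref_tokens=500,
                  searches=0, reasoning=False)
print(round(e, 2))                   # 0.85
print(round(carbon_kg(11410.3), 3))  # 4.496 (status quo pilot total)
```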
Status quo cloud carbon 4.496 kgCO2

11,410.3 Wh across 3,000 pilot rows.

Routed cloud carbon 2.233 kgCO2

5,668.3 Wh after assigning each row to its MECE route.

Cloud carbon saved 2.262 kgCO2

Pilot-scale value; platform-scale value is shown in the scale-up section.

Appendix

Data coverage and GPT-5.5 coefficient derivation

The main story uses a GPT-5.5 replay. The details below show where the pilot conversations came from and how the GPT-5.5 energy assumptions are anchored.

GPT-4o anchor 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

GPT-5.5 central base 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

GPT-5.5 reasoning total 6.5 Wh

Central estimate when GPT-5.5 reasoning is invoked.

GPT-5.5 heavy path 1.5 Wh / 10 Wh

Heavy standard base / reasoning total for Pro-style paths.
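The GPT-5.5 base coefficients above follow from the GPT-4o anchor and the active-compute multipliers; a sketch of the arithmetic:

```python
# Public GPT-4o standard-query anchor (upper end of the 0.31-0.34 Wh range).
GPT4O_ANCHOR_WH = 0.34

central_base = GPT4O_ANCHOR_WH * 2.5  # GPT-5.5 central assumption
heavy_base = GPT4O_ANCHOR_WH * 4.4    # GPT-5.5 heavy sensitivity

print(round(central_base, 2))  # 0.85
print(round(heavy_base, 1))    # 1.5
```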

Parameter Proxy

Define GPT-5.5 by active compute.

GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops.

GPT-5.5 assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent route | Interpretation
GPT-5.5 central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main calculation used in the page headline
GPT-5.5 heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy routes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows
Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 766 | 2023-2024 | Observed demand | Replay through GPT-5.5-era router
GPT-4 early / turbo / preview | 261 | 2023-2024 | Observed demand | Replay with GPT-5.5 central/heavy
GPT-4o | 755 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 451 | 2024-2025 | Small-model route evidence | Lower-compute alternative
o1 reasoning | 370 | 2024 | Reasoning route evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 397 | 2025 | Observed demand | Replay with GPT-5.5 central/heavy

Sources

Sources and anchors for the calculation