AI Sustainability Research

AI Over-Compute

Pilot design before the headline

We start with 300 real LLM conversations.

The pilot draws 300 conversations from WildChat-4.8M, a public dataset of real user-chatbot interactions. After duplicate classified rows are removed, 295 unique conversations remain, and those are what this page reports. Each conversation is labeled with the task it performs and the lowest-compute route that could plausibly satisfy it.

Source: WildChat. Real user-chatbot conversations from the public WildChat-4.8M dataset.
Pilot draw: 300. Conversations sampled from 2023-04-11 to 2025-07-08.
Reported sample: 295. Unique conversations after deduplicating classified rows.
Classification: 3-way. Gemini, Claude, and Codex votes merged by majority rule.

Data And Labels

Each chat is converted into a routing decision

For each conversation, the classifiers label the user's intent, observed route, lowest sufficient route, feasibility under a five-minute human-time constraint, and energy inputs such as visible tokens, model responses, reasoning, search, and tool calls.

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Intent label

Search, writing, coding, calculation, local software, reasoning, tool workflow, or not comparable.

3 Lowest sufficient route

Direct search, local tool, small model, standard model, reasoning model, expert, or tool agent.

4 Cost replay

Replay the same task under GPT-5.5 central/heavy and switch only if extra human effort is ≤ 5 min.
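The four steps above can be collected into one per-conversation record. The class and field names below are illustrative, not the pilot's actual schema; they only mirror the labels described on this page.

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    """One classified WildChat conversation (field names are illustrative)."""
    intent: str                   # e.g. "search", "writing", "coding"
    observed_route: str           # route the user actually took
    lowest_sufficient_route: str  # cheapest route that still satisfies the task
    extra_human_minutes: float    # estimated extra effort on the cheaper route
    responses: int                # model responses observed in the conversation
    search_calls: int             # retrieval queries, if any

    def feasible_switch(self, tau_minutes: float = 5.0) -> bool:
        """A switch is feasible when a cheaper route exists and the extra
        human effort stays within the tau-minute budget."""
        return (self.lowest_sufficient_route != self.observed_route
                and self.extra_human_minutes <= tau_minutes)

decision = RoutingDecision(
    intent="search",
    observed_route="standard model",
    lowest_sufficient_route="direct search",
    extra_human_minutes=2.0,
    responses=1,
    search_calls=0,
)
print(decision.feasible_switch())  # True
```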

Core Message

A third of GPT-5.5 chat energy is a routing choice.

In this pilot, 197 of 295 unique conversations have a lower-compute route that stays within the five-minute constraint. Replaying the tasks through GPT-5.5 central cuts cloud energy from 1188.1 Wh to 787.6 Wh, a 33.7% reduction before any change to model architecture.
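A quick arithmetic check of the headline figures, using the pilot totals reported on this page:

```python
# Status-quo vs policy cloud energy for the 295-conversation pilot.
status_quo_wh = 1188.1
policy_wh = 787.6

saved_wh = status_quo_wh - policy_wh
reduction = saved_wh / status_quo_wh
print(round(saved_wh, 1), f"{reduction:.1%}")  # 400.5 33.7%
```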

Energy Multipliers

Use one additive route model before comparing alternatives

Every conversation is decomposed into base visible inference, additional model responses, a reasoning add-on, a search add-on, and a tool add-on. Each task is then compared with the lowest-compute route that still keeps extra human effort under five minutes.
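A minimal sketch of the additive decomposition, using the central coefficients stated in the Method section (0.85 Wh base, 6.5 Wh reasoning total, 0.30 Wh per search). Token-length scaling and the tool add-on are omitted here for brevity.

```python
# Central-path coefficients from the Method section.
E_BASE = 0.85         # Wh per visible response-equivalent, GPT-5.5 central
E_REASON_TOTAL = 6.5  # Wh total when GPT-5.5 reasoning is invoked
E_SEARCH = 0.30       # Wh per retrieval query

def conversation_energy_wh(responses: int, reasoning: bool, search_calls: int) -> float:
    """Additive route model: base inference plus reasoning and search add-ons."""
    base = E_BASE * responses
    reasoning_addon = max(0.0, E_REASON_TOTAL - E_BASE) * responses if reasoning else 0.0
    search_addon = E_SEARCH * search_calls
    return base + reasoning_addon + search_addon

# One reasoning response with two searches: 0.85 + 5.65 + 0.60 = 7.10 Wh.
print(round(conversation_energy_wh(1, True, 2), 2))  # 7.1
```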

GPT-5.5 central saved 400.5 Wh

1188.1 Wh status quo minus 787.6 Wh under the five-minute policy.

GPT-5.5 central CO2 0.158 kg

Uses the EPA U.S. average electricity factor of 0.394 kgCO2/kWh.

GPT-5.5 heavy saved 680.7 Wh

Heavy GPT-5.5 active-compute sensitivity.

GPT-5.5 heavy CO2 0.268 kg

Same routing labels and time constraint, heavier GPT-5.5 path.

Switch type | Feasible | GPT-5.5 central | Alt route (central) | GPT-5.5 heavy | Alt route (heavy)
Standard LLM → small model | 78 / 89 | 2.94 Wh = base | 0.075 Wh, 39.2x | 5.18 Wh = base | 0.075 Wh, 69.1x
Reasoning → cheaper route | 23 / 31 | 7.54 Wh = 1.89 base + 5.65 reasoning | 1.634 Wh, 4.6x | 11.84 Wh = 3.34 base + 8.50 reasoning | 2.882 Wh, 4.1x
LLM → local tool/software | 5 / 7 | 2.27 Wh = base | ~0 cloud Wh | 4.01 Wh = base | ~0 cloud Wh
LLM → direct search/lookup | 14 / 23 | 1.00 Wh = base | 0.30 Wh, 3.3x | 1.77 Wh = base | 0.30 Wh, 5.9x
LLM → search plus reading | 0 / 10 | not switched | not switched | not switched | not switched

Interpretation

Search is one route, not the whole story.

Averages differ because the conversations differ in length and number of model responses. Under GPT-5.5 central, direct-search cases average 1.00 Wh, local-tool cases average 2.27 Wh, and reasoning-downshift cases average 7.54 Wh. Search adds another 0.30 Wh per query when retrieval is actually used.

Scale-Up

Small savings per chat become TWh-scale at global AI volume

The pilot saves 1.36 to 2.31 Wh per routed conversation under GPT-5.5 central and heavy assumptions. Scaling that routing intensity shows what the opportunity looks like at platform and global volume.

Human population 8.3B

Approximate 2026 world population.

ChatGPT weekly users 800M+

OpenAI public usage anchor.

Saved per routed conversation 1.36-2.31 Wh

GPT-5.5 central to GPT-5.5 heavy range.

Conversion 0.394 kg/kWh

EPA U.S. average electricity factor.

Scale scenario | Conversations/year | Energy saved | CO2 saved | Forest equivalent
800M weekly users × 5 chats/week | 208B | 0.28-0.48 TWh/year | 0.11-0.19 MtCO2/year | 110k-190k acres of U.S. forest/year
1B users × 5 chats/day | 1.83T | 2.48-4.21 TWh/year | 0.98-1.66 MtCO2/year | 0.98M-1.66M acres of U.S. forest/year
10B AI chats/day globally | 3.65T | 4.96-8.42 TWh/year | 1.95-3.32 MtCO2/year | 1.95M-3.32M acres of U.S. forest/year
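The scenario rows can be reproduced from the per-conversation savings range and the grid factor used throughout this page. A sketch, shown here for the 208B-conversations/year case:

```python
# Scale the per-conversation savings to annual volume (anchors from this page).
SAVED_WH = (1.36, 2.31)   # central / heavy Wh saved per conversation
GRID_KG_PER_KWH = 0.394   # EPA U.S. average electricity factor

def scale(conversations_per_year: float):
    """Return (TWh saved, MtCO2 saved) ranges for a yearly conversation volume."""
    twh = [conversations_per_year * s / 1e12 for s in SAVED_WH]  # 1 TWh = 1e12 Wh
    # 1 TWh = 1e9 kWh and 1 Mt = 1e9 kg, so MtCO2 = TWh × kg/kWh.
    mt_co2 = [t * GRID_KG_PER_KWH for t in twh]
    return twh, mt_co2

# 800M weekly users × 5 chats/week ≈ 208B conversations/year.
twh, mt_co2 = scale(208e9)
print([round(t, 2) for t in twh], [round(m, 2) for m in mt_co2])  # [0.28, 0.48] [0.11, 0.19]
```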

Policy Rule

Minimize cloud energy subject to human time ≤ 5 minutes.

Human time is the constraint, not a carbon term. A lower-compute route is used only when it keeps the user's estimated extra effort within the five-minute budget.
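The rule can be sketched as a constrained minimization over candidate routes. The route names and the example energy/effort values below are illustrative; the observed route always has zero extra effort, so at least one route is feasible.

```python
def choose_route(routes: dict[str, tuple[float, float]], tau_minutes: float = 5.0) -> str:
    """Pick the route minimizing cloud energy subject to extra human time <= tau.

    `routes` maps route name -> (cloud_wh, extra_human_minutes).
    """
    feasible = {name: wh for name, (wh, minutes) in routes.items()
                if minutes <= tau_minutes}
    return min(feasible, key=feasible.get)

routes = {
    "standard model": (2.94, 0.0),  # observed route, no extra effort
    "small model": (0.075, 2.0),    # cheaper, small verification cost
    "expert": (0.0, 45.0),          # infeasible: exceeds the 5-minute budget
}
print(choose_route(routes))  # small model
```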

Method

The route model is additive

GPT-5.5 central uses a 0.85 Wh base response-equivalent and a 6.5 Wh reasoning total. Search adds 0.30 Wh per query. Carbon is electricity multiplied by the EPA U.S. average grid factor.

Base visible inference: Ebase × max(responses, tokens / reference)
Reasoning add-on: max(0, Ereason_total - Ebase) × responses
Search add-on: 0.30 Wh × search calls
Cloud carbon: CO2 = E / 1000 × 0.394 kg/kWh
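As a check on the carbon figures, the energy-to-CO2 conversion reproduces the pilot totals reported on this page:

```python
GRID_KG_PER_KWH = 0.394  # EPA U.S. average electricity factor

def cloud_co2_kg(energy_wh: float) -> float:
    """CO2 = E / 1000 × 0.394 kg/kWh, as in the route model."""
    return energy_wh / 1000 * GRID_KG_PER_KWH

status_quo = cloud_co2_kg(1188.1)  # ≈ 0.468 kg across the 295 pilot conversations
policy = cloud_co2_kg(787.6)       # ≈ 0.310 kg after feasible switching
print(round(status_quo, 3), round(policy, 3), round(status_quo - policy, 3))
```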
Status quo cloud carbon 0.468 kgCO2

1188.1 Wh across 295 unique pilot conversations.

Policy cloud carbon 0.310 kgCO2

787.6 Wh after feasible switching under τ = 5 minutes.

Cloud carbon saved 0.158 kgCO2

Pilot-scale value; platform-scale value is shown in the scale-up section.

Appendix

Data coverage and GPT-5.5 coefficient derivation

The main story uses a GPT-5.5 replay. The details below show where the pilot conversations came from and how the GPT-5.5 energy assumptions are anchored.

GPT-4o anchor 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

GPT-5.5 central base 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

GPT-5.5 reasoning total 6.5 Wh

Central estimate when GPT-5.5 reasoning is invoked.

GPT-5.5 heavy path 1.5 Wh / 10 Wh

Heavy standard base / reasoning total for Pro-style paths.
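The central and heavy base coefficients follow directly from the GPT-4o anchor and the assumed active-compute multipliers; a quick verification:

```python
GPT4O_ANCHOR_WH = 0.34  # upper end of the public GPT-4o standard-query anchor

# Active-compute multipliers assumed for the two GPT-5.5 sensitivity paths.
central_base = GPT4O_ANCHOR_WH * 2.5  # GPT-5.5 central base
heavy_base = GPT4O_ANCHOR_WH * 4.4    # GPT-5.5 heavy base
print(round(central_base, 2), round(heavy_base, 2))  # 0.85 1.5
```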

Parameter Proxy

Define GPT-5.5 by active compute.

GPT-5.5 is treated as a larger product surface than GPT-4o: a 1,050,000-token context window, 128,000 max output tokens, reasoning-token support, and $5/$30 per million input/output tokens. The replay model accounts for active compute, token length, reasoning steps, retrieval, and tool loops.

GPT-5.5 assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent route | Interpretation
GPT-5.5 central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main calculation used in the page headline
GPT-5.5 heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy routes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows
Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 79 | 2023-2024 | Observed demand | Replay through GPT-5.5-era router
GPT-4 early / turbo / preview | 30 | 2023-2024 | Observed demand | Replay with GPT-5.5 central/heavy
GPT-4o | 69 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 44 | 2024-2025 | Small-model route evidence | Lower-compute alternative
o1 reasoning | 39 | 2024 | Reasoning route evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 39 | 2025 | Observed demand | Replay with GPT-5.5 central/heavy

Sources

Sources and anchors for the calculation