AI Sustainability Research

AI Over-Compute

Pilot design before the headline

We start with 10,000 labeled chat episodes.

The pilot draws 10,000 chat episodes from WildChat-4.8M, a public dataset of real user-chatbot interactions. The unit of analysis is the conversation episode. Each row is counted once, first interpreted as an actual user task, then decomposed into the capabilities needed to answer it.

Source | WildChat | Real user-chatbot conversations from WildChat-4.8M (public)
Pilot rows | 10,000 | Chat episodes sampled from 2023-04-09 to 2025-07-31
Execution tiers | 9 | T0 no-LLM paths, T1 small LLM, T2-T5 frontier paths, and T6 exclusions
Classification | Gemini | Gemini-only labels with strict exact-ID batch validation

Data And Labels

Each chat is converted into a task decomposition

For each row, Gemini first identifies the actual task the user is trying to complete. It then decomposes the task into needed capabilities: deterministic tools, search/lookup, small-model language work, standard frontier generation, long context, reasoning/test-time compute, tool agents, or expert judgment. The final tier is a consequence of that decomposition, not the starting assumption.

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Actual task

What the user is really trying to accomplish, not just the words in the prompt.

3 Task decomposition

Break the task into lookup, reading, generation, calculation, software, reasoning, tool, and expert components.

4 Energy replay

Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
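The four steps above can be sketched as a per-row labeling pipeline. This is a toy illustration under stated assumptions: the field names, the single worked example, and the one-rule tier decision are ours, not the project's actual schema or classifier.

```python
# Illustrative per-row labeling pipeline (field names and the tier rule
# are assumptions for illustration, not the pilot's real schema).

def label_row(raw):
    """Steps 2-4: actual task -> capability decomposition -> tier."""
    row = dict(raw)
    # Step 2: infer what the user is really trying to accomplish.
    row["actual_task"] = "unit conversion"
    # Step 3: decompose the task into needed capabilities.
    row["components"] = ["calculation"]
    # Step 4: the tier is a consequence of the decomposition, not the
    # starting assumption; a pure calculation needs no LLM at all (T0a).
    row["tier"] = "T0a" if row["components"] == ["calculation"] else "T2"
    return row

# Step 1: raw WildChat fields carried into the record.
raw = {"prompt": "Convert 250 g of flour to cups", "model": "gpt-4o",
       "timestamp": "2024-05-01", "language": "en", "turns": 1}
labeled = label_row(raw)
```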

Execution tier | Count | Share | Example
T0a No LLM: local tool/software | 113 | 1.1% | Calculator, spreadsheet, local script, or software workflow
T0b No LLM: search/lookup | 516 | 5.2% | Direct reference lookup or search result is enough
T0c No LLM: specialized tool/API | 15 | 0.2% | Non-LLM API, app, database, or specialized calculator
T1 Small LLM | 3,537 | 35.4% | Gemma-class simple translation, rewrite, extraction, short generation
T2 Frontier standard | 4,790 | 47.9% | Normal frontier writing, explanation, coding, and synthesis
T3 Frontier + long context | 109 | 1.1% | Frontier model mainly because context length or fidelity is the bottleneck
T4 Frontier + reasoning | 528 | 5.3% | Frontier model plus reasoning/test-time compute
T5 LLM agent + tools | 151 | 1.5% | LLM must orchestrate search, tools, APIs, or actions
T6 Expert / not comparable | 241 | 2.4% | High-stakes, unsafe, unclear, or excluded from simple substitution

Core Message

The headline depends on which frontier path people overuse.

The same 10,000 tasks save 10.0% if the baseline is standard frontier replay for every task. They save 76.0% if the baseline is reasoning-frontier replay for every task. That distinction is the research point: routing ordinary tasks away from reasoning/test-time compute matters far more than routing every task away from frontier models. We did not re-query GPT-4 or GPT-5.5; this is an energy replay over observed tasks using published energy coefficients as assumptions.

Counterfactual Accounting

Decompose first, then compare the same tasks two ways

Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.

1. Standard frontier replay 28,259.3 Wh

Counterfactual baseline: every task is answered through a standard frontier path.

2. Reasoning frontier replay 115,000.2 Wh

Stress baseline: every task is pushed through reasoning/test-time compute.

3. Task-matched execution 25,420.7 Wh

Minimum sufficient tiers: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion. The reasoning-baseline comparison matches the same tasks at 27,575.5 Wh, the total used in the 76.0% figure.

Saved vs standard frontier 10.0%

2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.

Saved vs reasoning frontier 76.0%

87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.

Small-model opportunity 35.4%

3,537 / 10,000 tasks need language capability, but not frontier execution.
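As a sanity check, the two headline percentages follow directly from the totals above. Note that the task-matched total differs between the two comparisons (25,420.7 Wh vs 27,575.5 Wh, the latter used in the Energy Multipliers section):

```python
# Reproducing the two headline savings figures from the reported totals.
standard_replay = 28_259.3    # Wh, every task via standard frontier
reasoning_replay = 115_000.2  # Wh, every task via reasoning/test-time compute
matched_standard = 25_420.7   # Wh, task-matched total, standard comparison
matched_reasoning = 27_575.5  # Wh, task-matched total, reasoning comparison

saved_std = standard_replay - matched_standard     # 2,838.6 Wh
saved_rsn = reasoning_replay - matched_reasoning   # 87,424.7 Wh
pct_std = 100 * saved_std / standard_replay        # ~10.0%
pct_rsn = 100 * saved_rsn / reasoning_replay       # ~76.0%
```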

Why This Equation

The numerator is avoidable cloud execution, not total AI energy.

The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?

Replay Definition

This is not asking GPT-4 to answer again.

"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.

Task tier | Reasoning replay | Task-matched | Saved vs reasoning | Share of reasoning savings | Why energy changes
T0a No LLM: local tool/software | 915.7 Wh | 0.0 Wh | 915.7 Wh | 1.0% | Deterministic local execution avoids cloud inference.
T0b No LLM: search/lookup | 4,949.8 Wh | 154.8 Wh | 4,794.9 Wh | 5.5% | Known facts and references are lower-compute lookup tasks.
T0c No LLM: specialized tool/API | 180.5 Wh | 0.8 Wh | 179.7 Wh | 0.2% | Domain tools replace general model generation.
T1 Small LLM | 32,750.7 Wh | 206.2 Wh | 32,544.5 Wh | 37.2% | Simple language work moves to Gemma-class small models.
T2 Frontier standard | 60,985.4 Wh | 15,470.7 Wh | 45,514.7 Wh | 52.1% | These tasks need frontier quality, but not reasoning mode.
T3 Frontier + long context | 2,273.4 Wh | 1,079.5 Wh | 1,193.9 Wh | 1.4% | Context/fidelity matters; reasoning is not the main bottleneck.
T4 Frontier + reasoning | 6,924.7 Wh | 6,924.7 Wh | 0.0 Wh | 0.0% | Reasoning/test-time compute is justified here.
T5 LLM agent + tools | 3,404.6 Wh | 1,123.4 Wh | 2,281.2 Wh | 2.6% | Tool orchestration is needed, but not every step is reasoning.
T6 Expert / not comparable | 2,615.4 Wh | 2,615.4 Wh | 0.0 Wh | 0.0% | Excluded from simple cloud-substitution savings.
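The saved-energy and share columns follow from the first two columns. A quick reproduction (the table's figures are rounded, so per-tier values match to within ~0.1 Wh):

```python
# Per-tier reasoning-replay and task-matched totals from the table (Wh).
tiers = {
    "T0a": (915.7, 0.0), "T0b": (4_949.8, 154.8), "T0c": (180.5, 0.8),
    "T1": (32_750.7, 206.2), "T2": (60_985.4, 15_470.7),
    "T3": (2_273.4, 1_079.5), "T4": (6_924.7, 6_924.7),
    "T5": (3_404.6, 1_123.4), "T6": (2_615.4, 2_615.4),
}
# Saved vs reasoning = replay - matched; share = tier saved / total saved.
saved = {t: r - m for t, (r, m) in tiers.items()}
total_saved = sum(saved.values())                  # ~87,424.7 Wh
share = {t: 100 * s / total_saved for t, s in saved.items()}
```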

Energy Multipliers

Use one additive task-energy model before comparing alternatives

Every pilot row is decomposed as base visible inference, multiple model responses, reasoning add-on, search add-on, and tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.

Saved vs standard frontier 2,838.6 Wh

(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.

Saved vs reasoning frontier 87,424.7 Wh

(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.

Reasoning replay CO2 cut 34.445 kg

Uses the EPA U.S. average electricity factor of 0.394 kg CO2/kWh.

Heavy reasoning sensitivity 132,009.4 Wh

Heavy frontier reasoning replay minus task-matched execution.

Execution tier | Share | Reasoning avg | Matched avg | Multiplier | Saved vs reasoning
T0a No LLM: local tool/software | 113 / 10,000 (1.1%) | 8.104 Wh | ~0 cloud Wh | local / 0 | 915.7 Wh
T0b No LLM: search/lookup | 516 / 10,000 (5.2%) | 9.593 Wh | 0.300 Wh | 32.0x | 4,794.9 Wh
T1 Small LLM | 3,537 / 10,000 (35.4%) | 9.259 Wh | 0.058 Wh | 159.0x | 32,544.5 Wh
T2 Frontier standard | 4,790 / 10,000 (47.9%) | 12.732 Wh | 3.230 Wh | 3.9x | 45,514.7 Wh
T3 Frontier + long context | 109 / 10,000 (1.1%) | 20.857 Wh | 9.904 Wh | 2.1x | 1,193.9 Wh
T4 Frontier + reasoning | 528 / 10,000 (5.3%) | 13.115 Wh | 13.115 Wh | 1.0x | 0.0 Wh
T5 LLM agent + tools | 151 / 10,000 (1.5%) | 22.547 Wh | 7.440 Wh | 3.0x | 2,281.2 Wh
T6 Expert / not comparable | 241 / 10,000 (2.4%) | 10.852 Wh | 10.852 Wh | excluded | 0.0 Wh
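The per-row averages and the multiplier column follow from the tier counts and totals; the multiplier is the ratio of the reasoning-replay total to the task-matched total. A sketch for three tiers (computed from the rounded totals, the T1 ratio comes out near 158.8, which the table shows as 159.0x):

```python
# tier: (row count, reasoning-replay total Wh, task-matched total Wh)
tiers = {
    "T0b": (516, 4_949.8, 154.8),
    "T1": (3_537, 32_750.7, 206.2),
    "T2": (4_790, 60_985.4, 15_470.7),
}
# Per-row averages: (reasoning avg, matched avg).
avg = {t: (r / n, m / n) for t, (n, r, m) in tiers.items()}
# Multiplier: how much more energy the reasoning replay uses per tier.
mult = {t: r / m for t, (n, r, m) in tiers.items()}
```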

Interpretation

Search is one route, not the whole story.

The biggest sustainability mistake is not the use of a frontier model for every task; it is the use of reasoning compute for ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.

Scale-Up

Small savings per chat become TWh-scale at global AI volume

If the avoidable behavior is reasoning-frontier overuse, task matching saves 8.74 to 13.20 Wh per row under the central and heavy sensitivity cases. Scaling that routing intensity shows the platform-level stakes.

Human population 8.3B

Approximate 2026 world population.

ChatGPT weekly users 800M+

OpenAI public usage anchor.

Saved per pilot row 8.74-13.20 Wh

Reasoning-frontier replay to task-matched execution.

Conversion 0.394 kg/kWh

EPA U.S. average electricity factor.

Scale scenario | Conversations/year | Energy saved | CO2 saved | Forest equivalent
800M weekly users × 5 chats/week | 208B | 1.82-2.75 TWh/year | 0.72-1.08 MtCO2/year | 0.72M-1.08M acres of U.S. forest/year
1B users × 5 chats/day | 1.83T | 15.95-24.09 TWh/year | 6.28-9.49 MtCO2/year | 6.28M-9.49M acres of U.S. forest/year
10B AI chats/day globally | 3.65T | 31.89-48.18 TWh/year | 12.56-18.98 MtCO2/year | 12.56M-18.98M acres of U.S. forest/year
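The first scenario row can be reproduced from the per-row savings range. Note the forest column implies roughly one acre of U.S. forest per tonne of CO2 per year; that equivalence is read off the table, not taken from an independent source.

```python
# Scale-up arithmetic for the "800M weekly users x 5 chats/week" row.
convs_per_year = 800e6 * 5 * 52            # = 208B conversations/year
saved_low, saved_high = 8.74, 13.20        # Wh saved per row (central, heavy)

twh_low = convs_per_year * saved_low / 1e12    # Wh -> TWh
twh_high = convs_per_year * saved_high / 1e12

# 1 TWh = 1e9 kWh, and 0.394 kg CO2/kWh => 0.394 Mt CO2 per TWh.
mtco2_low = twh_low * 0.394
mtco2_high = twh_high * 0.394

# The table's implied equivalence: ~1 acre of U.S. forest per tCO2/year
# (an assumption read off the table, not an EPA figure).
acres_low_millions = mtco2_low
```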

Research Rule

Identify the task first; decompose it second; decide whether reasoning is justified third.

Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.

Method

The task-energy model is additive

The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.

Base visible inference: E_base × max(responses, tokens / reference tokens)
Reasoning add-on: max(0, E_reason_total - E_base) × responses
Search add-on: 0.30 Wh × search calls
Cloud carbon: CO2 = (E / 1000) × 0.394 kg/kWh
Reasoning replay carbon 45.310 kgCO2

115,000.2 Wh across 10,000 pilot rows.

Task-matched carbon 10.867 kgCO2

27,575.5 Wh after matching each task to its minimum sufficient execution tier.

Cloud carbon saved 34.445 kgCO2

Pilot-scale value; platform-scale value is shown in the scale-up section.
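The additive formulas above can be sketched as one function using the central-scenario coefficients. The 1,000-token reference length is an assumption for illustration; the page states the formula but not the reference value.

```python
# Additive task-energy model, central scenario. REF_TOKENS is an
# assumed reference length, not a figure stated on the page.
E_BASE = 0.85      # Wh, frontier base work unit
E_REASON = 6.5     # Wh, total for one reasoning response
E_SEARCH = 0.30    # Wh per search call
GRID = 0.394       # kg CO2 per kWh, EPA U.S. average
REF_TOKENS = 1000  # assumed visible-token length per base work unit

def task_energy_wh(responses, tokens, reasoning=False, searches=0):
    # Base visible inference: E_base x max(responses, tokens / reference).
    base = E_BASE * max(responses, tokens / REF_TOKENS)
    # Reasoning add-on: max(0, E_reason_total - E_base) x responses.
    reason = max(0.0, E_REASON - E_BASE) * responses if reasoning else 0.0
    # Search add-on: 0.30 Wh per search call.
    return base + reason + E_SEARCH * searches

def carbon_kg(wh):
    # Cloud carbon: CO2 = (E / 1000) x 0.394 kg/kWh.
    return wh / 1000 * GRID
```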

Appendix

Data coverage and frontier coefficient derivation

The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.

GPT-4o anchor 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

Frontier central base 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

Frontier reasoning total 6.5 Wh

Central estimate when reasoning/test-time compute is invoked.

Heavy frontier path 1.5 Wh / 10 Wh

Heavy standard base / reasoning total for Pro-style paths.
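The central and heavy standard bases follow from the GPT-4o anchor and the active-compute multipliers (2.5x and 4.4x, per the assumption table below); a two-line derivation:

```python
# Frontier base coefficients derived from the public GPT-4o anchor.
GPT4O_ANCHOR = 0.34               # Wh, upper end of the 0.31-0.34 Wh anchor
central_base = GPT4O_ANCHOR * 2.5 # = 0.85 Wh, central standard base
heavy_base = GPT4O_ANCHOR * 4.4   # ~1.5 Wh, heavy standard base
```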

Replay Assumption

Estimate active compute; do not pretend we have vendor telemetry.

GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.

Question | Answer | Treatment in the page | Why acceptable | Residual risk
Did we re-ask GPT-4? | No | Counterfactual energy replay | Observed tasks and token structure are fixed | Not a direct output-quality experiment
Is GPT-4 the same as Pro? | No | Separate central and heavy scenarios | OpenAI distinguishes standard, Thinking, and Pro-style modes | Exact hidden compute is not public
Why estimate at all? | Production telemetry is closed | Use Wh/request and test-time scaling anchors | Inference energy is driven by tokens, active compute, and serving efficiency | Point estimates need sensitivity ranges

Frontier assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent route | Interpretation
Frontier central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main sensitivity used in the page headline
Frontier heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy routes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows

Why The Estimate Holds

The claim is comparative before it is absolute.

The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.

Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 3,326 | 2023-2024 | Observed demand | Replay through the frontier tier model
GPT-4 early / turbo / preview | 1,174 | 2023-2024 | Observed demand | Replay with frontier central/heavy
GPT-4o | 3,107 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 718 | 2024-2025 | Small-model route evidence | Lower-compute alternative
o1 reasoning | 1,278 | 2024 | Reasoning route evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 397 | 2025 | Observed demand | Replay with frontier central/heavy

Sources

Sources and anchors for the calculation