Pilot design before the headline
We start with 10,000 labeled chat episodes.
The pilot draws 10,000 chat episodes from WildChat-4.8M, a public dataset of real user-chatbot interactions. The unit of analysis is the conversation episode. Each row is counted once, first interpreted as an actual user task, then decomposed into the capabilities needed to answer it.
Data And Labels
Each chat is converted into a task decomposition
For each row, Gemini first identifies the actual task the user is trying to complete. It then decomposes the task into needed capabilities: lookup, reading, writing, calculation, local software, small-model generation, standard LLM synthesis, reasoning, tools, or expert judgment. The final execution bucket is a consequence of that decomposition, not the starting assumption.
- Inputs: WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
- Task identification: what the user is really trying to accomplish, not just the words in the prompt.
- Decomposition: break the task into lookup, reading, generation, calculation, software, reasoning, tool, and expert components (one decomposed record is sketched below).
- Energy comparison: estimate GPT-5-class answering energy, then compare it with the lowest sufficient execution bucket.
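For concreteness, here is a minimal sketch of one decomposed row. The field names, bucket labels, and the example task are illustrative assumptions, not the pilot's actual schema.

```python
# Illustrative record shape for one decomposed chat episode.
# Field names and example values are assumptions, not the pilot schema.
from dataclasses import dataclass, field

CAPABILITIES = [
    "lookup", "reading", "writing", "calculation", "local_software",
    "small_model_generation", "standard_llm_synthesis", "reasoning",
    "tools", "expert_judgment",
]

@dataclass
class TaskDecomposition:
    actual_task: str                                        # what the user really wants done
    capabilities: list[str] = field(default_factory=list)   # subset of CAPABILITIES
    execution_bucket: str = ""                              # derived last, never assumed first

row = TaskDecomposition(
    actual_task="convert a CSV of timestamps to local time",
    capabilities=["calculation", "local_software"],
    execution_bucket="local_software",                      # consequence of the decomposition
)
```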
Core Message
The 28.8% number is a counterfactual replay result.
It comes from replaying the same 10,000 tasks two ways: GPT-5.5 central answering every task costs 35,453.0 Wh, while task-matched execution costs 25,257.2 Wh. The difference is 10,195.8 Wh, so 10,195.8 / 35,453.0 ≈ 28.8%. "Cutting energy" means changing the execution path after task decomposition, not claiming the GPT-5-class model itself became more efficient. We did not re-query GPT-4 or GPT-5.5; this is an energy replay over observed tasks and published coefficient assumptions.
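The headline number can be checked directly from the two replay totals; a minimal sketch:

```python
# Reproduce the 28.8% saving from the two replay totals (in Wh).
baseline_wh = 35_453.0  # GPT-5.5 central answers every task
matched_wh = 25_257.2   # lowest sufficient bucket per task

saved_wh = baseline_wh - matched_wh  # 10,195.8 Wh
share = saved_wh / baseline_wh       # 0.2876...
print(f"{saved_wh:.1f} Wh saved, {share:.1%} of the baseline")
# -> 10195.8 Wh saved, 28.8% of the baseline
```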
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
- Counterfactual baseline: every task is answered through the GPT-5.5 central path.
- Task-matched plan: after decomposition, each task uses the lowest sufficient execution bucket.
- Savings share: (35,453.0 − 25,257.2) / 35,453.0 = 10,195.8 / 35,453.0 ≈ 28.8%.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the part of GPT-5-class answering energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks one narrow question: for the same task mix, how much does cloud inference energy change if we do not send every task through the same GPT-5.5 central path?
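Written out with our own symbols (the page itself states the equation only numerically), with E_central the GPT-5.5 central replay total and E_matched the task-matched total:

$$
S = \frac{E_{\mathrm{central}} - E_{\mathrm{matched}}}{E_{\mathrm{central}}}
  = \frac{35{,}453.0 - 25{,}257.2}{35{,}453.0}
  \approx 28.8\%
$$

The numerator is only the cloud execution avoided by re-planning; training energy and human time never enter the expression.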
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baseline assumes every task goes through a GPT-5.5 central cloud path. The alternative uses the task decomposition to choose the lowest sufficient bucket. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed into base visible inference, multiple model responses, a reasoning add-on, a search add-on, and a tool add-on. Then each actual task is compared with the lowest sufficient execution bucket: local software, direct search, search plus reading, small model, standard LLM, reasoning/agent, or expert. One possible functional form is written out after the list below.
- Savings share: (35,453.0 Wh GPT-5.5 replay − 25,257.2 Wh task-matched execution) / 35,453.0 Wh ≈ 28.8%.
- Carbon factor: EPA 0.394 kgCO2/kWh average electricity factor.
- Heavy sensitivity: same task decomposition, heavier GPT-5.5 active-compute path.
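One way to write the additive model implied above, using our own symbols; whether each add-on is a flat total or scales per step is an assumption of this sketch:

$$
E_{\mathrm{row}} = E_{\mathrm{base}} \cdot n_{\mathrm{resp}}
 + E_{\mathrm{reason}} \cdot \mathbb{1}[\mathrm{reasoning}]
 + E_{\mathrm{search}} \cdot n_{\mathrm{query}}
 + E_{\mathrm{tool}} \cdot n_{\mathrm{tool}}
$$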
Interpretation
Search is one route, not the whole story.
Averages differ because the conversations differ in length, turn count, and model route. Under GPT-5.5 central, direct-search cases average 1.87 Wh on the GPT-5.5 path versus 0.410 Wh on the search path. Small-model cases show the largest gap: 2.28 Wh on GPT-5.5 versus 0.016 Wh on the assigned small-model execution path.
Scale-Up
Small savings per chat become TWh-scale at global AI volume
The pilot saves 1.02 to 1.67 Wh per row under the GPT-5.5 central and heavy assumptions. Scaling that routing intensity shows what the opportunity looks like at platform and global volume; the sketch after the list below walks through the arithmetic.
- Approximate 2026 world population (around 8.3B).
- OpenAI public usage anchor (more than 800M weekly users).
- GPT-5.5 central to GPT-5.5 heavy range (1.02 to 1.67 Wh saved per row).
- EPA U.S. average electricity factor (0.394 kgCO2/kWh).
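To make the TWh claim concrete, an illustrative projection; the per-user query rate is a loudly hypothetical input, not a pilot or source figure:

```python
# Illustrative scale-up arithmetic. QUERIES_PER_USER_PER_DAY is a
# hypothetical assumption; the other constants restate the anchors above.
SAVED_PER_ROW_WH = (1.02, 1.67)  # pilot per-row savings, central to heavy
WEEKLY_USERS = 800e6             # OpenAI public usage anchor
QUERIES_PER_USER_PER_DAY = 5     # HYPOTHETICAL volume assumption
GRID_KG_PER_KWH = 0.394          # EPA U.S. average electricity factor

rows_per_year = WEEKLY_USERS * QUERIES_PER_USER_PER_DAY * 365
for saved_wh in SAVED_PER_ROW_WH:
    twh = saved_wh * rows_per_year / 1e12  # Wh -> TWh
    mt_co2 = twh * GRID_KG_PER_KWH         # 1 TWh = 1e9 kWh; kg -> Mt cancels the 1e9
    print(f"{saved_wh} Wh/row -> {twh:.2f} TWh/yr, {mt_co2:.2f} Mt CO2/yr")
# -> 1.02 Wh/row -> 1.49 TWh/yr, 0.59 Mt CO2/yr
# -> 1.67 Wh/row -> 2.44 TWh/yr, 0.96 Mt CO2/yr
```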
Research Rule
Identify the task first; decompose it second; replay GPT-5-class energy third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
GPT-5.5 central uses a 0.85 Wh base response-equivalent and a 6.5 Wh reasoning total. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor. A code sketch of this model follows the list below.
- GPT-5.5 central replay: 35,453.0 Wh across 10,000 pilot rows.
- Task-matched execution: 25,257.2 Wh after matching each task to its lowest sufficient execution bucket.
- Both are pilot-scale values; platform-scale values are shown in the scale-up section.
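The central coefficients above translate into a short additive function; the call signature is our illustration of the model, not the pilot's code:

```python
# Additive task-energy model with the central coefficients from this section.
BASE_WH = 0.85           # GPT-5.5 base response-equivalent
REASONING_WH = 6.5       # total when reasoning is invoked
SEARCH_WH = 0.30         # per search query
GRID_KG_PER_KWH = 0.394  # EPA U.S. average grid factor

def task_energy_wh(n_responses: int, reasoning: bool, n_search_queries: int = 0) -> float:
    """Per-row energy under the GPT-5.5 central plan."""
    return (BASE_WH * n_responses
            + (REASONING_WH if reasoning else 0.0)
            + SEARCH_WH * n_search_queries)

def task_carbon_kg(energy_wh: float) -> float:
    """Cloud electricity times the EPA U.S. average grid factor."""
    return energy_wh / 1000 * GRID_KG_PER_KWH

print(task_energy_wh(n_responses=2, reasoning=True, n_search_queries=1))  # 8.5
```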
Appendix
Data coverage and GPT-5.5 coefficient derivation
The main story uses a GPT-5.5 replay. The details below show where the pilot conversations came from and how the GPT-5.5 energy assumptions are anchored.
- Standard-query anchor: public estimate from Epoch/OpenAI-era figures.
- Base response-equivalent: 0.34 Wh × 2.5 active-compute multiplier (worked below).
- Reasoning total: central estimate when GPT-5.5 reasoning is invoked.
- Heavy scenario: heavy standard base / reasoning totals for Pro-style paths.
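The central base coefficient follows directly from the anchor and multiplier above:

$$
E_{\mathrm{base}} = 0.34\ \mathrm{Wh} \times 2.5 = 0.85\ \mathrm{Wh}
$$

which matches the base response-equivalent used in the Method section.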
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. It is a calibrated counterfactual, not a measurement of OpenAI's internal serving stack.
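Summarized as a small configuration; the key names are ours for illustration, and the values restate the product-surface figures quoted above:

```python
# Replay assumption set. Key names are illustrative; values restate the
# figures quoted in this section, not vendor telemetry.
GPT55_REPLAY_ASSUMPTIONS = {
    "context_tokens": 1_050_000,
    "max_output_tokens": 128_000,
    "reasoning_tokens": True,
    "usd_per_m_input_tokens": 5.0,
    "usd_per_m_output_tokens": 30.0,
    # Drivers of the calibrated counterfactual:
    "drivers": ["active_compute", "token_length", "reasoning_steps",
                "retrieval", "tool_loops"],
}
```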
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference entirely. That is why the page reports central and heavy sensitivity scenarios rather than a single claimed exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M: public dataset of real user-chatbot interactions