Pilot design before the headline
We start with 300 real LLM conversations.
The pilot draws 300 conversations from WildChat-4.8M, a public dataset of real user-chatbot interactions. After duplicate classified rows are removed, 295 unique conversations remain; that is the count this page reports. Each conversation is classified by the task it performs and by the lowest-compute route that could plausibly satisfy it.
Data and Labels
Each chat is converted into a routing decision
For each conversation, the classifiers label the user's intent, the observed route, the lowest sufficient route, feasibility under a five-minute human-time constraint, and energy inputs such as visible tokens, model responses, reasoning, search, and tool calls. A sketch of one labeled record follows the list below.
- Inputs: WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
- Task labels: search, writing, coding, calculation, local software, reasoning, tool workflow, or not comparable.
- Route labels: direct search, local tool, small model, standard model, reasoning model, expert, or tool agent.
- Replay rule: replay the same task under GPT-5.5 central/heavy and switch only if the extra human effort is ≤ 5 minutes.
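A minimal sketch of one labeled pilot record, mirroring the fields in the list above. The field names are hypothetical; the page does not publish a schema.

```python
from dataclasses import dataclass

@dataclass
class RoutedConversation:
    # WildChat inputs
    prompt: str                # user prompt
    answer: str                # assistant answer
    model: str                 # model name from WildChat metadata
    # Classifier labels
    task: str                  # e.g. "search", "writing", "coding", ...
    observed_route: str        # route the conversation actually took
    lowest_route: str          # lowest-compute route that still suffices
    extra_minutes: float       # estimated extra human effort on that route
    feasible: bool             # extra_minutes <= 5
    # Energy inputs
    n_responses: int           # visible model responses
    used_reasoning: bool       # reasoning tokens invoked
    n_searches: int            # retrieval queries
    n_tool_calls: int          # tool-workflow calls
```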
Core Message
A third of GPT-5.5 chat energy is a routing choice.
In this pilot, 197 of 295 unique conversations have a lower-compute route that stays within the five-minute constraint. Under GPT-5.5 central assumptions, rerouting those tasks cuts cloud energy from 1188.1 Wh to 787.6 Wh, a 33.7% reduction before any change to model architecture; the arithmetic is reproduced below.
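The headline number is checkable from the two totals reported on this page:

```python
status_quo_wh = 1188.1   # GPT-5.5 central replay, all 295 conversations
policy_wh = 787.6        # after feasible switching under the 5-minute rule

saved_wh = status_quo_wh - policy_wh        # 400.5 Wh
reduction = saved_wh / status_quo_wh        # 0.337...
print(f"{saved_wh:.1f} Wh saved, {reduction:.1%} reduction")  # 33.7%
```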
Energy Multipliers
Use one additive route model before comparing alternatives
Every conversation is decomposed into a base visible-inference term, additional model responses, a reasoning add-on, a search add-on, and a tool add-on. Each task is then compared against the lowest-compute route that still keeps extra human effort under five minutes; a sketch of the decomposition follows the figures below.
- Energy saved: 1188.1 Wh status quo minus 787.6 Wh under the five-minute policy.
- Carbon: uses the EPA 0.394 kgCO2/kWh U.S. average electricity factor.
- Sensitivity: heavy GPT-5.5 active-compute path with the same routing labels and time constraint.
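A minimal sketch of the additive model, using the GPT-5.5 central coefficients from the Method section. The tool-call coefficient and the way the reasoning total composes with the base are assumptions; the page does not spell them out.

```python
BASE_WH = 0.85       # per visible response, GPT-5.5 central
REASONING_WH = 6.5   # reasoning-path total, GPT-5.5 central
SEARCH_WH = 0.30     # per retrieval query
TOOL_WH = 0.85       # per tool call: illustrative assumption only

def conversation_energy_wh(n_responses: int, used_reasoning: bool,
                           n_searches: int, n_tool_calls: int) -> float:
    """Additive route model: base responses + reasoning + search + tools."""
    energy = n_responses * BASE_WH
    if used_reasoning:
        # Treat 6.5 Wh as the total for a reasoning response, so it
        # replaces one base response rather than stacking on it (assumption).
        energy += REASONING_WH - BASE_WH
    energy += n_searches * SEARCH_WH
    energy += n_tool_calls * TOOL_WH
    return energy

# e.g. a two-response reasoning chat with one search:
print(conversation_energy_wh(2, True, 1, 0))  # 0.85*2 + 5.65 + 0.30 = 7.65 Wh
```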
Interpretation
Search is one route, not the whole story.
Averages differ because the conversations differ in length and number of model responses. Under GPT-5.5 central, direct-search cases average 1.00 Wh, local-tool cases average 2.27 Wh, and reasoning-downshift cases average 7.54 Wh. Search adds another 0.30 Wh per query when retrieval is actually used.
Scale-Up
Small savings per chat become TWh-scale at global AI volume
The pilot saves 1.36 to 2.31 Wh per conversation under GPT-5.5 central and heavy assumptions, respectively. Scaling that routing intensity shows what the opportunity looks like at platform and global volume; an illustrative projection follows the anchors below.
- Approximate 2026 world population (~8.3B).
- OpenAI public usage anchor (more than 800M weekly users).
- GPT-5.5 central to GPT-5.5 heavy savings range (1.36 to 2.31 Wh per conversation).
- EPA U.S. average electricity factor (0.394 kgCO2/kWh).
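An illustrative projection under loudly labeled assumptions: the per-conversation savings come from the pilot and the 800M weekly-user figure from OpenAI, while the traffic rate (conversations per user per day) is a hypothetical placeholder, not a published number.

```python
SAVINGS_WH = (1.36, 2.31)        # GPT-5.5 central to heavy, from the pilot
WEEKLY_USERS = 800e6             # OpenAI public usage anchor
CONVS_PER_USER_PER_DAY = 1.0     # assumption for illustration only

annual_convs = WEEKLY_USERS * CONVS_PER_USER_PER_DAY * 365
for wh in SAVINGS_WH:
    twh = annual_convs * wh / 1e12          # Wh -> TWh
    print(f"{wh} Wh/conversation -> {twh:.2f} TWh/year")
# roughly 0.40 to 0.67 TWh/year under these placeholder assumptions
```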
Policy Rule
Minimize cloud energy subject to human time ≤ 5 minutes.
Human time is the constraint, not a carbon term. A lower-compute route is used only when it keeps the user's estimated extra effort within the five-minute budget; a minimal selection rule is sketched below.
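A minimal sketch of the rule: among routes whose estimated extra human effort fits the five-minute budget, take the one with the lowest cloud energy. Route names and numbers here are illustrative.

```python
TAU_MINUTES = 5.0  # human-time budget

def choose_route(candidates):
    """candidates: (name, cloud_energy_wh, extra_human_minutes) tuples."""
    feasible = [c for c in candidates if c[2] <= TAU_MINUTES]
    # The status quo costs no extra human time, so feasible is never
    # empty when it is included; otherwise stay put.
    return min(feasible, key=lambda c: c[1]) if feasible else None

routes = [
    ("gpt55_reasoning", 6.5, 0.0),   # status quo path
    ("standard_model", 0.85, 1.0),   # downshift, small verification cost
    ("direct_search", 0.30, 3.0),    # user searches instead
]
print(choose_route(routes))  # ('direct_search', 0.3, 3.0)
```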
Method
The route model is additive
GPT-5.5 central uses a 0.85 Wh base response-equivalent and a 6.5 Wh reasoning total. Search adds 0.30 Wh per query. Carbon is electricity multiplied by the EPA U.S. average grid factor; the conversion is worked below the figures.
- Status quo: 1188.1 Wh across 295 unique pilot conversations.
- Five-minute policy: 787.6 Wh after feasible switching under τ = 5.
- Scope: pilot-scale values; platform-scale values appear in the scale-up section.
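The carbon step is a single multiplication by the grid factor; applying it to the two pilot totals:

```python
GRID_KGCO2_PER_KWH = 0.394   # EPA U.S. average electricity factor

def carbon_kg(energy_wh: float) -> float:
    return energy_wh / 1000.0 * GRID_KGCO2_PER_KWH

print(carbon_kg(1188.1))   # status quo:         ~0.468 kgCO2
print(carbon_kg(787.6))    # five-minute policy: ~0.310 kgCO2
```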
Appendix
Data coverage and GPT-5.5 coefficient derivation
The main story uses a GPT-5.5 replay. The details below show where the pilot conversations came from and how the GPT-5.5 energy assumptions are anchored.
- Standard-query anchor: 0.34 Wh, a public figure from Epoch/OpenAI-era estimates.
- Central base: 0.34 Wh × 2.5 active-compute multiplier = 0.85 Wh per response-equivalent.
- Reasoning: 6.5 Wh central estimate when GPT-5.5 reasoning is invoked.
- Heavy path: a heavier standard base and reasoning total for Pro-style paths.
Parameter Proxy
Define GPT-5.5 by active compute.
GPT-5.5 is treated as a larger product surface than GPT-4o: a 1,050,000-token context window, 128,000 max output tokens, reasoning-token support, and $5/$30 per million input/output tokens. The replay model keys on active compute, token length, reasoning steps, retrieval, and tool loops; the parameters are collected into a single sketch below.
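The proxy parameters restated in one place. The dict keys are illustrative labels, not an API.

```python
GPT55_PROXY = {
    "context_tokens": 1_050_000,
    "max_output_tokens": 128_000,
    "reasoning_tokens": True,
    "price_usd_per_mtok": {"input": 5.0, "output": 30.0},
    "active_compute_multiplier": 2.5,  # vs. the 0.34 Wh standard anchor
}
```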
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI GPT-5.5 release: more capable and fewer tokens on Codex tasks
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M: public dataset of real user-chatbot conversations; source of the pilot sample