Pilot design before the headline
We start with 10,000 labeled chat episodes.
The pilot draws 10,000 chat episodes from WildChat-4.8M, a public dataset of real user-chatbot interactions. The unit of analysis is the conversation episode. Each row is counted once, first interpreted as an actual user task, then decomposed into the capabilities needed to answer it.
Data And Labels
Each chat is converted into a task decomposition
For each row, Gemini first identifies the actual task the user is trying to complete. It then decomposes the task into needed capabilities: deterministic tools, search/lookup, small-model language work, standard frontier generation, long context, reasoning/test-time compute, tool agents, or expert judgment. The final tier is a consequence of that decomposition, not the starting assumption.
WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
What the user is really trying to accomplish, not just the words in the prompt.
Break the task into lookup, reading, generation, calculation, software, reasoning, tool, and expert components.
Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
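The capability-to-tier step described above can be sketched as a simple rule: an episode's minimum sufficient tier is the highest tier that any of its needed capabilities demands. The capability names and tier numbering below are illustrative, not the pilot's actual schema.

```python
# Hypothetical sketch of the minimum-sufficient-tier rule. Names are
# illustrative placeholders for the pilot's decomposition labels.
CAPABILITY_TO_TIER = {
    "deterministic_tool": 0,    # T0: no-LLM tool/search/API
    "search_lookup": 0,
    "small_language": 1,        # T1: small LLM
    "frontier_generation": 2,   # T2: standard frontier
    "long_context": 3,          # T3: long-context frontier
    "reasoning": 4,             # T4: reasoning/test-time compute
    "tool_agent": 5,            # T5: LLM agent with tools
    "expert_judgment": 6,       # excluded / not comparable
}

def minimum_sufficient_tier(capabilities: list[str]) -> int:
    """The episode's tier is the highest tier any needed capability demands."""
    return max(CAPABILITY_TO_TIER[c] for c in capabilities)

# A lookup-plus-writing chat needs only a small model:
print(minimum_sufficient_tier(["search_lookup", "small_language"]))  # 1
# A chat that genuinely needs multi-step reasoning lands in T4:
print(minimum_sufficient_tier(["small_language", "reasoning"]))      # 4
```

The `max` over capabilities is what makes the final tier a consequence of the decomposition rather than a starting assumption: a task only reaches T4 if some component actually demands reasoning.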
Example Display
Prompt examples are original excerpts.
WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 650 characters for readability. The episode id points back to the full local record.
Core Message
The headline depends on which frontier path people overuse.
The same 10,000 tasks save 10.0% if the baseline is standard frontier replay for every task. They save 76.0% if the baseline is reasoning-frontier replay for every task. That distinction is the research point: routing ordinary tasks away from reasoning/test-time compute matters far more than routing every task away from frontier models. We did not re-query GPT-4 or GPT-5.5; this is an energy replay over observed tasks and published coefficient assumptions.
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
Counterfactual baseline: every task is answered through a standard frontier path.
Stress baseline: every task is pushed through reasoning/test-time compute.
Minimum sufficient tiers: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion.
2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.
87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.
3,537 / 10,000 tasks need language capability, but not frontier execution.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed into base visible inference, additional model responses, and reasoning, search, and tool add-ons. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.
(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.
(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.
Uses the EPA 0.394 kgCO2/kWh U.S. average electricity factor.
Heavy frontier reasoning replay minus task-matched execution.
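Both comparisons reduce to the same ratio: avoidable energy over baseline energy. A minimal check of the quoted figures, with the carbon conversion applied to the reasoning-replay savings:

```python
# Reproducing the two headline ratios from the figures quoted above.
standard_replay = 28_259.3       # Wh, every task via standard frontier
matched_vs_standard = 25_420.7   # Wh, task-matched plan in that comparison
reasoning_replay = 115_000.2     # Wh, every task via reasoning frontier
matched_vs_reasoning = 27_575.5  # Wh, task-matched plan in that comparison

savings_std = (standard_replay - matched_vs_standard) / standard_replay
savings_rsn = (reasoning_replay - matched_vs_reasoning) / reasoning_replay
print(f"{savings_std:.1%}")  # 10.0%
print(f"{savings_rsn:.1%}")  # 76.0%

# Carbon: cloud electricity times the EPA U.S. average grid factor.
EPA_KG_CO2_PER_KWH = 0.394
saved_kwh = (reasoning_replay - matched_vs_reasoning) / 1000
print(f"{saved_kwh * EPA_KG_CO2_PER_KWH:.1f} kg CO2")  # 34.4 kg CO2
```

Note that the two task-matched totals differ: under the standard baseline, T4/T5 quality upgrades add energy back, which is why that comparison saves only 10.0%.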
Interpretation
Search is one route, not the whole story.
The biggest sustainability mistake is not using a frontier model for every task; it is using reasoning compute for ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.
Scale-Up
Small savings per chat become TWh-scale at global AI volume
If the avoidable behavior is reasoning-frontier overuse, task matching saves 8.74 to 13.20 Wh per row under the central and heavy sensitivity cases. Scaling that routing intensity shows the platform-level stakes.
Approximate 2026 world population.
OpenAI public usage anchor.
Reasoning-frontier replay to task-matched execution.
EPA U.S. average electricity factor.
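Given the per-row savings above, platform-scale totals follow from a volume assumption. The 1 billion prompts/day figure below is purely illustrative (the page's only public anchor is OpenAI's 800M weekly users), so treat the TWh output as an order-of-magnitude sketch:

```python
# Per-row savings from the pilot, then an illustrative platform-scale
# projection. PROMPTS_PER_DAY is an assumption, not a measured figure.
central_wh_per_row = 87_424.7 / 10_000  # 8.74 Wh, central case
heavy_wh_per_row = 13.20                # Wh, heavy sensitivity case (given)

PROMPTS_PER_DAY = 1e9  # hypothetical global volume
for wh in (central_wh_per_row, heavy_wh_per_row):
    twh_per_year = wh * PROMPTS_PER_DAY * 365 / 1e12
    print(f"{wh:.2f} Wh/row -> {twh_per_year:.2f} TWh/year")
```

At that assumed volume the central case is roughly 3.2 TWh/year of avoidable cloud inference; the point is the routing intensity, not the exact volume.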
Research Rule
Identify the task first; decompose it second; decide whether reasoning is justified third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.
115,000.2 Wh across 10,000 pilot rows.
27,575.5 Wh after matching each task to its minimum sufficient execution tier.
Pilot-scale value; platform-scale value is shown in the scale-up section.
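The additive model can be sketched as one function over the central coefficients stated above. The signature and flag names are hypothetical; the pilot's actual row schema may differ.

```python
# A minimal sketch of the additive task-energy model, central scenario.
FRONTIER_BASE_WH = 0.85        # standard frontier response
REASONING_RESPONSE_WH = 6.5    # reasoning/test-time compute response
SMALL_WH_PER_1K_TOKENS = 0.04  # Gemma-class small-model visible tokens
SEARCH_WH_PER_QUERY = 0.30     # per search/lookup query
EPA_KG_CO2_PER_KWH = 0.394     # U.S. average grid factor

def task_energy_wh(responses=1, reasoning=False, searches=0,
                   small_model=False, visible_ktokens=0.0):
    """Additive energy: base inference plus reasoning and search add-ons."""
    if small_model:
        base = SMALL_WH_PER_1K_TOKENS * visible_ktokens
    else:
        base = FRONTIER_BASE_WH * responses
    if reasoning:
        base += REASONING_RESPONSE_WH
    return base + SEARCH_WH_PER_QUERY * searches

# One frontier response with a search lookup:
print(f"{task_energy_wh(searches=1):.2f}")  # 1.15
# The same task routed to a small model over 2,000 visible tokens:
print(f"{task_energy_wh(small_model=True, visible_ktokens=2, searches=1):.2f}")  # 0.38
```

Carbon then follows as `task_energy_wh(...) / 1000 * EPA_KG_CO2_PER_KWH`, which is the only conversion the page applies.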
Appendix
Data coverage and frontier coefficient derivation
The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.
Public standard-query anchor from Epoch/OpenAI-era estimates.
0.34 Wh × 2.5 active-compute multiplier.
Central estimate when reasoning/test-time compute is invoked.
Heavy standard base / reasoning total for Pro-style paths.
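The coefficient chain above is a single multiplication, checked here for the central case:

```python
# The 0.85 Wh frontier base is the 0.34 Wh public standard-query anchor
# scaled by the 2.5x active-compute multiplier.
STANDARD_QUERY_WH = 0.34
ACTIVE_COMPUTE_MULTIPLIER = 2.5
frontier_base_wh = STANDARD_QUERY_WH * ACTIVE_COMPUTE_MULTIPLIER
print(f"{frontier_base_wh:.2f} Wh")  # 0.85 Wh
```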
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; and local deterministic tools avoid cloud inference entirely. That is why the page reports central and heavy sensitivity scenarios rather than a single exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Google Gemma 4: E2B/E4B small open models and 26B MoE active 3.8B parameters
- Google AI for Developers: Gemma 4 model sizes, context windows, and memory requirements
- FrugalGPT: cascaded model selection for lower cost
- RouteLLM: routing queries between cheaper and stronger LLMs
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M