AI Sustainability

The Electricity Cost of Everyday AI

2025 re-run: 1,000 real chat episodes

Actual model choice changes the sustainability story.

The earlier 10,000-row pilot asked what level of AI each task could have been handled by. The 2025 re-run adds the missing denominator: which model the user actually used. In this sample, almost everyone used OpenAI mini-tier non-reasoning models, which are not the same thing as local small models. The key question is no longer only over-use of frontier reasoning; it is whether the observed model choice is matched to task difficulty.

Source: 2025 WildChat-4.8M (public conversations restricted to calendar year 2025)
Sample rows: 1,000 (random 2025 sample, not the earlier stratified 10,000-row panel)
Actual model mix: 99.2% (observed conversations used OpenAI mini-tier non-reasoning models)
Classification: Codex (same tier schema as before, now paired with observed model family)

Validation sits inside the first page

The pilot label is not the proof.

  1. Gold labels: two human annotators read the actual task and adjudicate the AI-use level.
  2. Run outputs: search/tool, Gemma/Qwen-class small model, standard frontier, and reasoning paths answer matched cases.
  3. Blind quality: only non-inferior answers count as lower-compute sufficient.
  4. Conservative rule: uncertain or high-stakes tasks stay high-compute or move to expert review (see the sketch after this list).
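
A minimal sketch of how the conservative rule can be applied per case follows; the function name, labels, and inputs are illustrative assumptions, not the study's exact criteria.

```python
# Illustrative sketch of the conservative sufficiency rule (assumed labels and names).
def credit_route(blind_quality: str, high_stakes: bool, uncertain: bool) -> str:
    """
    blind_quality: outcome of the blind comparison against the higher-compute answer,
                   one of "non_inferior", "inferior", or "unclear".
    Returns which bucket the case is credited to.
    """
    if high_stakes:
        return "expert_review"             # never counted toward lower-compute savings
    if uncertain or blind_quality == "unclear":
        return "high_compute"              # conservative default
    if blind_quality == "non_inferior":
        return "lower_compute_sufficient"  # only these count as savings
    return "high_compute"                  # inferior answers stay high-compute
```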

2025 Observational Re-Run

Compare what users used with what the task appeared to require

The new unit is an actual-vs-required pair. For each 2025 conversation, we keep the observed model name and classify the minimum sufficient route. This is closer to the paper's empirical object: mismatch between model intensity and task demand.
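
A minimal sketch of the actual-vs-required pairing is below. The ordinal ranks, in particular treating OpenAI mini-tier as roughly T1-level intensity and excluding T6, are assumptions chosen to reproduce the shares reported on this page, not the project's exact scoring code.

```python
# Sketch: classify each conversation as over-, under-, or matched-compute.
ACTUAL_RANK = {"openai_mini_nonreasoning": 1, "large_nonreasoning": 2, "reasoning": 3}
REQUIRED_RANK = {"T0_no_llm": 0, "T1_small_llm": 1, "T2_T3_large": 2, "T4_T5_reasoning_agent": 3}

def classify_pair(actual_family: str, required_tier: str) -> str:
    if required_tier == "T6_expert":   # high-stakes / not comparable: excluded
        return "excluded"
    a, r = ACTUAL_RANK[actual_family], REQUIRED_RANK[required_tier]
    return "over_compute" if a > r else "under_compute" if a < r else "matched"

def shares(pairs):
    """pairs: iterable of (actual_family, required_tier) tuples; returns label -> share of rows."""
    labels = [classify_pair(a, r) for a, r in pairs]
    return {lab: labels.count(lab) / len(labels) for lab in set(labels)}
```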

Actual observed energy 179.8 Wh

Estimated from the model the user actually used: almost entirely GPT-4.1-mini and GPT-4o-mini.

Task-matched energy 269.5 Wh

Estimated after routing each task to its Codex-assigned minimum sufficient tier.

Net difference -49.9%

Negative savings: in this 2025 sample, strict task matching would upgrade more work than it would route down.

Over-compute 7.0%

Actual model intensity is above the required tier, excluding expert/not-comparable cases.

Under-compute 11.2%

Actual model intensity is below the required tier; these cases need output-quality validation.

Reasoning actually used 0.2%

Only 2 of the 1,000 sampled 2025 conversations used o1-preview; o1-mini would also count as a reasoning model if it appeared.

Naming Rule

OpenAI mini-tier is not the paper's small-model tier.

Actual model family describes the product the user used: GPT-4o-mini and GPT-4.1-mini are OpenAI mini-tier non-reasoning models. Required tier describes the minimum sufficient execution path: T1 Small LLM means a smaller local/open model such as Gemma, Qwen, Phi, or Llama-class. o1, o1-mini, and o1-preview are reasoning models, not small-model routes.
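
One way to keep the two vocabularies apart is an explicit mapping from observed product names to actual-family labels; the mapping below is illustrative only, and the required-tier taxonomy stays on a separate axis.

```python
# Illustrative mapping: observed model name -> "actual family" label.
# The required-tier taxonomy (T0-T6) is a separate axis and is never derived from this.
ACTUAL_FAMILY = {
    "gpt-4o-mini":  "openai_mini_nonreasoning",
    "gpt-4.1-mini": "openai_mini_nonreasoning",
    "gpt-4o":       "large_nonreasoning",
    "o1":           "reasoning",
    "o1-mini":      "reasoning",   # would count as reasoning if it were observed
    "o1-preview":   "reasoning",
}
```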

Actual model family | Count | Share | Interpretation
OpenAI mini-tier non-reasoning | 992 | 99.2% | GPT-4.1-mini and GPT-4o-mini are lower-cost OpenAI product tiers, not Gemma-class local small models.
Large non-reasoning | 6 | 0.6% | Small tail of GPT-4o-class use remains in the 2025 public sample.
o1-preview reasoning model | 2 | 0.2% | Reasoning overuse cannot be inferred from this 2025 natural sample alone.

2025 Full Distribution

Reasoning appears only in the January o1 rows in this public 2025 data

Across 863,322 unique 2025 WildChat conversations available locally, the observed date range is 2025-01-01 to 2025-07-31. The last available month is July 2025, and a 1,000-row July sample is entirely GPT-4.1-mini. This means the natural 2025 sample is a poor test bed for mass reasoning overuse, but a useful test bed for whether mini-tier answers are already adequate or sometimes under-powered.

Jan 60,948

94.2% mini-tier, 3.5% GPT-4o, 2.4% o1 reasoning.

Feb 172,743

100% GPT-4o-mini.

Mar 54,409

100% GPT-4o-mini.

Apr 222,847

Switch month: 89.0% GPT-4.1-mini, 11.0% GPT-4o-mini.

May 229,582

100% GPT-4.1-mini.

Jun 77,655

100% GPT-4.1-mini.

Jul 45,138

100% GPT-4.1-mini; July random sample n=1,000.
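
The monthly cards above are a straightforward tabulation of observed model name by month; a minimal sketch follows, assuming per-conversation "timestamp" and "model_name" fields.

```python
# Sketch of the monthly model-mix tabulation (assumed column names; pandas for brevity).
import pandas as pd

def monthly_mix(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["month"] = pd.to_datetime(df["timestamp"]).dt.to_period("M")
    counts = df.groupby(["month", "model_name"]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0).round(3)  # share of each model per month
```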

Required tier | Count | Share | Meaning | Implication
T1 Small LLM | 789 | 78.9% | Simple generation, rewriting, short explanation, extraction | Candidate for Gemma/Qwen-class validation; actual OpenAI mini is related but not identical.
T2/T3 Large non-reasoning | 112 | 11.2% | Standard frontier or long-context quality appears useful | These are potential under-compute cases if mini output is not good enough.
T0 No LLM: tool/search/API | 67 | 6.7% | Calculator, search, local software, specialized tool | This is the cleanest over-compute margin.
T4/T5 Reasoning or agent | 5 | 0.5% | Reasoning/test-time compute or tool orchestration | Rare in this natural 2025 sample.
T6 Expert / not comparable | 27 | 2.7% | High-stakes, unsafe, ambiguous, or expert-review tasks | Excluded from simple savings claims.

Actual family | n | No LLM | Small LLM | Large non-reasoning | Reasoning | Agent / Expert
OpenAI mini-tier non-reasoning | 992 | 67 | 786 | 108 | 2 | 29
Large non-reasoning | 6 | 0 | 2 | 4 | 0 | 0
o1-preview reasoning model | 2 | 0 | 1 | 0 | 1 | 0

Interpretation Change

For 2025 WildChat, the immediate problem is not mass reasoning use.

The earlier 10,000-row stratified panel is still useful for stress-testing reasoning and frontier overuse. The 2025 natural sample says something different: observed users are already mostly on OpenAI mini-tier non-reasoning models. The next empirical question is whether those answers are good enough, whether some should be upgraded, and which tasks can move further down to search, local tools, or true small local models.

Earlier Stratified Pilot

The 10,000-row run is now a stress-test panel, not the main 2025 estimate

This older panel deliberately mixed model-proportional rows with reasoning, multi-turn, non-English, and edge-case strata. It is useful for seeing the full route taxonomy and stress-testing reasoning-heavy replay, but it should not be read as the natural 2025 model-use distribution.

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Actual task

What the user is really trying to accomplish, not just the words in the prompt.

3 Compute demand

Ask what level of AI would have been enough if the user had chosen deliberately.

4 Energy replay

Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
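
A sketch of the per-conversation record that moves through steps 1-4 is shown below; the field names are illustrative assumptions, not the pilot's exact schema.

```python
# Illustrative record for one WildChat episode as it flows through the pipeline.
from dataclasses import dataclass

@dataclass
class Episode:
    episode_id: str          # points back to the full local WildChat record
    model_name: str          # step 1: observed model, e.g. "gpt-4o-mini"
    timestamp: str
    language: str
    turn_count: int
    visible_tokens: int
    actual_task: str         # step 2: what the user is really trying to accomplish
    required_tier: str       # step 3: minimum sufficient tier, "T0a" .. "T6"
    replay_wh: float = 0.0   # step 4: energy under a frontier/reasoning replay plan
    matched_wh: float = 0.0  # step 4: energy under the task-matched plan
```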

Execution tier | Count | Share | Meaning / example
T0a No LLM: local tool/software | 113 | 1.1% | Calculator, spreadsheet, local script, or software workflow
T0b No LLM: search/lookup | 516 | 5.2% | Direct reference lookup or search result is enough
T0c No LLM: specialized tool/API | 15 | 0.2% | Non-LLM API, app, database, or specialized calculator
T1 Small LLM | 3,537 | 35.4% | Gemma-class simple translation, rewrite, extraction, short generation
T2 Frontier standard | 4,790 | 47.9% | Normal frontier writing, explanation, coding, and synthesis
T3 Frontier + long context | 109 | 1.1% | Frontier model mainly because context length or fidelity is the bottleneck
T4 Frontier + reasoning | 528 | 5.3% | Frontier model plus reasoning/test-time compute
T5 LLM agent + tools | 151 | 1.5% | LLM must orchestrate search, tools, APIs, or actions
T6 Expert / not comparable | 241 | 2.4% | High-stakes, unsafe, unclear, or excluded from simple substitution

Example Display

Prompt examples are original excerpts.

WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.
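
The clipping rule is simple; a minimal sketch follows (the 1,500-character clip is from the text above, the function name is an illustrative assumption).

```python
# Sketch of the example-card clipping rule.
def example_card(prompt: str, episode_id: str, clip: int = 1500) -> dict:
    return {
        "episode_id": episode_id,          # points back to the full local record
        "raw_prompt_chars": len(prompt),   # always report the true length
        "excerpt": prompt[:clip] + ("…" if len(prompt) > clip else ""),
    }
```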

Core Message

The problem is not using AI. The problem is using too much AI for the task.

In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: AI is an intensity choice, and users need information that helps them conserve high-compute AI when it is not needed.

Counterfactual Accounting

Decompose first, then compare the same tasks two ways

Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.

1. Standard frontier replay 28,259.3 Wh

Counterfactual baseline: every task is answered through a standard frontier path.

2. Reasoning frontier replay 115,000.2 Wh

Stress baseline: every task is pushed through reasoning/test-time compute.

3. Task-matched execution 25,420.7 Wh

Careful-use benchmark: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion.

Saved vs standard frontier 10.0%

2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.

Saved vs reasoning frontier 76.0%

87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.

Small-model opportunity 35.4%

3,537 / 10,000 tasks need language capability, but not frontier execution.

Why This Equation

The numerator is avoidable cloud execution, not total AI energy.

The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?

Replay Definition

This is not asking GPT-4 to answer again.

"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.

Task tier | Reasoning replay | Task-matched | Saved vs reasoning | Share of reasoning savings | Why energy changes
T0a No LLM: local tool/software | 915.7 Wh | 0.0 Wh | 915.7 Wh | 1.0% | Deterministic local execution avoids cloud inference.
T0b No LLM: search/lookup | 4,949.8 Wh | 154.8 Wh | 4,794.9 Wh | 5.5% | Known facts and references are lower-compute lookup tasks.
T0c No LLM: specialized tool/API | 180.5 Wh | 0.8 Wh | 179.7 Wh | 0.2% | Domain tools replace general model generation.
T1 Small LLM | 32,750.7 Wh | 206.2 Wh | 32,544.5 Wh | 37.2% | Simple language work moves to Gemma-class small models.
T2 Frontier standard | 60,985.4 Wh | 15,470.7 Wh | 45,514.7 Wh | 52.1% | These tasks need frontier quality, but not reasoning mode.
T3 Frontier + long context | 2,273.4 Wh | 1,079.5 Wh | 1,193.9 Wh | 1.4% | Context/fidelity matters; reasoning is not the main bottleneck.
T4 Frontier + reasoning | 6,924.7 Wh | 6,924.7 Wh | 0.0 Wh | 0.0% | Reasoning/test-time compute is justified here.
T5 LLM agent + tools | 3,404.6 Wh | 1,123.4 Wh | 2,281.2 Wh | 2.6% | Tool orchestration is needed, but not every step is reasoning.
T6 Expert / not comparable | 2,615.4 Wh | 2,615.4 Wh | 0.0 Wh | 0.0% | Excluded from simple cloud-substitution savings.
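
The per-tier rows above are group-by sums over per-row estimates; a minimal aggregation sketch, with assumed field names, is below.

```python
# Sketch: aggregate per-row energy estimates into the per-tier savings table.
from collections import defaultdict

def tier_savings(rows):
    """rows: iterable of dicts with 'tier', 'reasoning_wh', and 'matched_wh' keys."""
    agg = defaultdict(lambda: {"reasoning": 0.0, "matched": 0.0})
    for r in rows:
        agg[r["tier"]]["reasoning"] += r["reasoning_wh"]
        agg[r["tier"]]["matched"] += r["matched_wh"]
    total_saved = sum(v["reasoning"] - v["matched"] for v in agg.values())
    return {
        tier: {
            "saved_wh": v["reasoning"] - v["matched"],
            "share_of_savings": (v["reasoning"] - v["matched"]) / total_saved if total_saved else 0.0,
        }
        for tier, v in agg.items()
    }
```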

Energy Multipliers

Use one additive task-energy model before comparing alternatives

Every pilot row is decomposed as base visible inference, multiple model responses, reasoning add-on, search add-on, and tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.

Saved vs standard frontier 2,838.6 Wh

(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.

Saved vs reasoning frontier 87,424.7 Wh

(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.

Reasoning replay CO2 cut 34.445 kg

Uses EPA 0.394 kgCO2/kWh average electricity factor.

Heavy reasoning sensitivity 132,009.4 Wh

Heavy frontier reasoning replay minus task-matched execution.
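
The headline percentages and the CO2 cut follow directly from the totals stated above; the short check below reuses those numbers together with the EPA factor of 0.394 kgCO2/kWh.

```python
# Reproducing the headline arithmetic from the totals stated on this page.
STANDARD_REPLAY_WH = 28_259.3
REASONING_REPLAY_WH = 115_000.2
MATCHED_VS_STANDARD_WH = 25_420.7
MATCHED_VS_REASONING_WH = 27_575.5   # matched total reported in the reasoning comparison

saved_std = STANDARD_REPLAY_WH - MATCHED_VS_STANDARD_WH      # 2,838.6 Wh
saved_rsn = REASONING_REPLAY_WH - MATCHED_VS_REASONING_WH    # 87,424.7 Wh
print(f"vs standard : {saved_std:,.1f} Wh ({saved_std / STANDARD_REPLAY_WH:.1%})")   # ~10.0%
print(f"vs reasoning: {saved_rsn:,.1f} Wh ({saved_rsn / REASONING_REPLAY_WH:.1%})")  # ~76.0%
print(f"CO2 cut     : {saved_rsn / 1000 * 0.394:.3f} kg")                            # ~34.445 kg
```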

Execution tier | Share | Reasoning avg | Matched avg | Multiplier | Saved vs reasoning
T0a No LLM: local tool/software | 113 / 10,000 (1.1%) | 8.104 Wh | ~0 cloud Wh | local / 0 | 915.7 Wh
T0b No LLM: search/lookup | 516 / 10,000 (5.2%) | 9.593 Wh | 0.300 Wh | 32.0x | 4,794.9 Wh
T1 Small LLM | 3,537 / 10,000 (35.4%) | 9.259 Wh | 0.058 Wh | 159.0x | 32,544.5 Wh
T2 Frontier standard | 4,790 / 10,000 (47.9%) | 12.732 Wh | 3.230 Wh | 3.9x | 45,514.7 Wh
T3 Frontier + long context | 109 / 10,000 (1.1%) | 20.857 Wh | 9.904 Wh | 2.1x | 1,193.9 Wh
T4 Frontier + reasoning | 528 / 10,000 (5.3%) | 13.115 Wh | 13.115 Wh | 1.0x | 0.0 Wh
T5 LLM agent + tools | 151 / 10,000 (1.5%) | 22.547 Wh | 7.440 Wh | 3.0x | 2,281.2 Wh
T6 Expert / not comparable | 241 / 10,000 (2.4%) | 10.852 Wh | 10.852 Wh | excluded | 0.0 Wh

Interpretation

Search is one lower-intensity option, not the whole story.

The biggest sustainability mistake is not using a frontier model for every task; it is using reasoning compute for ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.

Scale-Up

We need two benchmarks, not one headline number

The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, careful use saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, careful use saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.

Human population 8.3B

Approximate 2026 world population.

ChatGPT weekly users 800M+

OpenAI public usage anchor.

Saved per pilot row 0.28-13.20 Wh

Lower end is standard-frontier replay; upper end is reasoning-frontier replay.

Conversion 0.394 kg/kWh

EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.

Benchmark A

All standard frontier

2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.

Benchmark B

All reasoning frontier

11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.

Benchmark | Replay energy | Task-matched energy | Saved per row | Reduction
All standard frontier | 2.83-4.99 Wh | 2.54-4.30 Wh | 0.28-0.68 Wh | 10.0-13.7%
All reasoning frontier | 11.50-17.83 Wh | 2.76-4.63 Wh | 8.74-13.20 Wh | 74.0-76.0%

Scale scenario | Conversations/year | All standard frontier: total / saved | All reasoning frontier: total / saved | Reasoning saved CO2 / forest
800M weekly users × 5 chats/week | 208B | 0.59-1.04 / 0.06-0.14 TWh | 2.39-3.71 / 1.82-2.75 TWh | 0.72-1.08 MtCO2 / 0.93-1.40M acres
1B users × 5 chats/day | 1.83T | 5.16-9.10 / 0.52-1.24 TWh | 20.99-32.54 / 15.95-24.09 TWh | 6.28-9.49 MtCO2 / 8.16-12.33M acres
10B AI chats/day globally | 3.65T | 10.32-18.20 / 1.04-2.49 TWh | 41.98-65.07 / 31.89-48.18 TWh | 12.56-18.98 MtCO2 / 16.32-24.65M acres
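
The scale rows are straight multiplications of conversations per year by saved Wh per row, converted with the EPA grid factor and the forest-storage figure quoted above; a minimal sketch of that arithmetic follows, using the 208B-row scenario as the example.

```python
# Sketch of the scale-up arithmetic (constants copied from the text above).
GRID_KG_PER_KWH = 0.394      # EPA U.S. average electricity factor
FOREST_T_PER_ACRE = 0.77     # tCO2 stored per acre of forest per year

def scale(conversations_per_year: float, saved_wh_per_row: float) -> dict:
    saved_twh = conversations_per_year * saved_wh_per_row / 1e12         # Wh -> TWh
    saved_mtco2 = saved_twh * 1e9 * GRID_KG_PER_KWH / 1e9                # kWh x kg/kWh -> Mt
    forest_m_acres = saved_mtco2 * 1e6 / FOREST_T_PER_ACRE / 1e6         # t / (t/acre) -> M acres
    return {"TWh": round(saved_twh, 2), "MtCO2": round(saved_mtco2, 2), "M_acres": round(forest_m_acres, 2)}

# 800M weekly users x 5 chats/week ~= 208B conversations/year,
# reasoning-replay savings of 8.74-13.20 Wh per row.
print(scale(208e9, 8.74))    # ~1.82 TWh, ~0.72 MtCO2, ~0.93M acres
print(scale(208e9, 13.20))   # ~2.75 TWh, ~1.08 MtCO2, ~1.40M acres
```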

Research Rule

Identify the task first; decompose it second; decide whether reasoning is justified third.

Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.

Method

The task-energy model is additive

The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.

Base visible inference: E_base × max(responses, tokens / reference)
Reasoning add-on: max(0, E_reason_total - E_base) × responses
Search add-on: 0.30 Wh × search calls
Cloud carbon: CO2 (kg) = E (Wh) / 1000 × 0.394 kg/kWh
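
A minimal implementation of this additive model under the central scenario is sketched below. The coefficients are the ones stated above; the reference token length and the function signatures are illustrative assumptions, not the pilot's exact code.

```python
# Sketch of the additive per-row task-energy model (central scenario).
E_BASE_WH = 0.85          # frontier base work unit
E_REASON_TOTAL_WH = 6.5   # total Wh for a reasoning response
E_SEARCH_WH = 0.30        # per search/lookup query
SMALL_WH_PER_1K = 0.04    # T1 Gemma-class small model, per 1,000 visible tokens
GRID_KG_PER_KWH = 0.394   # EPA U.S. average grid factor
REFERENCE_TOKENS = 1000   # assumed visible tokens covered by one base work unit

def frontier_row_wh(responses: int, visible_tokens: int,
                    reasoning: bool = False, search_calls: int = 0) -> float:
    base = E_BASE_WH * max(responses, visible_tokens / REFERENCE_TOKENS)
    reasoning_addon = max(0.0, E_REASON_TOTAL_WH - E_BASE_WH) * responses if reasoning else 0.0
    return base + reasoning_addon + E_SEARCH_WH * search_calls

def small_model_row_wh(visible_tokens: int) -> float:
    return SMALL_WH_PER_1K * visible_tokens / 1000

def carbon_kg(wh: float) -> float:
    return wh / 1000 * GRID_KG_PER_KWH
```
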
Reasoning replay carbon 45.310 kgCO2

115,000.2 Wh across 10,000 pilot rows.

Task-matched carbon 10.867 kgCO2

27,575.5 Wh after matching each task to its minimum sufficient execution tier.

Cloud carbon saved 34.445 kgCO2

Pilot-scale value; platform-scale value is shown in the scale-up section.

Appendix

Data coverage and frontier coefficient derivation

The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.

GPT-4o anchor 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

Frontier central base 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

Frontier reasoning total 6.5 Wh

Central estimate when reasoning/test-time compute is invoked.

Heavy frontier path 1.5 Wh / 10 Wh

Heavy standard base / reasoning total for Pro-style paths.

Replay Assumption

Estimate active compute; do not pretend we have vendor telemetry.

GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.

Question | Answer | Treatment in the page | Why acceptable | Residual risk
Did we re-ask GPT-4? | No | Counterfactual energy replay | Observed tasks and token structure are fixed | Not a direct output-quality experiment
Is GPT-4 the same as Pro? | No | Separate central and heavy scenarios | OpenAI distinguishes standard, Thinking, and Pro-style modes | Exact hidden compute is not public
Why estimate at all? | Production telemetry is closed | Use Wh/request and test-time scaling anchors | Inference energy is driven by tokens, active compute, and serving efficiency | Point estimates need sensitivity ranges

Frontier assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent mode | Interpretation
Frontier central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main sensitivity used in the page headline
Frontier heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy modes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows

Why The Estimate Holds

The claim is comparative before it is absolute.

The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.

Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 3,326 | 2023-2024 | Observed demand | Replay through the frontier tier model
GPT-4 early / turbo / preview | 1,174 | 2023-2024 | Observed demand | Replay with frontier central/heavy
GPT-4o | 3,107 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 718 | 2024-2025 | Small-model evidence | Lower-compute alternative
o1 reasoning | 1,278 | 2024 | Reasoning evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 397 | 2025 | Observed demand | Replay with frontier central/heavy

Sources

Sources and anchors for the calculation