AI Sustainability

The Electricity Cost of Everyday AI

Three-stratum model-use study

How much electricity is wasted when everyday AI tasks are run on more model than they need?

The study starts from a simple concern: users may be sending ordinary tasks to unnecessarily expensive AI modes. We therefore do not read WildChat's raw model distribution as truth. Instead, we use public 2025 conversations to build three model-use strata, classify what each task actually needed, and compute how much electricity a user could save by switching down when the task allows.

Frontier reasoning: 500 o1/o-series style rows; estimate how often reasoning was actually necessary
Frontier non-reasoning: 500 GPT-4o/GPT-4.1/GPT-4.5 style rows; estimate when standard frontier can be downshifted
Small model: 500 GPT-4o-mini/GPT-4.1-mini proxy rows plus local small-model validation
Mixture scenario: 30/60/10 example aggregate (reasoning / frontier / small), varied in sensitivity checks
Gemini classified: 1,500 rows; all three model-use strata now have minimum-sufficient-tier labels

First-Page Setup

Sample by model used, classify by model needed, then price the gap.

  1. Sample: Take 500 rows from each 2025 model-use class.
  2. Classify: Ask whether the task needs reasoning, frontier, small model, search, or local tool.
  3. Calculate: Within each model class, compute average avoidable Wh and percentage saving.
  4. Aggregate: Combine model-class savings under NBER-inspired mixture scenarios.

Operational Design

For each model class, estimate what share of tasks could have used less compute

The unit is a conversation row with a visible model class. For each row, we classify the true task demand: whether reasoning was needed, whether a smaller model was enough, whether local search/tool use was enough, or whether the original model class was justified. This produces a within-class saving curve rather than a claim about the true global frequency of each model.

Model class used | Rows | Examples | Quantity estimated
Frontier Reasoning | 500 | o1, o1-preview, o3, o4-high style usage | Average Wh saved if non-reasoning, small model, search, or tool use was enough.
Frontier Non-Reasoning | 500 | GPT-4o, GPT-4.1, GPT-4.5 style usage | Average Wh saved if small model, search, or local tool use was enough.
Small Model | 500 | GPT-4o-mini / GPT-4.1-mini as public proxies; Gemma/Qwen/Phi/Llama for validation | Average Wh saved from local search/tool substitution, and how often small models are under-powered.

Gemini Classified Results

The three model classes produce very different savings margins

These are actual Gemini labels from the 1,500-row model-stratified sample. The key result is not a single global number. It is a conditional statement: if a user is currently using a given model class, how often could the task have been handled by a lower-compute route?

Frontier Reasoning 3.89 Wh

Average avoidable electricity per reasoning-row under central assumptions; 57.2% of chosen-model energy.

Frontier Non-Reasoning 0.70 Wh

Average avoidable electricity per frontier-row; 13.3% of chosen-model energy.

Small Model 1.50 Wh

Average upgrade pressure per small-row; many small-model rows appear to need frontier quality.

Model class used | Local/search/tool | Small model enough | Frontier non-reasoning needed | Reasoning needed | Agent/expert
Frontier Reasoning | 5.2% | 21.0% | 46.4% | 19.0% | 8.4%
Frontier Non-Reasoning | 10.4% | 22.2% | 59.4% | 1.8% | 6.2%
Small Model | 7.6% | 61.8% | 28.0% | 0.4% | 2.2%

Model class used | Avg chosen Wh | Avg avoidable Wh | Interpretation
Frontier Reasoning | 6.80 | 3.89 (57.2%) | Most of the savings come from reasoning rows that only need standard frontier, small model, or lookup/tool execution.
Frontier Non-Reasoning | 5.24 | 0.70 (13.3%) | The downshift margin exists, but is smaller because many rows still need standard frontier quality.
Small Model | 0.13 | 0.00 (0.1%) | Small rows do not save much by downshifting further; the main issue is upgrade pressure, not waste.

Classifier Questions

The labels should answer exactly the switching question users face

We are not classifying broad topics for their own sake. The classification decides the lowest sufficient execution path for a task, so the electricity comparison is meaningful.

Required route | Decision question | Example task | If original was reasoning | If original was frontier | If original was small
Local tool / local search | Could search, calculator, spreadsheet, local script, or lookup solve it? | Weather, arithmetic, exact fact lookup | Large saving | Moderate saving | Small saving
Small model | Could a small model handle the language task without quality loss? | Translation, short rewrite, extraction | Large saving | Moderate saving | No saving
Frontier non-reasoning | Does the task need frontier quality but not reasoning? | Substantial writing, moderate coding, synthesis | Reasoning overhead saved | No saving | Upgrade / rework risk
Frontier reasoning | Does it truly need test-time reasoning? | Hard math, complex debugging, deep planning | No saving | Upgrade needed | Upgrade needed
Expert / excluded | Is this high-stakes, unsafe, or not a simple substitution? | Medical, legal, cybersecurity release decisions | Do not score | Do not score | Do not score

Within-Class Output

Each model class gets its own saving estimate.

The reasoning stratum is the clearest over-compute margin: only 19.0% of sampled reasoning rows were classified as truly requiring reasoning, while 46.4% needed only frontier non-reasoning, 21.0% needed only a small model, and 5.2% could be handled by local search/tool paths. The frontier non-reasoning stratum has a smaller but still meaningful downshift margin. The small-model stratum mostly tells a different story: some tasks should be upgraded, so the paper must report quality-adjusted savings rather than energy alone.

Electricity Calculation

Compute average savings inside each model class, then combine classes

For each class used by the user, we estimate both avoidable electricity and upgrade pressure. Avoidable electricity counts cases where a lower-compute route is sufficient. Upgrade pressure counts cases where the observed model class was too weak and quality would require more compute.

Required-route shares p[m,r] = share of model class m classified as required route r
Average avoidable Wh A[m] = sum_r p[m,r] × max(E[m] - E[r], 0)
Upgrade pressure U[m] = sum_r p[m,r] × max(E[r] - E[m], 0)
Scenario aggregate A = 0.30 A[R] + 0.60 A[F] + 0.10 A[S]
Scenario ingredient | Example value | What it means | Sensitivity
q[R] | 30% | Share of interactions assumed to use Frontier Reasoning | Vary from low to reasoning-heavy scenarios.
q[F] | 60% | Share assumed to use Frontier Non-Reasoning | Vary with paid/default product assumptions.
q[S] | 10% | Share assumed to use Small Model / mini-tier path | Vary as small-model adoption grows.
Total saving | q[R]A[R] + q[F]A[F] + q[S]A[S] | Average avoidable Wh per chat under the scenario | Report central, low, and high cases.

Mixture result | Central | Heavy | Meaning
Avg chosen electricity | 5.20 Wh/chat | 8.76 Wh/chat | Expected electricity under the 30/60/10 model-use scenario.
Avg avoidable electricity | 1.58 Wh/chat | 2.52 Wh/chat | Expected saving if users switch to the lowest sufficient route.
Avoidable percentage | 30.5% | 28.7% | Scenario-level energy reduction from task-rightsizing.
Avg upgrade pressure | 0.21 Wh/chat | 0.34 Wh/chat | Extra electricity needed where the chosen model appears too weak.
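The central mixture row can be reproduced directly from the per-class averages. A minimal sketch (variable names are illustrative; the Wh values are this page's central Gemini-classified estimates):

```python
# Central per-class averages from the Gemini-classified sample (Wh per chat).
chosen = {"R": 6.80, "F": 5.24, "S": 0.13}      # avg chosen-model energy
avoidable = {"R": 3.89, "F": 0.70, "S": 0.00}   # avg avoidable energy

# 30/60/10 model-use scenario weights q[m].
q = {"R": 0.30, "F": 0.60, "S": 0.10}

avg_chosen = sum(q[m] * chosen[m] for m in q)        # expected Wh per chat
avg_avoidable = sum(q[m] * avoidable[m] for m in q)  # expected saving per chat
pct = 100 * avg_avoidable / avg_chosen

print(f"chosen {avg_chosen:.2f} Wh, avoidable {avg_avoidable:.2f} Wh, {pct:.1f}%")
# Central scenario row reports 5.20 Wh, 1.58 Wh, 30.5%; tiny gaps come from
# rounding in the per-class inputs.
```

The heavy case works the same way with the heavy per-class averages substituted in.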

How NBER Enters

NBER anchors the usage mix; our scenario anchors the model mix.

OpenAI/NBER tells us representative ChatGPT use is not mostly coding or benchmarks: it is dominated by Practical Guidance, Writing, and Seeking Information topics, spread across Asking, Doing, and Expressing intents. We use those task-mix margins to keep the 500-row strata from becoming unrepresentative. The 30/60/10 model mixture is not claimed as a measured fact; it is a transparent scenario that can be varied.

OpenAI/NBER Calibration Bridge

The 1,000 rows now speak the same taxonomy as OpenAI Signals

We classified the same 2025 WildChat sample into the public OpenAI/NBER usage taxonomy: work-related, work/school/other, Asking/Doing/Expressing, 24 fine topics, and seven coarse topics. This lets us compare our public-text sample against the OpenAI Signals aggregate margins before making any population claim.

Classified rows 1,000

Codex classified public WildChat messages with up to 10 prior messages as context, matching the paper's message-level logic.

Signals anchor 300k/mo

OpenAI Signals publishes aggregate shares from monthly consumer ChatGPT samples, not row-level messages.

Key implication Calibrate

The raw public sample is biased toward Doing/Writing, so it should be raked to Signals margins before headline estimates.

Ground Truth Split

Signals gives population margins, not private message rows.

The internal 1.1M ChatGPT sample is not publicly released. The public ground truth is aggregate: topic, work-related status, work/school/other, and Asking/Doing/Expressing shares. We use those margins to reweight public WildChat/ShareChat text, while requesting the paper's 100,000 classified WildChat validation sample from the authors.

Coarse topic | WildChat 1,000 | OpenAI Signals 2025 avg | Direction
Writing | 37.3% | 26.9% | Public sample high: translation and rewriting are over-represented.
Practical Guidance | 7.2% | 28.7% | Public sample low: everyday advice/tutoring is under-represented.
Seeking Information | 9.1% | 18.1% | Public sample low: search-like use needs upweighting.
Self-Expression | 17.9% | 8.4% | Public sample high: roleplay/chitchat are selected into public logs.
Technical Help | 9.2% | 6.8% | Slightly high, but close enough to use with weights.
Multimedia | 1.6% | 6.2% | Public text logs miss image/media use.
Other/Unknown | 17.7% | 4.9% | Public sample has many unclear or prompt-fragment rows.
Dimension | WildChat 1,000 | OpenAI Signals 2025 avg | What it means
Asking | 17.2% | 43.2% | Decision support and information-seeking are much larger in representative ChatGPT use.
Doing | 71.7% | 36.2% | The public sample overstates direct output-production tasks.
Expressing | 11.1% | 20.7% | Representative usage has more expression than this sample captures.
Work-related | 22.2% | 34.5% | Work use is under-represented in this 2025 public WildChat slice.
Schoolwork | 7.0% | 19.5% | Education use must be separately calibrated.

Usage Taxonomy × Compute Demand

After joining taxonomy with tiers, the compute story varies sharply by topic

This bridge table is the first version of the paper's empirical spine: user demand on the left, required execution tier on the right. It is still uncalibrated, but it shows where validation should focus.

Coarse topic | n | Small LLM | Frontier standard | No LLM | Long / reasoning / agent
Writing | 373 | 85.3% | 9.7% | 3.2% | 1.6%
Self-Expression | 179 | 88.8% | 1.7% | 0.6% | 0.0%
Other/Unknown | 177 | 95.5% | 1.7% | 0.6% | 0.0%
Technical Help | 92 | 33.7% | 39.1% | 20.7% | 5.5%
Seeking Information | 91 | 62.6% | 5.5% | 29.7% | 1.1%
Practical Guidance | 72 | 58.3% | 25.0% | 6.9% | 4.2%
Multimedia | 16 | 81.2% | 6.2% | 12.5% | 0.0%

Next Calculation

Do not report the raw 1,000-row shares as population facts.

The right next statistic is a calibrated estimate: reweight this labeled public sample to OpenAI Signals topic, intent, and work/school margins, then recompute no-LLM share, small-model-sufficient share, frontier-standard share, and reasoning-needed share. A separate model-stress panel should handle GPT-4o, o1/o3, GPT-4.5, and o4-mini cases because the natural 2025 WildChat sample is almost entirely mini-tier.
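The reweighting step can be illustrated with a one-margin version of that calibration. This sketch rakes only the coarse-topic margin to the Signals shares from the table above; the real pipeline would iterate topic, intent, and work/school margins together (iterative proportional fitting), and the names here are illustrative:

```python
# One-margin raking sketch: give each labeled public row a weight so that
# coarse-topic shares match the OpenAI Signals margins. Row counts are from
# the WildChat 1,000 table; Signals shares are the published 2025 averages.
sample = {"Writing": 373, "Practical Guidance": 72, "Seeking Information": 91,
          "Self-Expression": 179, "Technical Help": 92, "Multimedia": 16,
          "Other/Unknown": 177}
signals = {"Writing": 0.269, "Practical Guidance": 0.287,
           "Seeking Information": 0.181, "Self-Expression": 0.084,
           "Technical Help": 0.068, "Multimedia": 0.062,
           "Other/Unknown": 0.049}

n = sum(sample.values())
# weight = target share / observed share, applied to every row of that topic
weights = {t: signals[t] / (sample[t] / n) for t in sample}

# A Practical Guidance row is upweighted roughly 4x; a Writing row is
# downweighted to roughly 0.7x before recomputing the tier shares.
print({t: round(w, 2) for t, w in weights.items()})
```

Any statistic recomputed under these weights (no-LLM share, small-model-sufficient share, and so on) is then a calibrated estimate rather than a raw public-sample share.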

Earlier Stratified Pilot

The 10,000-row run is now a stress-test panel, not the main 2025 estimate

This older panel deliberately mixed model-proportional rows with reasoning, multi-turn, non-English, and edge-case strata. It is useful for seeing the full route taxonomy and stress-testing reasoning-heavy replay, but it should not be read as the natural 2025 model-use distribution.

1 Raw conversation

WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.

2 Actual task

What the user is really trying to accomplish, not just the words in the prompt.

3 Compute demand

Ask what level of AI would have been enough if the user had chosen deliberately.

4 Energy replay

Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.

Execution tier | Count | Share | Meaning and example
T0a No LLM: local tool/software | 113 | 1.1% | Calculator, spreadsheet, local script, or software workflow
T0b No LLM: search/lookup | 516 | 5.2% | Direct reference lookup or search result is enough
T0c No LLM: specialized tool/API | 15 | 0.2% | Non-LLM API, app, database, or specialized calculator
T1 Small LLM | 3,537 | 35.4% | Gemma-class simple translation, rewrite, extraction, short generation
T2 Frontier standard | 4,790 | 47.9% | Normal frontier writing, explanation, coding, and synthesis
T3 Frontier + long context | 109 | 1.1% | Frontier model mainly because context length or fidelity is the bottleneck
T4 Frontier + reasoning | 528 | 5.3% | Frontier model plus reasoning/test-time compute
T5 LLM agent + tools | 151 | 1.5% | LLM must orchestrate search, tools, APIs, or actions
T6 Expert / not comparable | 241 | 2.4% | High-stakes, unsafe, unclear, or excluded from simple substitution

Example Display

Prompt examples are original excerpts.

WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.

Core Message

The problem is not using AI. The problem is using too much AI for the task.

In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: AI is an intensity choice, and users need information that helps them conserve high-compute AI when it is not needed.

Counterfactual Accounting

Decompose first, then compare the same tasks two ways

Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.

1. Standard frontier replay 28,259.3 Wh

Counterfactual baseline: every task is answered through a standard frontier path.

2. Reasoning frontier replay 115,000.2 Wh

Stress baseline: every task is pushed through reasoning/test-time compute.

3. Task-matched execution 25,420.7 Wh

Careful-use benchmark: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion.

Saved vs standard frontier 10.0%

2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.

Saved vs reasoning frontier 76.0%

87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.

Small-model opportunity 35.4%

3,537 / 10,000 tasks need language capability, but not frontier execution.

Why This Equation

The numerator is avoidable cloud execution, not total AI energy.

The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?

Replay Definition

This is not asking GPT-4 to answer again.

"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.

Task tier | Reasoning replay | Task-matched | Saved vs reasoning | Share of reasoning savings | Why energy changes
T0a No LLM: local tool/software | 915.7 Wh | 0.0 Wh | 915.7 Wh | 1.0% | Deterministic local execution avoids cloud inference.
T0b No LLM: search/lookup | 4,949.8 Wh | 154.8 Wh | 4,794.9 Wh | 5.5% | Known facts and references are lower-compute lookup tasks.
T0c No LLM: specialized tool/API | 180.5 Wh | 0.8 Wh | 179.7 Wh | 0.2% | Domain tools replace general model generation.
T1 Small LLM | 32,750.7 Wh | 206.2 Wh | 32,544.5 Wh | 37.2% | Simple language work moves to Gemma-class small models.
T2 Frontier standard | 60,985.4 Wh | 15,470.7 Wh | 45,514.7 Wh | 52.1% | These tasks need frontier quality, but not reasoning mode.
T3 Frontier + long context | 2,273.4 Wh | 1,079.5 Wh | 1,193.9 Wh | 1.4% | Context/fidelity matters; reasoning is not the main bottleneck.
T4 Frontier + reasoning | 6,924.7 Wh | 6,924.7 Wh | 0.0 Wh | 0.0% | Reasoning/test-time compute is justified here.
T5 LLM agent + tools | 3,404.6 Wh | 1,123.4 Wh | 2,281.2 Wh | 2.6% | Tool orchestration is needed, but not every step is reasoning.
T6 Expert / not comparable | 2,615.4 Wh | 2,615.4 Wh | 0.0 Wh | 0.0% | Excluded from simple cloud-substitution savings.
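As a consistency check, the per-tier rows sum to the headline replay totals. A minimal sketch with the values copied from the table:

```python
# Per-tier energy (Wh), in tier order T0a..T6, copied from the replay table.
reasoning = [915.7, 4949.8, 180.5, 32750.7, 60985.4, 2273.4,
             6924.7, 3404.6, 2615.4]
matched = [0.0, 154.8, 0.8, 206.2, 15470.7, 1079.5,
           6924.7, 1123.4, 2615.4]

total_reasoning = sum(reasoning)          # headline: 115,000.2 Wh
total_matched = sum(matched)              # headline: 27,575.5 Wh
saved = total_reasoning - total_matched   # headline: 87,424.7 Wh, 76.0%

print(round(total_reasoning, 1), round(total_matched, 1),
      round(100 * saved / total_reasoning, 1))
```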

Energy Multipliers

Use one additive task-energy model before comparing alternatives

Every pilot row is decomposed as base visible inference, multiple model responses, reasoning add-on, search add-on, and tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.

Saved vs standard frontier 2,838.6 Wh

(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.

Saved vs reasoning frontier 87,424.7 Wh

(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.

Reasoning replay CO2 cut 34.445 kg

Uses EPA 0.394 kgCO2/kWh average electricity factor.

Heavy reasoning sensitivity 132,009.4 Wh

Heavy frontier reasoning replay minus task-matched execution.

Execution tier | Share | Reasoning avg | Matched avg | Multiplier | Saved vs reasoning
T0a No LLM: local tool/software | 113 / 10,000 (1.1%) | 8.104 Wh | ~0 cloud Wh | local / 0 | 915.7 Wh
T0b No LLM: search/lookup | 516 / 10,000 (5.2%) | 9.593 Wh | 0.300 Wh | 32.0x | 4,794.9 Wh
T1 Small LLM | 3,537 / 10,000 (35.4%) | 9.259 Wh | 0.058 Wh | 159.0x | 32,544.5 Wh
T2 Frontier standard | 4,790 / 10,000 (47.9%) | 12.732 Wh | 3.230 Wh | 3.9x | 45,514.7 Wh
T3 Frontier + long context | 109 / 10,000 (1.1%) | 20.857 Wh | 9.904 Wh | 2.1x | 1,193.9 Wh
T4 Frontier + reasoning | 528 / 10,000 (5.3%) | 13.115 Wh | 13.115 Wh | 1.0x | 0.0 Wh
T5 LLM agent + tools | 151 / 10,000 (1.5%) | 22.547 Wh | 7.440 Wh | 3.0x | 2,281.2 Wh
T6 Expert / not comparable | 241 / 10,000 (2.4%) | 10.852 Wh | 10.852 Wh | excluded | 0.0 Wh

Interpretation

Search is one lower-intensity option, not the whole story.

The biggest sustainability mistake is not using a frontier model for every task; it is using reasoning compute for ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.

Scale-Up

We need two benchmarks, not one headline number

The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, careful use saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, careful use saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.

Human population 8.3B

Approximate 2026 world population.

ChatGPT weekly users 800M+

OpenAI public usage anchor.

Saved per pilot row 0.28-13.20 Wh

Lower end is standard-frontier replay; upper end is reasoning-frontier replay.

Conversion 0.394 kg/kWh

EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.

Benchmark A

All standard frontier

2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.

Benchmark B

All reasoning frontier

11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.

Benchmark | Replay energy | Task-matched energy | Saved per row | Reduction
All standard frontier | 2.83-4.99 Wh | 2.54-4.30 Wh | 0.28-0.68 Wh | 10.0-13.7%
All reasoning frontier | 11.50-17.83 Wh | 2.76-4.63 Wh | 8.74-13.20 Wh | 74.0-76.0%

Scale scenario | Conversations/year | All standard frontier: total / saved | All reasoning frontier: total / saved | Reasoning saved CO2 / forest
800M weekly users × 5 chats/week | 208B | 0.59-1.04 / 0.06-0.14 TWh | 2.39-3.71 / 1.82-2.75 TWh | 0.72-1.08 MtCO2 / 0.93-1.40M acres
1B users × 5 chats/day | 1.83T | 5.16-9.10 / 0.52-1.24 TWh | 20.99-32.54 / 15.95-24.09 TWh | 6.28-9.49 MtCO2 / 8.16-12.33M acres
10B AI chats/day globally | 3.65T | 10.32-18.20 / 1.04-2.49 TWh | 41.98-65.07 / 31.89-48.18 TWh | 12.56-18.98 MtCO2 / 16.32-24.65M acres
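The first scale row can be reproduced from the per-row bounds and the stated conversion factors. A sketch under this page's assumptions (variable names are illustrative):

```python
# 800M weekly users × 5 chats/week × 52 weeks ≈ 208B conversations/year.
conversations = 800e6 * 5 * 52

low, high = 11.50, 17.83             # all-reasoning replay, Wh per row
saved_low, saved_high = 8.74, 13.20  # task-matching saving, Wh per row

def twh(wh):
    return wh / 1e12                 # Wh → TWh

total = (twh(conversations * low), twh(conversations * high))          # ≈ 2.39-3.71 TWh
saved = (twh(conversations * saved_low), twh(conversations * saved_high))  # ≈ 1.82-2.75 TWh

# TWh → kWh → kgCO2 (EPA 0.394 kg/kWh) → MtCO2, then forest acres at
# 0.77 tCO2 per acre per year.
co2_mt = saved[0] * 1e9 * 0.394 / 1e9
acres_m = co2_mt / 0.77

print(total, saved, round(co2_mt, 2), round(acres_m, 2))  # low-end bounds
```

The other scenario rows follow by swapping in their conversation counts.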

Research Rule

Identify the task first; decompose it second; decide whether reasoning is justified third.

Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.

Method

The task-energy model is additive

The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.

Base visible inference Ebase × max(responses, tokens / reference)
Reasoning add-on max(0, Ereason total - Ebase) × responses
Search add-on 0.30 Wh × search calls
Cloud carbon CO2 = E / 1000 × 0.394 kg/kWh
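A minimal sketch of the additive task-energy model, using the central coefficients above (the function names and argument structure are illustrative, not the paper's code):

```python
# Central coefficients from this page's assumptions, not measurements.
E_BASE = 0.85          # Wh, frontier base work unit
E_REASON_TOTAL = 6.5   # Wh, frontier response when reasoning is invoked
E_SEARCH = 0.30        # Wh per search query
GRID = 0.394           # kgCO2 per kWh, EPA U.S. average

def frontier_wh(responses, tokens, reference=1000, reasoning=False, searches=0):
    """Additive task energy: base visible inference + add-ons."""
    base = E_BASE * max(responses, tokens / reference)   # base visible inference
    reason_add = max(0.0, E_REASON_TOTAL - E_BASE) * responses if reasoning else 0.0
    return base + reason_add + E_SEARCH * searches       # search add-on last

def co2_kg(wh):
    return wh / 1000 * GRID                              # cloud carbon

# Example: one reasoning response of ~800 visible tokens plus one search call.
wh = frontier_wh(responses=1, tokens=800, reasoning=True, searches=1)
print(round(wh, 2), "Wh,", round(co2_kg(wh), 5), "kgCO2")
```

T1 rows instead use 0.04 Wh per 1,000 visible tokens, and the heavy scenario swaps in the 1.5 Wh / 10 Wh frontier path.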
Reasoning replay carbon 45.310 kgCO2

115,000.2 Wh across 10,000 pilot rows.

Task-matched carbon 10.867 kgCO2

27,575.5 Wh after matching each task to its minimum sufficient execution tier.

Cloud carbon saved 34.445 kgCO2

Pilot-scale value; platform-scale value is shown in the scale-up section.

Appendix

Data coverage and frontier coefficient derivation

The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.

GPT-4o anchor 0.31-0.34 Wh

Public standard-query anchor from Epoch/OpenAI-era estimates.

Frontier central base 0.85 Wh

0.34 Wh × 2.5 active-compute multiplier.

Frontier reasoning total 6.5 Wh

Central estimate when reasoning/test-time compute is invoked.

Heavy frontier path 1.5Wh / 10Wh

Heavy standard base / reasoning total for Pro-style paths.

Replay Assumption

Estimate active compute; do not pretend we have vendor telemetry.

GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.

Question | Answer | Treatment in the page | Why acceptable | Residual risk
Did we re-ask GPT-4? | No | Counterfactual energy replay | Observed tasks and token structure are fixed | Not a direct output-quality experiment
Is GPT-4 the same as Pro? | No | Separate central and heavy scenarios | OpenAI distinguishes standard, Thinking, and Pro-style modes | Exact hidden compute is not public
Why estimate at all? | Production telemetry is closed | Use Wh/request and test-time scaling anchors | Inference energy is driven by tokens, active compute, and serving efficiency | Point estimates need sensitivity ranges

Frontier assumption | Active compute multiplier | Standard Wh/request | Reasoning / agent mode | Interpretation
Frontier central | 2.5x GPT-4o anchor | 0.85 Wh | 6.5 Wh reasoning total | Main sensitivity used in the page headline
Frontier heavy | 4.4x GPT-4o anchor | 1.5 Wh | 10 Wh reasoning total | Upper sensitivity for heavy modes
Long-context / agentic path | add by actual tokens/tools | 2.5-40+ Wh | context, retrieval, and tool loops dominate | Use for document-scale and agentic workflows

Why The Estimate Holds

The claim is comparative before it is absolute.

The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.

Pilot model family | Count | Years observed | Role in this page | Treatment
GPT-3.5 | 3,326 | 2023-2024 | Observed demand | Replay through the frontier tier model
GPT-4 early / turbo / preview | 1,174 | 2023-2024 | Observed demand | Replay with frontier central/heavy
GPT-4o | 3,107 | 2024 | Energy anchor | 0.31-0.34 Wh optimized proxy
GPT-4o mini | 718 | 2024-2025 | Small-model evidence | Lower-compute alternative
o1 reasoning | 1,278 | 2024 | Reasoning evidence | Base plus reasoning add-on
GPT-4.1 / GPT-4.1 mini | 397 | 2025 | Observed demand | Replay with frontier central/heavy

Sources

Sources and anchors for the calculation