2025 re-run: 1,000 real chat episodes
Actual model choice changes the sustainability story.
The earlier 10,000-row pilot asked what level of AI could have handled each task. The 2025 re-run adds the missing denominator: the model the user actually used. In this sample, almost everyone used OpenAI mini-tier non-reasoning models, which are not the same thing as local small models. The key question is no longer only overuse of frontier reasoning; it is whether observed model choice is matched to task difficulty.
Validation sits inside the first page
The pilot label is not the proof.
- Gold labels: Two human annotators read the actual task and adjudicate the AI-use level.
- Run outputs: Search/tool, Gemma/Qwen-class small model, standard frontier, and reasoning paths answer matched cases.
- Blind quality: Only non-inferior answers count as lower-compute sufficient.
- Conservative rule: Uncertain or high-stakes tasks stay high-compute or move to expert review.
2025 Observational Re-Run
Compare what users used with what the task appeared to require
The new unit is an actual-vs-required pair. For each 2025 conversation, we keep the observed model name and classify the minimum sufficient route. This is closer to the paper's empirical object: mismatch between model intensity and task demand.
Estimated from the model the user actually used: almost entirely GPT-4.1-mini and GPT-4o-mini.
Estimated after routing each task to its Codex-assigned minimum sufficient tier.
Negative savings: in this 2025 sample, strict task matching would upgrade more work than it downroutes.
Actual model intensity is above the required tier, excluding expert/not-comparable cases.
Actual model intensity is below the required tier; these cases need output-quality validation.
Only 2 of 1,000 sampled 2025 conversations used o1-preview; o1-mini, had it appeared, would also count as a reasoning model.
Naming Rule
OpenAI mini-tier is not the paper's small-model tier.
Actual model family describes the product the user used: GPT-4o-mini and GPT-4.1-mini are OpenAI mini-tier non-reasoning models. Required tier describes the minimum sufficient execution path: T1 Small LLM means a smaller local/open model such as Gemma, Qwen, Phi, or Llama-class. o1, o1-mini, and o1-preview are reasoning models, not small-model routes.
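A minimal sketch of how the actual-vs-required pair can be scored. The name patterns and the intensity ordering are illustrative assumptions, not the study's rule; in particular, placing OpenAI mini-tier between T1 and T2 is a placeholder.

```python
# Illustrative scoring of one conversation's actual-vs-required pair.
REQUIRED_RANK = {          # required tier, least to most intensive
    "T0_no_llm": 0,
    "T1_small_llm": 1,
    "T2_standard_frontier": 3,
    "T3_long_context": 4,
    "T4_reasoning": 5,
    "T5_agent": 6,
}

def actual_rank(model_name: str) -> int:
    """Intensity of the product the user actually used (assumed ordering)."""
    name = model_name.lower()
    if name.startswith("o1"):          # o1, o1-mini, o1-preview: reasoning models
        return REQUIRED_RANK["T4_reasoning"]
    if "mini" in name:                 # GPT-4o-mini / GPT-4.1-mini: mini-tier,
        return 2                       # assumed above T1 small, below T2 frontier
    return REQUIRED_RANK["T2_standard_frontier"]   # GPT-4o and other standard frontier

def provisioning(actual_model: str, required_tier: str) -> str:
    """Label one conversation as over-, under-, or correctly provisioned."""
    diff = actual_rank(actual_model) - REQUIRED_RANK[required_tier]
    if diff > 0:
        return "over_provisioned"      # downroute candidate
    if diff < 0:
        return "under_provisioned"     # upgrade candidate; needs output validation
    return "matched"
```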
2025 Full Distribution
Reasoning appears only in the January o1 rows in this public 2025 data
Across 863,322 unique 2025 WildChat conversations available locally, the observed date range is 2025-01-01 to 2025-07-31. The last available month is July 2025, and a 1,000-row July sample is entirely GPT-4.1-mini. This means the natural 2025 sample is a poor test bed for mass reasoning overuse, but a useful test bed for whether mini-tier answers are already adequate or sometimes under-powered.
94.2% mini-tier, 3.5% GPT-4o, 2.4% o1 reasoning.
100% GPT-4o-mini.
100% GPT-4o-mini.
Switch month: 89.0% GPT-4.1-mini, 11.0% GPT-4o-mini.
100% GPT-4.1-mini.
100% GPT-4.1-mini.
100% GPT-4.1-mini; July random sample n=1,000.
Interpretation Change
For 2025 WildChat, the immediate problem is not mass reasoning use.
The earlier 10,000-row stratified panel is still useful for stress-testing reasoning and frontier overuse. The 2025 natural sample says something different: observed users are already mostly on OpenAI mini-tier non-reasoning models. The next empirical question is whether those answers are good enough, whether some should be upgraded, and which tasks can move further down to search, local tools, or true small local models.
Earlier Stratified Pilot
The 10,000-row run is now a stress-test panel, not the main 2025 estimate
This older panel deliberately mixed model-proportional rows with reasoning, multi-turn, non-English, and edge-case strata. It is useful for seeing the full route taxonomy and stress-testing reasoning-heavy replay, but it should not be read as the natural 2025 model-use distribution.
WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
What the user is really trying to accomplish, not just the words in the prompt.
Ask what level of AI would have been enough if the user had chosen deliberately.
Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
Example Display
Prompt examples are original excerpts.
WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.
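A small sketch of that display rule; only the 1,500-character cutoff comes from the text, and the field names are illustrative.

```python
CLIP_CHARS = 1500  # display cutoff described above

def example_card(episode_id: str, prompt: str, language: str) -> dict:
    """Build an example card: clip long prompts, keep the raw length and the id."""
    return {
        "episode_id": episode_id,            # points back to the full local record
        "language": language,                # original language and wording kept
        "prompt_excerpt": prompt[:CLIP_CHARS],
        "prompt_chars": len(prompt),         # raw prompt length shown alongside
        "clipped": len(prompt) > CLIP_CHARS,
    }
```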
Core Message
The problem is not using AI. The problem is using too much AI for the task.
In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: AI is an intensity choice, and users need information that helps them conserve high-compute AI when it is not needed.
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
Counterfactual baseline: every task is answered through a standard frontier path.
Stress baseline: every task is pushed through reasoning/test-time compute.
Careful-use benchmark: no-LLM, small LLM, standard frontier, long-context, reasoning, agent, or exclusion.
2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.
87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.
3,537 / 10,000 tasks need language capability, but not frontier execution.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?
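Written out, the two narrow questions reduce to one avoided-share ratio. The Wh totals below are the pilot values reported in the Energy Multipliers section.

```python
def avoided_share(baseline_wh: float, matched_wh: float) -> float:
    """Share of baseline cloud inference energy avoided by task matching."""
    return (baseline_wh - matched_wh) / baseline_wh

# Pilot totals from this page (10,000 rows):
standard_share = avoided_share(28_259.3, 25_420.7)    # ~0.100 vs. all-standard-frontier
reasoning_share = avoided_share(115_000.2, 27_575.5)  # ~0.760 vs. all-reasoning-frontier
```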
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed as base visible inference, multiple model responses, reasoning add-on, search add-on, and tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.
(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.
(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.
Uses EPA 0.394 kgCO2/kWh average electricity factor.
Heavy frontier reasoning replay minus task-matched execution.
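A minimal sketch of the additive per-row model under the central-scenario anchors from the Method section (0.85 Wh frontier base, 6.5 Wh reasoning response, 0.30 Wh per search, 0.04 Wh per 1,000 small-model tokens). The per-tool-call value and the exact combination rule for reasoning turns are assumptions, not values from this page.

```python
# Central-scenario anchors from the Method section; TOOL_CALL_WH is a placeholder.
FRONTIER_BASE_WH = 0.85       # one standard frontier work unit
REASONING_ADDON_WH = 6.5      # one reasoning/test-time-compute response
SEARCH_WH = 0.30              # per search query
SMALL_WH_PER_1K_TOK = 0.04    # Gemma-class small model, per 1,000 visible tokens
TOOL_CALL_WH = 0.10           # placeholder; not anchored on this page

def row_energy_wh(n_responses: int = 1, reasoning_responses: int = 0,
                  searches: int = 0, tool_calls: int = 0) -> float:
    """Energy for one chat row under the additive frontier accounting."""
    return (n_responses * FRONTIER_BASE_WH
            + reasoning_responses * REASONING_ADDON_WH
            + searches * SEARCH_WH
            + tool_calls * TOOL_CALL_WH)

def small_model_energy_wh(visible_tokens: int) -> float:
    """Energy for the same row routed to a T1 small local model."""
    return visible_tokens / 1000 * SMALL_WH_PER_1K_TOK

# Example: an ordinary two-response chat pushed through reasoning vs. matched to T1.
heavy = row_energy_wh(n_responses=2, reasoning_responses=2)   # 2*0.85 + 2*6.5 = 14.7 Wh
light = small_model_energy_wh(visible_tokens=1200)            # 0.048 Wh
```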
Interpretation
Search is one lower-intensity option, not the whole story.
The biggest sustainability mistake is not that a frontier model answers every task; it is that reasoning compute is spent on ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.
Scale-Up
We need two benchmarks, not one headline number
The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, careful use saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, careful use saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.
Approximate 2026 world population.
OpenAI public usage anchor.
Lower end is standard-frontier replay; upper end is reasoning-frontier replay.
EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.
All standard frontier
2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.
All reasoning frontier
11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.
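A worked sketch of the scale-up arithmetic using only numbers quoted on this page: the per-row savings bands, the EPA grid factor, and the forest-storage equivalency. It scales per million chats rather than assuming a messages-per-user rate.

```python
GRID_KG_PER_KWH = 0.394          # EPA U.S. average electricity factor
FOREST_T_PER_ACRE_YR = 0.77      # tCO2 stored per acre of U.S. forest per year

def per_million_chats(saving_wh_per_row: float) -> dict:
    """Scale a per-row Wh saving to one million chats."""
    kwh = saving_wh_per_row * 1_000_000 / 1_000
    t_co2 = kwh * GRID_KG_PER_KWH / 1_000
    return {"MWh": kwh / 1_000,
            "tCO2": t_co2,
            "forest_acre_years": t_co2 / FOREST_T_PER_ACRE_YR}

standard_band  = [per_million_chats(x) for x in (0.28, 0.68)]    # vs. all-standard replay
reasoning_band = [per_million_chats(x) for x in (8.74, 13.20)]   # vs. all-reasoning replay
```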
Research Rule
Identify the task first; decompose it second; decide whether reasoning is justified third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.
115,000.2 Wh across 10,000 pilot rows.
27,575.5 Wh after matching each task to its minimum sufficient execution tier.
Pilot-scale value; platform-scale value is shown in the scale-up section.
Appendix
Data coverage and frontier coefficient derivation
The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.
Public standard-query anchor from Epoch/OpenAI-era estimates.
0.34 Wh × 2.5 active-compute multiplier.
Central estimate when reasoning/test-time compute is invoked.
Heavy standard base / reasoning total for Pro-style paths.
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- EPA equivalencies: 0.77 metric ton CO2 per acre of U.S. forest storing carbon for one year
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Google Gemma 4: E2B/E4B small open models and 26B MoE active 3.8B parameters
- Google AI for Developers: Gemma 4 model sizes, context windows, and memory requirements
- FrugalGPT: cascaded model selection for lower cost
- RouteLLM: routing queries between cheaper and stronger LLMs
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M
Experimental study
The experiment must separate information from authority and defaults.
The referee risk is clear: a user may follow a recommendation because it looks authoritative, not because they learned the energy cost. The revised design uses five arms to isolate energy information, suggested AI intensity, and default preselection while preserving user override.
Low-risk rewrite. A small model is likely sufficient; frontier reasoning is unlikely to improve the result enough to justify the extra energy.
Definition
What "recommendation" means in this study
A recommendation is a visible, optional task-specific suggestion shown to the user before they choose a model. It is not hidden platform routing and it is not a command. It names the task type, proposes the least validated AI intensity likely to work, gives a short reason, and lets the user override.
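One way to represent that object, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    """A visible, optional, task-specific suggestion shown before model choice."""
    task_type: str          # e.g. "lookup", "rewrite", "calculation"
    suggested_tier: str     # least validated AI intensity likely to work
    reason: str             # the short explanation the user actually sees
    confidence: float       # drives whether a default may be preselected
    override_allowed: bool = True   # it is a suggestion, never a command
```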
Design
One experiment, five interface conditions
Each condition changes the choice architecture, not the underlying task. This separates belief updating from authority cues and default effects.
No energy label, no recommended alternative.
Users see the footprint but must decide what to do.
Users see the suggested AI intensity but no energy/cost numbers.
Users see both the footprint and the task-matched AI-use suggestion.
The lower-intensity option is selected when confidence is high; override remains explicit.
Measurement
The outcome is a frontier, not one number
Validation In The Experiment
Every claimed saving has to survive output checks
A lower-compute recommendation counts only when it produces a usable result. Objective tasks use tests or exact checks. Writing uses blind preference. Factual tasks require source support. High-stakes or uncertain tasks are abstentions, not savings. The experiment also includes falsification tasks where high-intensity AI is demonstrably better; a good intervention should not reduce frontier/reasoning use there.
First-pass research story
A user sees one button. The system sees many possible energy paths.
The paper should begin from that mismatch. Modern AI products make model choice feel frictionless, but under the surface they can invoke search, small language models, frontier inference, long context, test-time reasoning, or tool loops. The experiment asks whether users will choose differently when the path becomes visible.
Research Flow
From an observed mismatch to a behavioral experiment
The pilot shows what the world looks like when we read real chats as tasks. The paper becomes publishable when it proves two more things: lower-compute answers can be good enough, and users actually change choices when the interface gives them a usable alternative.
Observed task prevalence
Public WildChat episodes plus participant task logs. Estimate which real tasks are no-LLM, small-model, standard frontier, long-context, reasoning, agentic, or expert-only.
Output: frontier-avoidable task share with confidence intervals.
User perception and model choice
Survey and vignette experiment measuring whether users understand cost, energy, reasoning, long-context, agent loops, and the quality tradeoffs of small/open models.
Output: frontier preference scale and immediate recommendation effect.
Field intervention
A right-sizing interface compares normal use against energy labels plus task-matched recommendations and an easy lower-compute option with override.
Output: behavior, quality, latency, rework, and satisfaction effects.
What Makes The Claim Strong
Three things cannot be blurred together
Task share is not energy share
A large fraction of simple tasks does not automatically imply a large energy saving. Savings depend on the baseline: standard frontier replay, reasoning replay, long-context replay, or agentic replay.
Local is not automatically green
Local inference must be measured. Low-throughput personal hardware can be worse than optimized cloud serving. The correct claim is least sufficient execution under measured conditions.
Awareness is not behavior change
Showing watts alone is weak. The intervention must pair resource feedback with a recommended alternative and a one-click path that preserves task quality.
Questions
The paper should answer four questions in order
From labels to proof
A cheaper option only counts if the answer still works.
The pilot gives a map. The paper has to test the map. For a subset of tasks, we should actually run the no-LLM option, the small-model option, the standard frontier option, and the reasoning option, then evaluate which answers survive blind quality checks.
Annotation Upgrade
Stop asking annotators to jump straight to a tier
A tier by itself is too compressed. The adjudication sheet should expose the reasons: freshness, context, risk, deterministic solvability, language work, reasoning depth, and tool need. The final recommendation should fall out of those facts.
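A minimal sketch of a recommendation that falls out of the adjudicated facts; the guard conditions, thresholds, and ordering are illustrative, not the study's adjudication rule.

```python
def recommend_tier(facts: dict) -> str:
    """Derive the recommended tier from adjudication-sheet facts (illustrative)."""
    if facts["high_stakes"] or facts["uncertain"]:
        return "expert_review"            # conservative rule: never downroute
    if facts["deterministically_solvable"] or facts["needs_freshness_only"]:
        return "T0_no_llm"                # search, calculator, local software
    if facts["needs_tools"]:
        return "T5_agent"
    if facts["reasoning_depth"] == "deep":
        return "T4_reasoning"
    if facts["long_context"]:
        return "T3_long_context"
    if facts["language_work"] and not facts["frontier_quality_needed"]:
        return "T1_small_llm"
    return "T2_standard_frontier"
```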
Output Validation
How we prove lower compute is sufficient
Candidate options
No-LLM tool/search, local or small open model, standard frontier, frontier + reasoning, and agentic AI.
Generate answers
Run a stratified subset through multiple options with logged tokens, latency, cost, and estimated or measured Wh.
Blind evaluation
Use tests for objective tasks, pairwise preference for writing, factual checks for lookup, and expert exclusion for high-risk tasks.
Non-inferiority
Declare a task frontier-avoidable only if lower compute preserves usefulness, correctness, and rework within the pre-registered margin.
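A hedged sketch of one possible non-inferiority rule for blind pairwise preference data; the 10-point margin and the normal approximation are illustrative choices, not the pre-registered test.

```python
import math

def non_inferior(wins_low: int, wins_frontier: int, ties: int,
                 margin: float = 0.10, z: float = 1.96) -> bool:
    """Is the lower-compute answer not worse than frontier by more than the margin?
    Ties are split; preference share uses a normal-approximation lower bound."""
    n = wins_low + wins_frontier + ties
    if n == 0:
        return False
    p_low = (wins_low + 0.5 * ties) / n          # preference share for lower compute
    se = math.sqrt(p_low * (1 - p_low) / n)
    lower_bound = p_low - z * se
    return lower_bound >= 0.5 - margin           # not worse than parity by > margin
```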
Recommendation Metrics
The recommendation rule should be evaluated like a safety-critical classifier
Field experiment
A carbon label alone is not enough. The alternative has to be one click away.
The experiment should not simply tell people that AI has a footprint. It should say: this looks like search, this looks like a small-model rewrite, this one needs reasoning, and this one should not be assigned lower-intensity AI. Then we measure whether people accept, override, and still get the job done.
Experimental Arms
From resource awareness to actionable right-sizing
Control
Participants use their normal LLM workflow. We log task type, chosen model, tokens, time, and satisfaction.
Energy label only
Participants see estimated Wh, CO2e, dollar cost, and latency, but no recommended alternative.
Label + recommendation
The interface recommends search, local tool, small model, standard frontier, reasoning, agent, or abstain.
Default right-sized option
The lower-compute recommendation is preselected when confidence is high; users retain explicit override.
Outcome Dashboard
The frontier is time, quality, and carbon together
Behavioral Logic
Why recommendations may work when labels alone do not
User submits or describes the intended task.
The system estimates risk, freshness, reasoning, context, and tool need.
The interface shows an AI-use level and the reason for it.
The override becomes a revealed-preference measure.
Savings count only if the task is successful.
First pass manuscript
The Electricity Cost of Everyday AI
The paper starts from the pilot finding, not from the benchmark machinery: in 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. The sustainability question is why everyday users still reach for high-compute AI when a lower-energy option would work.
Major Framing
What we want the paper to do
The strongest version of the paper leads with empirical magnitudes. Some observed AI use is strict overuse: search, calculation, local software, or a specialized tool would have been enough. A larger share needs language capability but not frontier AI. The behavioral question is whether users know this energy gradient and whether information makes them choose differently.
Strict overuse
6.4% of pilot conversations appear answerable without an LLM: search, tools, or local software.
Simpler model enough
35.4% need language capability but appear small-model sufficient, not frontier necessary.
Reasoning overuse
If ordinary tasks default to reasoning, many become 2-4x or much more energy-intensive.
Subject response
Subjects see the task type and energy/cost comparison, then choose whether to use a lower-intensity option.
Abstract And Literature
The abstract should make one clean move: sustainable AI use is an electricity problem
Existing energy work shows that inference cost varies with model size, token length, reasoning, and serving efficiency. Our abstract should lead with what the data show: the share of strict overuse, the share that can use simpler models, the electricity multiples from reasoning overuse, and then the subject experiment.
Generative AI is turning electricity-intensive computation into an everyday consumer habit. In 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. If these ordinary tasks are handled by reasoning-heavy AI, many become 2-4x or much more energy-intensive. We then test whether subjects understand this energy ladder and whether a simple task-specific energy and cost intervention shifts them toward lower-intensity AI use.
Possible Results
The results should be stated as tradeoffs, not slogans
The paper should make the two behavioral results front and center, then connect them to the two benchmark worlds. If all tasks were standard frontier, savings are modest. If ordinary tasks default to reasoning, savings are large. The information intervention tells us whether users can be guided toward careful AI use.
Subjects do not correctly perceive the resource gradient between search, small models, frontier, reasoning, and agents.
A simple task-specific recommendation changes model choice toward the right level of AI intensity.
Avoiding standard-frontier overuse yields modest savings; avoiding reasoning-frontier overuse yields much larger savings.
The welfare claim survives only if lower-compute options preserve task success and avoid rework.
Research roadmap
The referee response is clear: finish the empirical core.
The question is strong, but the current draft reads like a research design memo. The next version needs two audited empirical modules: computational validation of task sufficiency, and a randomized experiment that separates information from recommendation authority and defaults.
Referee Diagnosis
The revised paper needs two completed studies, not better prose
The referee consensus is not that the framing is weak. It is that the causal and measurement claims are premature. The plan below converts each critique into a concrete empirical requirement.
Self-Critique And ROI
The next dollar should buy validation, not more labels
The current 10,000-row result is a strong pilot, but it is still classifier evidence. The highest-ROI next step is to prove that the lower-energy option actually works: no-LLM cases must be executable without an LLM, small-model cases must survive output validation, and standard-frontier cases must be shown not to need reasoning. Expanding to more raw chats before this validation would increase precision around an unverified construct.
In the central 10,000-row accounting, task matching saves 2.84 kWh versus all-standard-frontier use and 87.42 kWh versus all-reasoning-frontier use. Per one million chats, that scales to 0.28 MWh and 8.74 MWh. The best research ROI is therefore not another larger pilot. It is validating the two margins that create the headline: T1 small-model sufficiency and T2 reasoning avoidance.
Computational Plan
Prove which lower-energy answers actually work
This module turns the 10,000-row classifier pilot into auditable evidence. The unit is no longer a label; it is a task with generated alternatives, measured energy, quality ratings, and rework.
Gold-label 1,500 tasks
Stratify by tier and uncertainty: 250 T0, 500 T1, 350 T2, 200 T4, 100 T5/T6, and 100 low-confidence or disagreement-prone cases. Two annotators label task, risk, no-LLM feasibility, small-model sufficiency, frontier need, and reasoning need; disagreements are adjudicated.
Output-validate 600-800 tasks
Run no-LLM paths where applicable, Gemma-class local small models through Ollama for T1, standard frontier for T2, and reasoning frontier for T4. Use blind preference for writing, exact checks for objective tasks, source checks for lookup, and rework as a penalty.
Measure energy per successful task
For small-model candidates, record idle-subtracted Wh, latency, tokens/sec, and failure/retry rate on the actual host machine. Compare against standard-frontier and reasoning-frontier accounting as Wh per successful task, not Wh per request.
Recompute the headline with uncertainty
Report frontier-avoidable share with confidence intervals and false-positive adjustment. The abstract should use validated rates, not raw classifier rates, once this module is complete.
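A sketch of how steps two and three could be instrumented, assuming the ollama Python client, a Gemma-class local model tag, and an external wall-power reading; the model tag, wattages, and field names are placeholders, not the study's instrumentation.

```python
import time
import ollama  # pip install ollama; requires a local Ollama server

IDLE_WATTS = 8.0  # placeholder idle draw of the host, measured separately

def run_small_model(prompt: str, model: str = "gemma2:2b",
                    avg_active_watts: float = 35.0) -> dict:
    """One T1 generation with tokens, latency, and idle-subtracted Wh."""
    t0 = time.time()
    resp = ollama.generate(model=model, prompt=prompt)
    seconds = time.time() - t0
    active_wh = (avg_active_watts - IDLE_WATTS) * seconds / 3600
    return {
        "answer": resp["response"],
        "output_tokens": resp.get("eval_count"),
        "latency_s": seconds,
        "wh": active_wh,
    }

def wh_per_successful_task(runs: list[dict], successes: list[bool]) -> float:
    """Charge energy to successes only: Wh per successful task, not per request."""
    total_wh = sum(r["wh"] for r in runs)
    n_success = sum(successes)
    return total_wh / n_success if n_success else float("inf")
```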
Experimental Plan
Test whether information changes choices without degrading quality
This module should be pre-registered after computational validation. Its job is not to prove small models work; that is the computational module. Its job is to test whether users understand and act on the energy ladder when the lower-intensity option has already been validated.
Five arms
Randomize subjects to control, energy label only, recommendation only, label plus recommendation, and lower-intensity default with override. This separates information from authority and default effects.
High-quality subject pool
Recruit a documented subject pool, target roughly 800 subjects with 160 per arm, and record AI experience, baseline energy knowledge, domain familiarity, and environmental attitudes.
Two primary outcomes
Primary outcomes are unnecessary high-compute choice and Wh per successful task. Secondary outcomes are model choice, reasoning share, override, time, satisfaction, perceived quality, tokens, latency, and rework with multiple-testing correction.
Include high-intensity-needed tasks
The intervention should lower high-compute use on validated-equivalent tasks, but not on tasks where frontier or reasoning outputs are demonstrably better. This separates conservation from anti-compute nudging.
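A small planning sketch for the five-arm assignment and a power sanity check; n = 160 per arm comes from the plan above, while the Cohen's d = 0.3 effect size is a placeholder assumption (uses statsmodels).

```python
import random
from statsmodels.stats.power import TTestIndPower

ARMS = ["control", "label_only", "recommendation_only",
        "label_plus_recommendation", "default_with_override"]

def assign(subject_ids: list[str], seed: int = 7) -> dict:
    """Balanced random assignment across the five arms."""
    rng = random.Random(seed)
    ids = subject_ids[:]
    rng.shuffle(ids)
    return {sid: ARMS[i % len(ARMS)] for i, sid in enumerate(ids)}

# Power for one pairwise arm contrast at n = 160 per arm (assumed d = 0.3).
power = TTestIndPower().power(effect_size=0.3, nobs1=160, alpha=0.05, ratio=1.0)
```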