Three-stratum model-use study
How much electricity is wasted when everyday AI tasks are run on more model than they need?
The study starts from a simple concern: users may be sending ordinary tasks to unnecessarily expensive AI modes. We therefore do not read WildChat's raw model distribution as truth. Instead, we use public 2025 conversations to build three model-use strata, classify what each task actually needed, and compute how much electricity a user could save by switching down when the task allows.
First-Page Setup
Sample by model used, classify by model needed, then price the gap.
- Sample: Take 500 rows from each 2025 model-use class.
- Classify: Ask whether the task needs reasoning, frontier, small model, search, or a local tool.
- Calculate: Within each model class, compute average avoidable Wh and percentage saving.
- Aggregate: Combine model-class savings under NBER-inspired mixture scenarios.
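A minimal sketch of this sample-classify-price flow, assuming placeholder per-route electricity costs and randomly generated labels; ROUTE_WH, the toy strata, and the scenario weights are illustrative stand-ins, not measured values.

```python
# Sketch of the sample -> classify -> calculate -> aggregate flow.
# All coefficients and labels below are illustrative assumptions.
import random
from statistics import mean

# Assumed per-row electricity by execution route (Wh); placeholders only.
ROUTE_WH = {"local_tool": 0.05, "small": 0.14, "frontier": 0.85, "reasoning": 6.5}

def avoidable_wh(used_route, needed_route):
    """Electricity saved if the row is switched down to the route it needed."""
    return max(ROUTE_WH[used_route] - ROUTE_WH[needed_route], 0.0)

def class_summary(rows, used_route):
    """Average avoidable Wh and share of chosen-model energy within one class."""
    saved = [avoidable_wh(used_route, r["needed"]) for r in rows]
    return mean(saved), mean(saved) / ROUTE_WH[used_route]

# Toy stand-in for 500 classified rows per model-use class.
random.seed(0)
strata = {
    "reasoning": [{"needed": random.choice(list(ROUTE_WH))} for _ in range(500)],
    "frontier": [{"needed": random.choice(["frontier", "small", "local_tool"])} for _ in range(500)],
    "small": [{"needed": random.choice(["small", "local_tool"])} for _ in range(500)],
}

# Aggregate under a transparent scenario mix (e.g. the 30/60/10 scenario), not a measured distribution.
scenario_mix = {"reasoning": 0.30, "frontier": 0.60, "small": 0.10}
per_class = {c: class_summary(rows, c) for c, rows in strata.items()}
blended_avoidable_wh = sum(scenario_mix[c] * per_class[c][0] for c in scenario_mix)
print(per_class, round(blended_avoidable_wh, 3))
```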
Operational Design
For each model class, estimate what share of tasks could have used less compute
The unit is a conversation row with a visible model class. For each row, we classify the true task demand: whether reasoning was needed, whether a smaller model was enough, whether local search/tool use was enough, or whether the original model class was justified. This produces a within-class saving curve rather than a claim about the true global frequency of each model.
Gemini Classified Results
The three model classes produce very different savings margins
These are actual Gemini labels from the 1,500-row model-stratified sample. The key result is not a single global number. It is a conditional statement: if a user is currently using a given model class, how often could the task have been handled by a lower-compute route?
Average avoidable electricity per reasoning-row under central assumptions; 57.2% of chosen-model energy.
Average avoidable electricity per frontier-row; 13.3% of chosen-model energy.
Average upgrade pressure per small-row; many small-model rows appear to need frontier quality.
Classifier Questions
The labels should answer exactly the switching question users face
We are not classifying broad topics for their own sake. The classification decides the lowest sufficient execution path for a task, so the electricity comparison is meaningful.
Within-Class Output
Each model class gets its own saving estimate.
The reasoning stratum is the clearest over-compute margin: only 19.0% of sampled reasoning rows were classified as truly requiring reasoning, while 46.4% needed only frontier non-reasoning, 21.0% needed only a small model, and 5.2% could be handled by local search/tool paths. The frontier non-reasoning stratum has a smaller but still meaningful downshift margin. The small-model stratum mostly tells a different story: some tasks should be upgraded, so the paper must report quality-adjusted savings rather than energy alone.
Electricity Calculation
Compute average savings inside each model class, then combine classes
For each class used by the user, we estimate both avoidable electricity and upgrade pressure. Avoidable electricity counts cases where a lower-compute route is sufficient. Upgrade pressure counts cases where the observed model class was too weak and quality would require more compute.
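A minimal sketch of the two per-row quantities, assuming hypothetical tier energies and a fixed tier ordering; TIER_WH and the example calls are placeholders, not the study's calibrated coefficients.

```python
# Sketch: within one model class, separate avoidable electricity (a lower tier
# suffices) from upgrade pressure (the chosen class was too weak). Assumed values.
TIER_WH = {"local_tool": 0.05, "small": 0.14, "frontier": 0.85, "reasoning": 6.5}
TIER_RANK = {t: i for i, t in enumerate(["local_tool", "small", "frontier", "reasoning"])}

def split_row(used, needed):
    """Return (avoidable_wh, upgrade_wh) for one classified row."""
    if TIER_RANK[needed] < TIER_RANK[used]:   # downshift is sufficient -> saving
        return TIER_WH[used] - TIER_WH[needed], 0.0
    if TIER_RANK[needed] > TIER_RANK[used]:   # quality would require more compute
        return 0.0, TIER_WH[needed] - TIER_WH[used]
    return 0.0, 0.0                           # chosen class was justified

print(split_row("reasoning", "small"))   # avoidable electricity
print(split_row("small", "frontier"))    # upgrade pressure, reported separately
```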
How NBER Enters
NBER anchors the usage mix; our scenario anchors the model mix.
OpenAI/NBER tells us representative ChatGPT use is not mostly coding or benchmarks: by topic it is dominated by Practical Guidance, Writing, and Seeking Information, and by intent it splits into Asking, Doing, and Expressing. We use those task-mix margins to keep the 500-row strata from becoming unrepresentative. The 30/60/10 model mixture is not claimed as a measured fact; it is a transparent scenario that can be varied.
OpenAI/NBER Calibration Bridge
The 1,000 rows now speak the same taxonomy as OpenAI Signals
We classified the same 2025 WildChat sample into the public OpenAI/NBER usage taxonomy: work-related, work/school/other, Asking/Doing/Expressing, 24 fine topics, and seven coarse topics. This lets us compare our public-text sample against the OpenAI Signals aggregate margins before making any population claim.
Codex classified public WildChat messages with up to 10 prior messages as context, matching the paper's message-level logic.
OpenAI Signals publishes aggregate shares from monthly consumer ChatGPT samples, not row-level messages.
The raw public sample is biased toward Doing/Writing, so it should be raked to Signals margins before headline estimates.
Ground Truth Split
Signals gives population margins, not private message rows.
The internal 1.1M ChatGPT sample is not publicly released. The public ground truth is aggregate: topic, work-related status, work/school/other, and Asking/Doing/Expressing shares. We use those margins to reweight public WildChat/ShareChat text, while requesting the paper's 100,000 classified WildChat validation sample from the authors.
Usage Taxonomy × Compute Demand
After joining taxonomy with tiers, the compute story varies sharply by topic
This bridge table is the first version of the paper's empirical spine: user demand on the left, required execution tier on the right. It is still uncalibrated, but it shows where validation should focus.
Next Calculation
Do not report the raw 1,000-row shares as population facts.
The right next statistic is a calibrated estimate: reweight this labeled public sample to OpenAI Signals topic, intent, and work/school margins, then recompute no-LLM share, small-model-sufficient share, frontier-standard share, and reasoning-needed share. A separate model-stress panel should handle GPT-4o, o1/o3, GPT-4.5, and o4-mini cases because the natural 2025 WildChat sample is almost entirely mini-tier.
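A minimal sketch of the reweighting step, assuming a small toy sample and hypothetical target margins; the `rake` helper, column names, and margin shares are placeholders for the real Signals aggregates.

```python
# Sketch of raking labeled public rows to published aggregate margins.
# Margins and columns below are hypothetical, not OpenAI Signals values.
import pandas as pd

def rake(df, margins, weight_col="w", iters=25):
    """Iterative proportional fitting: adjust row weights until each
    categorical margin matches its target share."""
    df = df.copy()
    df[weight_col] = 1.0
    for _ in range(iters):
        for col, target in margins.items():
            shares = df.groupby(col)[weight_col].sum() / df[weight_col].sum()
            df[weight_col] *= df[col].map(lambda v: target[v] / shares[v])
    return df

sample = pd.DataFrame({  # toy stand-in for the labeled public rows
    "intent": ["Doing", "Doing", "Asking", "Expressing"],
    "work": ["work", "non_work", "non_work", "non_work"],
    "needs": ["small", "frontier", "no_llm", "small"],
})
margins = {  # hypothetical target shares
    "intent": {"Asking": 0.5, "Doing": 0.4, "Expressing": 0.1},
    "work": {"work": 0.3, "non_work": 0.7},
}
raked = rake(sample, margins)
# Calibrated tier shares: weighted, not raw, proportions.
print(raked.groupby("needs")["w"].sum() / raked["w"].sum())
```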
Earlier Stratified Pilot
The 10,000-row run is now a stress-test panel, not the main 2025 estimate
This older panel deliberately mixed model-proportional rows with reasoning, multi-turn, non-English, and edge-case strata. It is useful for seeing the full route taxonomy and stress-testing reasoning-heavy replay, but it should not be read as the natural 2025 model-use distribution.
WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
What the user is really trying to accomplish, not just the words in the prompt.
Ask what level of AI would have been enough if the user had chosen deliberately.
Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
Example Display
Prompt examples are original excerpts.
WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.
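A minimal sketch of the example-card clipping rule described above; the episode id and helper name are hypothetical.

```python
# Keep original wording, clip long prompts to 1,500 characters,
# and report the raw prompt length plus the episode id.
MAX_CHARS = 1500

def example_card(episode_id: str, prompt: str) -> dict:
    clipped = prompt if len(prompt) <= MAX_CHARS else prompt[:MAX_CHARS] + " […]"
    return {"episode_id": episode_id, "prompt_excerpt": clipped,
            "raw_prompt_chars": len(prompt)}

print(example_card("wildchat-000001", "x" * 4000)["raw_prompt_chars"])  # 4000
```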
Core Message
The problem is not using AI. The problem is using too much AI for the task.
In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: AI is an intensity choice, and users need information that helps them conserve high-compute AI when it is not needed.
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
Counterfactual baseline: every task is answered through a standard frontier path.
Stress baseline: every task is pushed through reasoning/test-time compute.
Careful-use benchmark: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion.
2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.
87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.
3,537 / 10,000 tasks need language capability, but not frontier execution.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed into a base visible-inference term plus add-ons for additional model responses, reasoning, search, and tool use. Each actual task is then compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.
(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.
(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.
Uses EPA 0.394 kgCO2/kWh average electricity factor.
Heavy frontier reasoning replay minus task-matched execution.
Interpretation
Search is one lower-intensity option, not the whole story.
The biggest sustainability mistake is not that a frontier model answers every task; it is that reasoning compute is spent on ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only from a small model.
Scale-Up
We need two benchmarks, not one headline number
The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, careful use saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, careful use saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.
Approximate 2026 world population.
OpenAI public usage anchor.
Lower end is standard-frontier replay; upper end is reasoning-frontier replay.
EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.
All standard frontier
2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.
All reasoning frontier
11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.
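A worked sketch of the scale-up arithmetic, assuming a placeholder annual chat volume; only the per-row saving range and the EPA conversion factors come from this page, and the 1 billion-row input is purely illustrative.

```python
# Per-row Wh saving -> platform electricity, CO2, and forest-acre equivalents.
EPA_KGCO2_PER_KWH = 0.394          # EPA eGRID U.S. average
FOREST_TCO2_PER_ACRE_YEAR = 0.77   # EPA forest-storage equivalency

def scale_up(saved_wh_per_row, rows):
    kwh = saved_wh_per_row * rows / 1000.0
    tco2 = kwh * EPA_KGCO2_PER_KWH / 1000.0
    acres = tco2 / FOREST_TCO2_PER_ACRE_YEAR
    return round(kwh), round(tco2, 1), round(acres, 1)

# Illustrative volume only: 1 billion rows at the standard-replay lower bound.
print(scale_up(0.28, 1_000_000_000))  # ~280,000 kWh, ~110.3 tCO2, ~143.3 acres
```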
Research Rule
Identify the task first; decompose it second; decide whether reasoning is justified third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.
115,000.2 Wh across 10,000 pilot rows.
27,575.5 Wh after matching each task to its minimum sufficient execution tier.
Pilot-scale value; platform-scale value is shown in the scale-up section.
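A minimal sketch of the additive task-energy model with the stated central coefficients; the decomposition fields and the example row are illustrative, and a heavy-scenario coefficient set could be swapped in the same way.

```python
# Central-scenario coefficients from the method description above.
CENTRAL = {
    "frontier_base_wh": 0.85,        # 0.34 Wh x 2.5 active-compute multiplier
    "reasoning_response_wh": 6.5,    # reasoning/test-time add-on per response
    "small_wh_per_1k_tokens": 0.04,  # Gemma-class small-model inference
    "search_wh_per_query": 0.30,     # per external search query
}

def replay_wh(responses=1, reasoning_responses=0, search_queries=0, coeff=CENTRAL):
    """Additive energy for one replayed row: visible frontier responses plus
    reasoning and search add-ons."""
    return (responses * coeff["frontier_base_wh"]
            + reasoning_responses * coeff["reasoning_response_wh"]
            + search_queries * coeff["search_wh_per_query"])

def small_model_wh(visible_tokens, coeff=CENTRAL):
    """Energy if the same task is served by a small model instead."""
    return visible_tokens / 1000.0 * coeff["small_wh_per_1k_tokens"]

# Illustrative row: one reasoning-replayed answer vs a 600-token small-model answer.
print(replay_wh(responses=1, reasoning_responses=1), small_model_wh(600))  # 7.35 vs 0.024 Wh
```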
Appendix
Data coverage and frontier coefficient derivation
The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.
Public standard-query anchor from Epoch/OpenAI-era estimates.
0.34 Wh × 2.5 active-compute multiplier.
Central estimate when reasoning/test-time compute is invoked.
Heavy standard base / reasoning total for Pro-style paths.
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- EPA equivalencies: 0.77 metric ton CO2 per acre of U.S. forest storing carbon for one year
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Google Gemma 4: E2B/E4B small open models and a 26B MoE with 3.8B active parameters
- Google AI for Developers: Gemma 4 model sizes, context windows, and memory requirements
- FrugalGPT: cascaded model selection for lower cost
- RouteLLM: routing queries between cheaper and stronger LLMs
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M
Experimental study
The experiment must separate information from authority and defaults.
The referee risk is clear: a user may follow a recommendation because it looks authoritative, not because they learned the energy cost. The revised design uses five arms to isolate energy information, suggested AI intensity, and default preselection while preserving user override.
Low-risk rewrite. A small model is likely sufficient; frontier reasoning is unlikely to improve the result enough to justify the extra energy.
Definition
What "recommendation" means in this study
A recommendation is a visible, optional task-specific suggestion shown to the user before they choose a model. It is not hidden platform routing and it is not a command. It names the task type, proposes the lowest AI intensity validated as likely to work, gives a short reason, and lets the user override.
Design
One experiment, five interface conditions
Each condition changes the choice architecture, not the underlying task. This separates belief updating from authority cues and default effects.
No energy label, no recommended alternative.
Users see the footprint but must decide what to do.
Users see the suggested AI intensity but no energy/cost numbers.
Users see both the footprint and the task-matched AI-use suggestion.
The lower-intensity option is selected when confidence is high; override remains explicit.
Measurement
The outcome is a frontier, not one number
Validation In The Experiment
Every claimed saving has to survive output checks
A lower-compute recommendation counts only when it produces a usable result. Objective tasks use tests or exact checks. Writing uses blind preference. Factual tasks require source support. High-stakes or uncertain tasks are abstentions, not savings. The experiment also includes falsification tasks where high-intensity AI is demonstrably better; a good intervention should not reduce frontier/reasoning use there.
First-pass research story
A user sees one button. The system sees many possible energy paths.
The paper should begin from that mismatch. Modern AI products make model choice feel frictionless, but under the surface they can invoke search, small language models, frontier inference, long context, test-time reasoning, or tool loops. The experiment asks whether users will choose differently when the path becomes visible.
Research Flow
From an observed mismatch to a behavioral experiment
The pilot shows what the world looks like when we read real chats as tasks. The paper becomes publishable when it proves two more things: lower-compute answers can be good enough, and users actually change choices when the interface gives them a usable alternative.
Observed task prevalence
Public WildChat episodes plus participant task logs. Estimate which real tasks are no-LLM, small-model, standard frontier, long-context, reasoning, agentic, or expert-only.
Output: frontier-avoidable task share with confidence intervals.
User perception and model choice
Survey and vignette experiment measuring whether users understand cost, energy, reasoning, long-context, agent loops, and the quality tradeoffs of small/open models.
Output: frontier preference scale and immediate recommendation effect.
Field intervention
A right-sizing interface compares normal use against energy labels plus task-matched recommendations and an easy lower-compute option with override.
Output: behavior, quality, latency, rework, and satisfaction effects.
What Makes The Claim Strong
Three things cannot be blurred together
Task share is not energy share
A large fraction of simple tasks does not automatically imply a large energy saving. Savings depend on the baseline: standard frontier replay, reasoning replay, long-context replay, or agentic replay.
Local is not automatically green
Local inference must be measured. Low-throughput personal hardware can be worse than optimized cloud serving. The correct claim is least sufficient execution under measured conditions.
Awareness is not behavior change
Showing watts alone is weak. The intervention must pair resource feedback with a recommended alternative and a one-click path that preserves task quality.
Questions
The paper should answer four questions in order
From labels to proof
A cheaper option only counts if the answer still works.
The pilot gives a map. The paper has to test the map. For a subset of tasks, we should actually run the no-LLM option, the small-model option, the standard frontier option, and the reasoning option, then evaluate which answers survive blind quality checks.
Annotation Upgrade
Stop asking annotators to jump straight to a tier
A tier by itself is too compressed. The adjudication sheet should expose the reasons: freshness, context, risk, deterministic solvability, language work, reasoning depth, and tool need. The final recommendation should fall out of those facts.
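A minimal sketch of how a recommendation could fall out of the adjudicated facts rather than a direct tier label; the field names, rule order, and thresholds are assumptions for illustration, not the study's adjudication sheet.

```python
# Map annotated reasons (freshness, risk, determinism, language work, reasoning
# depth, tool need) to the lowest sufficient execution path.
def recommend_tier(facts: dict) -> str:
    if facts.get("high_stakes"):                 # exclude; never counted as a saving
        return "expert_or_abstain"
    if facts.get("deterministic_solvable"):      # calculator, script, lookup table
        return "no_llm_tool"
    if facts.get("needs_fresh_info") and not facts.get("needs_language_work"):
        return "search"
    if facts.get("needs_tools") or facts.get("multi_step_environment"):
        return "agent"
    if facts.get("reasoning_depth", 0) >= 2:     # multi-step derivation or proof
        return "reasoning_frontier"
    if facts.get("long_context"):
        return "long_context_frontier"
    if facts.get("needs_language_work") and not facts.get("frontier_quality_needed"):
        return "small_model"
    return "standard_frontier"

print(recommend_tier({"needs_language_work": True}))  # -> small_model
```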
Output Validation
How we prove lower compute is sufficient
Candidate options
No-LLM tool/search, local or small open model, standard frontier, frontier + reasoning, and agentic AI.
Generate answers
Run a stratified subset through multiple options with logged tokens, latency, cost, and estimated or measured Wh.
Blind evaluation
Use tests for objective tasks, pairwise preference for writing, factual checks for lookup, and expert exclusion for high-risk tasks.
Non-inferiority
Declare a task frontier-avoidable only if lower compute preserves usefulness, correctness, and rework within the pre-registered margin.
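A minimal sketch of the non-inferiority check, using a normal-approximation one-sided comparison of success rates; the margin, critical value, and counts are placeholders, and the real test would follow the pre-registered analysis plan.

```python
# Lower-compute route counts as frontier-avoidable only if the lower bound of
# (p_low - p_high) stays above the pre-registered non-inferiority margin.
from math import sqrt

def non_inferior(success_low, n_low, success_high, n_high, margin=0.05, z=1.645):
    p_low, p_high = success_low / n_low, success_high / n_high
    se = sqrt(p_low * (1 - p_low) / n_low + p_high * (1 - p_high) / n_high)
    lower_bound = (p_low - p_high) - z * se
    return lower_bound > -margin, round(lower_bound, 3)

# Illustrative counts only, not study results.
print(non_inferior(success_low=183, n_low=200, success_high=188, n_high=200))
```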
Recommendation Metrics
The recommendation rule should be evaluated like a safety-critical classifier
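A minimal sketch of that framing: score the rule on validated outcomes, where a false downshift (quality lost after recommending less compute) is the costly error; the record format is an assumption.

```python
# records: (recommended_downshift, downshift_was_sufficient) pairs from validation.
def downshift_metrics(records):
    tp = sum(1 for rec, ok in records if rec and ok)      # energy saved, quality held
    fp = sum(1 for rec, ok in records if rec and not ok)  # costly error: quality lost
    fn = sum(1 for rec, ok in records if not rec and ok)  # missed saving
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return {"precision": precision, "recall": recall, "false_downshifts": fp}

print(downshift_metrics([(True, True), (True, False), (False, True), (False, False)]))
```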
Field experiment
A carbon label alone is not enough. The alternative has to be one click away.
The experiment should not simply tell people that AI has a footprint. It should say: this looks like search, this looks like a small-model rewrite, this one needs reasoning, and this one should not be assigned lower-intensity AI. Then we measure whether people accept, override, and still get the job done.
Experimental Arms
From resource awareness to actionable right-sizing
Control
Participants use their normal LLM workflow. We log task type, chosen model, tokens, time, and satisfaction.
Energy label only
Participants see estimated Wh, CO2e, dollar cost, and latency, but no recommended alternative.
Label + recommendation
The interface recommends search, local tool, small model, standard frontier, reasoning, agent, or abstain.
Default right-sized option
The lower-compute recommendation is preselected when confidence is high; users retain explicit override.
Outcome Dashboard
The frontier is time, quality, and carbon together
Behavioral Logic
Why recommendations may work when labels alone do not
User submits or describes the intended task.
The system estimates risk, freshness, reasoning, context, and tool need.
The interface shows an AI-use level and the reason for it.
The override becomes a revealed-preference measure.
Savings count only if the task is successful.
First pass manuscript
The Electricity Cost of Everyday AI
The paper starts from the pilot finding, not from the benchmark machinery: in 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. The sustainability question is why everyday users still reach for high-compute AI when a lower-energy option would work.
Major Framing
What we want the paper to do
The strongest version of the paper leads with empirical magnitudes. Some observed AI use is strict overuse: search, calculation, local software, or a specialized tool would have been enough. A larger share needs language capability but not frontier AI. The behavioral question is whether users know this energy gradient and whether information makes them choose differently.
Strict overuse
6.4% of pilot conversations appear answerable without an LLM: search, tools, or local software.
Simpler model enough
35.4% need language capability but appear small-model sufficient, not frontier necessary.
Reasoning overuse
If ordinary tasks default to reasoning, many become 2-4x or much more energy-intensive.
Subject response
Subjects see the task type and energy/cost comparison, then choose whether to use a lower-intensity option.
Abstract And Literature
The abstract should make one clean move: sustainable AI use is an electricity problem
Existing energy work shows that inference cost varies with model size, token length, reasoning, and serving efficiency. Our abstract should lead with what the data show: the share of strict overuse, the share that can use simpler models, the electricity multiples from reasoning overuse, and then the subject experiment.
Generative AI is turning electricity-intensive computation into an everyday consumer habit. In 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. If these ordinary tasks are handled by reasoning-heavy AI, many become 2-4x or much more energy-intensive. We then test whether subjects understand this energy ladder and whether a simple task-specific energy and cost intervention shifts them toward lower-intensity AI use.
Possible Results
The results should be stated as tradeoffs, not slogans
The paper should make the two behavioral results front and center, then connect them to the two benchmark worlds. If all tasks were standard frontier, savings are modest. If ordinary tasks default to reasoning, savings are large. The information intervention tells us whether users can be guided toward careful AI use.
Subjects do not correctly perceive the resource gradient between search, small models, frontier, reasoning, and agents.
A simple task-specific recommendation changes model choice toward the right level of AI intensity.
Standard-frontier overuse yields modest savings; reasoning-frontier overuse yields much larger savings.
The welfare claim survives only if lower-compute options preserve task success and avoid rework.
Research roadmap
The referee response is clear: finish the empirical core.
The question is strong, but the current draft reads like a research design memo. The next version needs two audited empirical modules: computational validation of task sufficiency, and a randomized experiment that separates information from recommendation authority and defaults.
Referee Diagnosis
The revised paper needs two completed studies, not better prose
The referee consensus is not that the framing is weak. It is that the causal and measurement claims are premature. The plan below converts each critique into a concrete empirical requirement.
Self-Critique And ROI
The next dollar should buy validation, not more labels
The current 10,000-row result is a strong pilot, but it is still classifier evidence. The highest-ROI next step is to prove that the lower-energy option actually works: no-LLM cases must be executable without an LLM, small-model cases must survive output validation, and standard-frontier cases must be shown not to need reasoning. Expanding to more raw chats before this validation would increase precision around an unverified construct.
In the central 10,000-row accounting, task matching saves 2.84 kWh versus all-standard-frontier use and 87.42 kWh versus all-reasoning-frontier use. Per one million chats, that scales to 0.28 MWh and 8.74 MWh. The best research ROI is therefore not another larger pilot. It is validating the two margins that create the headline: T1 small-model sufficiency and T2 reasoning avoidance.
Data Access And Back-Out Plan
Use their public taxonomy now; request their WildChat validation labels in parallel
The OpenAI/NBER paper does not release the internal ChatGPT message rows. What it does give us is enough structure to back out a credible sampling design: public classifier prompts, public aggregate Signals margins, and a stated 100,000-row public WildChat validation sample that we should request from the authors.
Classify public chat rows into work/non-work, work/school/other, Asking/Doing/Expressing, and the 24 topic categories.
Measure how far WildChat/ShareChat public rows deviate from OpenAI Signals aggregate usage shares.
Weight or quota-sample public rows until topic, intent, and work/school distributions match Signals.
Estimate no-LLM, small-model, frontier, reasoning, and agent shares under the calibrated task mix.
Computational Plan
Prove which lower-energy answers actually work
This module turns the 10,000-row classifier pilot into auditable evidence. The unit is no longer a label; it is a task with generated alternatives, measured energy, quality ratings, and rework.
Gold-label 1,500 tasks
Stratify by tier and uncertainty: 250 T0, 500 T1, 350 T2, 200 T4, 100 T5/T6, and 100 low-confidence or disagreement-prone cases. Two annotators label task, risk, no-LLM feasibility, small-model sufficiency, frontier need, and reasoning need; disagreements are adjudicated.
Output-validate 600-800 tasks
Run no-LLM paths where applicable, Gemma-class local small models through Ollama for T1, standard frontier for T2, and reasoning frontier for T4. Use blind preference for writing, exact checks for objective tasks, source checks for lookup, and rework as a penalty.
Measure energy per successful task
For small-model candidates, record idle-subtracted Wh, latency, tokens/sec, and failure/retry rate on the actual host machine. Compare against standard-frontier and reasoning-frontier accounting as Wh per successful task, not Wh per request.
Recompute the headline with uncertainty
Report frontier-avoidable share with confidence intervals and false-positive adjustment. The abstract should use validated rates, not raw classifier rates, once this module is complete.
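A minimal sketch of the energy metric for the local runs: idle-subtracted Wh attributed per successful task, with failed attempts and retries still counted as cost; the field names and example measurements are hypothetical.

```python
def wh_per_successful_task(runs, idle_watts):
    """runs: dicts with avg_watts, seconds, and a success flag per attempt."""
    total_wh = sum((r["avg_watts"] - idle_watts) * r["seconds"] / 3600.0 for r in runs)
    successes = sum(1 for r in runs if r["success"])
    return total_wh / successes if successes else float("inf")

runs = [  # illustrative host-machine measurements
    {"avg_watts": 68.0, "seconds": 22.0, "success": True},
    {"avg_watts": 71.0, "seconds": 30.0, "success": False},  # retry counted as cost
    {"avg_watts": 69.0, "seconds": 25.0, "success": True},
]
print(round(wh_per_successful_task(runs, idle_watts=24.0), 3))
```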
Experimental Plan
Test whether information changes choices without degrading quality
This module should be pre-registered after computational validation. Its job is not to prove small models work; that is the computational module. Its job is to test whether users understand and act on the energy ladder when the lower-intensity option has already been validated.
Five arms
Randomize subjects to control, energy label only, recommendation only, label plus recommendation, and lower-intensity default with override. This separates information from authority and default effects.
High-quality subject pool
Recruit a documented subject pool, target roughly 800 subjects with 160 per arm, and record AI experience, baseline energy knowledge, domain familiarity, and environmental attitudes.
Two primary outcomes
Primary outcomes are unnecessary high-compute choice and Wh per successful task. Secondary outcomes are model choice, reasoning share, override, time, satisfaction, perceived quality, tokens, latency, and rework with multiple-testing correction.
Include high-intensity-needed tasks
The intervention should lower high-compute use on validated-equivalent tasks, but not on tasks where frontier or reasoning outputs are demonstrably better. This separates conservation from anti-compute nudging.
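A minimal sketch of a power check for the 160-per-arm design on a standardized outcome such as Wh per successful task; the candidate effect sizes and alpha are planning assumptions, not estimates from data.

```python
# Normal-approximation power for a two-sided, two-sample comparison.
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect_size, n_per_arm, alpha=0.05):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z = effect_size * sqrt(n_per_arm / 2) - z_alpha
    return NormalDist().cdf(z)

for d in (0.25, 0.30, 0.40):   # candidate standardized effect sizes
    print(d, round(two_sample_power(d, n_per_arm=160), 2))
```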
Replication Request
The email is ready; send it before we lock the final sampling design
The request is narrowly scoped: code repository, 100,000 classified public WildChat messages, taxonomy mappings, validation code, and non-sensitive cross-tabs. We are not asking for raw ChatGPT messages.
Request for replication materials for "How People Use ChatGPT" (NBER w34255)
Core ask: Please share the code repository, the classified 100,000 public WildChat messages, classifier prompts, taxonomy mappings, validation code, and any non-sensitive aggregate cross-tabs that help align public chatbot datasets with OpenAI Signals.