2025 re-run: 1,000 real chat episodes
Actual model choice changes the sustainability story.
The earlier 10,000-row pilot asked what level of AI could have handled each task. The 2025 re-run adds the missing denominator: the model the user actually used. In this sample, almost everyone used OpenAI mini-tier non-reasoning models, which are not the same thing as local small models. The key question is no longer only overuse of frontier reasoning; it is whether observed model choice is matched to task difficulty.
Validation sits inside the first page
The pilot label is not the proof.
- Gold labels: Two human annotators read the actual task and adjudicate the AI-use level.
- Run outputs: Search/tool, Gemma/Qwen-class small model, standard frontier, and reasoning paths answer matched cases.
- Blind quality: Only non-inferior answers count as lower-compute sufficient.
- Conservative rule: Uncertain or high-stakes tasks stay high-compute or move to expert review.
2025 Observational Re-Run
Compare what users used with what the task appeared to require
The new unit is an actual-vs-required pair. For each 2025 conversation, we keep the observed model name and classify the minimum sufficient route. This is closer to the paper's empirical object: mismatch between model intensity and task demand.
Estimated from the model the user actually used: almost entirely GPT-4.1-mini and GPT-4o-mini.
Estimated after routing each task to its Codex-assigned minimum sufficient tier.
Negative savings: in this 2025 sample, strict task matching would upgrade more work than it downroutes.
Actual model intensity is above the required tier, excluding expert/not-comparable cases.
Actual model intensity is below the required tier; these cases need output-quality validation.
Only 2 of 1,000 sampled 2025 conversations used o1-preview; o1-mini, had it appeared, would also count as a reasoning model.
Naming Rule
OpenAI mini-tier is not the paper's small-model tier.
Actual model family describes the product the user used: GPT-4o-mini and GPT-4.1-mini are OpenAI mini-tier non-reasoning models. Required tier describes the minimum sufficient execution path: T1 Small LLM means a smaller local/open model such as Gemma, Qwen, Phi, or Llama-class. o1, o1-mini, and o1-preview are reasoning models, not small-model routes.
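A minimal sketch of how the actual-vs-required pair can be scored. The name patterns and the intensity ordering are illustrative assumptions, not the study's rule; in particular, placing OpenAI mini-tier between T1 and T2 is a placeholder.

```python
# Illustrative scoring of one conversation's actual-vs-required pair.
REQUIRED_RANK = {          # required tier, least to most intensive
    "T0_no_llm": 0,
    "T1_small_llm": 1,
    "T2_standard_frontier": 3,
    "T3_long_context": 4,
    "T4_reasoning": 5,
    "T5_agent": 6,
}

def actual_rank(model_name: str) -> int:
    """Intensity of the product the user actually used (assumed ordering)."""
    name = model_name.lower()
    if name.startswith("o1"):          # o1, o1-mini, o1-preview: reasoning models
        return REQUIRED_RANK["T4_reasoning"]
    if "mini" in name:                 # GPT-4o-mini / GPT-4.1-mini: mini-tier,
        return 2                       # assumed above T1 small, below T2 frontier
    return REQUIRED_RANK["T2_standard_frontier"]   # GPT-4o and other standard frontier

def provisioning(actual_model: str, required_tier: str) -> str:
    """Label one conversation as over-, under-, or correctly provisioned."""
    diff = actual_rank(actual_model) - REQUIRED_RANK[required_tier]
    if diff > 0:
        return "over_provisioned"      # downroute candidate
    if diff < 0:
        return "under_provisioned"     # upgrade candidate; needs output validation
    return "matched"
```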
2025 Full Distribution
Reasoning appears only in the January o1 rows in this public 2025 data
Across 863,322 unique 2025 WildChat conversations available locally, the observed date range is 2025-01-01 to 2025-07-31. The last available month is July 2025, and a 1,000-row July sample is entirely GPT-4.1-mini. This means the natural 2025 sample is a poor test bed for mass reasoning overuse, but a useful test bed for whether mini-tier answers are already adequate or sometimes under-powered.
94.2% mini-tier, 3.5% GPT-4o, 2.4% o1 reasoning.
100% GPT-4o-mini.
100% GPT-4o-mini.
Switch month: 89.0% GPT-4.1-mini, 11.0% GPT-4o-mini.
100% GPT-4.1-mini.
100% GPT-4.1-mini.
100% GPT-4.1-mini; July random sample n=1,000.
Interpretation Change
For 2025 WildChat, the immediate problem is not mass reasoning use.
The earlier 10,000-row stratified panel is still useful for stress-testing reasoning and frontier overuse. The 2025 natural sample says something different: observed users are already mostly on OpenAI mini-tier non-reasoning models. The next empirical question is whether those answers are good enough, whether some should be upgraded, and which tasks can move further down to search, local tools, or true small local models.
Earlier Stratified Pilot
The 10,000-row run is now a stress-test panel, not the main 2025 estimate
This older panel deliberately mixed model-proportional rows with reasoning, multi-turn, non-English, and edge-case strata. It is useful for seeing the full route taxonomy and stress-testing reasoning-heavy replay, but it should not be read as the natural 2025 model-use distribution.
WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
What the user is really trying to accomplish, not just the words in the prompt.
Ask what level of AI would have been enough if the user had chosen deliberately.
Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
Example Display
Prompt examples are original excerpts.
WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.
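A small sketch of that display rule; only the 1,500-character cutoff comes from the text, and the field names are illustrative.

```python
CLIP_CHARS = 1500  # display cutoff described above

def example_card(episode_id: str, prompt: str, language: str) -> dict:
    """Build an example card: clip long prompts, keep the raw length and the id."""
    return {
        "episode_id": episode_id,            # points back to the full local record
        "language": language,                # original language and wording kept
        "prompt_excerpt": prompt[:CLIP_CHARS],
        "prompt_chars": len(prompt),         # raw prompt length shown alongside
        "clipped": len(prompt) > CLIP_CHARS,
    }
```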
Core Message
The problem is not using AI. The problem is using too much AI for the task.
In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: AI is an intensity choice, and users need information that helps them conserve high-compute AI when it is not needed.
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
Counterfactual baseline: every task is answered through a standard frontier path.
Stress baseline: every task is pushed through reasoning/test-time compute.
Careful-use benchmark: no-LLM, small LLM, standard frontier, long-context, reasoning, agent, or exclusion.
2,838.6 Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.
87,424.7 Wh saved when ordinary tasks avoid reasoning/test-time compute.
3,537 / 10,000 tasks need language capability, but not frontier execution.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?
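Written out, the two narrow questions reduce to one avoided-share ratio. The Wh totals below are the pilot values reported in the Energy Multipliers section.

```python
def avoided_share(baseline_wh: float, matched_wh: float) -> float:
    """Share of baseline cloud inference energy avoided by task matching."""
    return (baseline_wh - matched_wh) / baseline_wh

# Pilot totals from this page (10,000 rows):
standard_share = avoided_share(28_259.3, 25_420.7)    # ~0.100 vs. all-standard-frontier
reasoning_share = avoided_share(115_000.2, 27_575.5)  # ~0.760 vs. all-reasoning-frontier
```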
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed as base visible inference, multiple model responses, reasoning add-on, search add-on, and tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.
(28,259.3 Wh standard replay - 25,420.7 Wh task-matched execution) / 28,259.3 Wh = 10.0%.
(115,000.2 Wh reasoning replay - 27,575.5 Wh task-matched execution) / 115,000.2 Wh = 76.0%.
Uses EPA 0.394 kgCO2/kWh average electricity factor.
Heavy frontier reasoning replay minus task-matched execution.
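A minimal sketch of the additive per-row model under the central-scenario anchors from the Method section (0.85 Wh frontier base, 6.5 Wh reasoning response, 0.30 Wh per search, 0.04 Wh per 1,000 small-model tokens). The per-tool-call value and the exact combination rule for reasoning turns are assumptions, not values from this page.

```python
# Central-scenario anchors from the Method section; TOOL_CALL_WH is a placeholder.
FRONTIER_BASE_WH = 0.85       # one standard frontier work unit
REASONING_ADDON_WH = 6.5      # one reasoning/test-time-compute response
SEARCH_WH = 0.30              # per search query
SMALL_WH_PER_1K_TOK = 0.04    # Gemma-class small model, per 1,000 visible tokens
TOOL_CALL_WH = 0.10           # placeholder; not anchored on this page

def row_energy_wh(n_responses: int = 1, reasoning_responses: int = 0,
                  searches: int = 0, tool_calls: int = 0) -> float:
    """Energy for one chat row under the additive frontier accounting."""
    return (n_responses * FRONTIER_BASE_WH
            + reasoning_responses * REASONING_ADDON_WH
            + searches * SEARCH_WH
            + tool_calls * TOOL_CALL_WH)

def small_model_energy_wh(visible_tokens: int) -> float:
    """Energy for the same row routed to a T1 small local model."""
    return visible_tokens / 1000 * SMALL_WH_PER_1K_TOK

# Example: an ordinary two-response chat pushed through reasoning vs. matched to T1.
heavy = row_energy_wh(n_responses=2, reasoning_responses=2)   # 2*0.85 + 2*6.5 = 14.7 Wh
light = small_model_energy_wh(visible_tokens=1200)            # 0.048 Wh
```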
Interpretation
Search is one lower-intensity option, not the whole story.
The biggest sustainability mistake is not that a frontier model answers every task; it is that reasoning compute is spent on ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.
Scale-Up
We need two benchmarks, not one headline number
The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, careful use saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, careful use saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.
Approximate 2026 world population.
OpenAI public usage anchor.
Lower end is standard-frontier replay; upper end is reasoning-frontier replay.
EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.
All standard frontier
2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.
All reasoning frontier
11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.
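A worked sketch of the scale-up arithmetic using only numbers quoted on this page: the per-row savings bands, the EPA grid factor, and the forest-storage equivalency. It scales per million chats rather than assuming a messages-per-user rate.

```python
GRID_KG_PER_KWH = 0.394          # EPA U.S. average electricity factor
FOREST_T_PER_ACRE_YR = 0.77      # tCO2 stored per acre of U.S. forest per year

def per_million_chats(saving_wh_per_row: float) -> dict:
    """Scale a per-row Wh saving to one million chats."""
    kwh = saving_wh_per_row * 1_000_000 / 1_000
    t_co2 = kwh * GRID_KG_PER_KWH / 1_000
    return {"MWh": kwh / 1_000,
            "tCO2": t_co2,
            "forest_acre_years": t_co2 / FOREST_T_PER_ACRE_YR}

standard_band  = [per_million_chats(x) for x in (0.28, 0.68)]    # vs. all-standard replay
reasoning_band = [per_million_chats(x) for x in (8.74, 13.20)]   # vs. all-reasoning replay
```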
Research Rule
Identify the task first; decompose it second; decide whether reasoning is justified third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.
115,000.2 Wh across 10,000 pilot rows.
27,575.5 Wh after matching each task to its minimum sufficient execution tier.
Pilot-scale value; platform-scale value is shown in the scale-up section.
Appendix
Data coverage and frontier coefficient derivation
The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.
Public standard-query anchor from Epoch/OpenAI-era estimates.
0.34 Wh × 2.5 active-compute multiplier.
Central estimate when reasoning/test-time compute is invoked.
Heavy standard base / reasoning total for Pro-style paths.
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- EPA equivalencies: 0.77 metric ton CO2 per acre of U.S. forest storing carbon for one year
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Google Gemma 4: E2B/E4B small open models and 26B MoE active 3.8B parameters
- Google AI for Developers: Gemma 4 model sizes, context windows, and memory requirements
- FrugalGPT: cascaded model selection for lower cost
- RouteLLM: routing queries between cheaper and stronger LLMs
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M
Experimental study
The experiment must separate information from authority and defaults.
The referee risk is clear: a user may follow a recommendation because it looks authoritative, not because they learned the energy cost. The revised design uses five arms to isolate energy information, suggested AI intensity, and default preselection while preserving user override.
Low-risk rewrite. A small model is likely sufficient; frontier reasoning is unlikely to improve the result enough to justify the extra energy.
Definition
What "recommendation" means in this study
A recommendation is a visible, optional task-specific suggestion shown to the user before they choose a model. It is not hidden platform routing and it is not a command. It names the task type, proposes the least validated AI intensity likely to work, gives a short reason, and lets the user override.
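One way to represent that object, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    """A visible, optional, task-specific suggestion shown before model choice."""
    task_type: str          # e.g. "lookup", "rewrite", "calculation"
    suggested_tier: str     # least validated AI intensity likely to work
    reason: str             # the short explanation the user actually sees
    confidence: float       # drives whether a default may be preselected
    override_allowed: bool = True   # it is a suggestion, never a command
```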
Design
One experiment, five interface conditions
Each condition changes the choice architecture, not the underlying task. This separates belief updating from authority cues and default effects.
No energy label, no recommended alternative.
Users see the footprint but must decide what to do.
Users see the suggested AI intensity but no energy/cost numbers.
Users see both the footprint and the task-matched AI-use suggestion.
The lower-intensity option is selected when confidence is high; override remains explicit.
Measurement
The outcome is a frontier, not one number
Validation In The Experiment
Every claimed saving has to survive output checks
A lower-compute recommendation counts only when it produces a usable result. Objective tasks use tests or exact checks. Writing uses blind preference. Factual tasks require source support. High-stakes or uncertain tasks are abstentions, not savings. The experiment also includes falsification tasks where high-intensity AI is demonstrably better; a good intervention should not reduce frontier/reasoning use there.
First-pass research story
A user sees one button. The system sees many possible energy paths.
The paper should begin from that mismatch. Modern AI products make model choice feel frictionless, but under the surface they can invoke search, small language models, frontier inference, long context, test-time reasoning, or tool loops. The experiment asks whether users will choose differently when the path becomes visible.
Research Flow
From an observed mismatch to a behavioral experiment
The pilot shows what the world looks like when we read real chats as tasks. The paper becomes publishable when it proves two more things: lower-compute answers can be good enough, and users actually change choices when the interface gives them a usable alternative.
Observed task prevalence
Public WildChat episodes plus participant task logs. Estimate which real tasks are no-LLM, small-model, standard frontier, long-context, reasoning, agentic, or expert-only.
Output: frontier-avoidable task share with confidence intervals.
User perception and model choice
Survey and vignette experiment measuring whether users understand cost, energy, reasoning, long-context, agent loops, and the quality tradeoffs of small/open models.
Output: frontier preference scale and immediate recommendation effect.
Field intervention
A right-sizing interface compares normal use against energy labels plus task-matched recommendations and an easy lower-compute option with override.
Output: behavior, quality, latency, rework, and satisfaction effects.
What Makes The Claim Strong
Three things cannot be blurred together
Task share is not energy share
A large fraction of simple tasks does not automatically imply a large energy saving. Savings depend on the baseline: standard frontier replay, reasoning replay, long-context replay, or agentic replay.
Local is not automatically green
Local inference must be measured. Low-throughput personal hardware can be worse than optimized cloud serving. The correct claim is least sufficient execution under measured conditions.
Awareness is not behavior change
Showing watts alone is weak. The intervention must pair resource feedback with a recommended alternative and a one-click path that preserves task quality.
Questions
The paper should answer four questions in order
From labels to proof
A cheaper option only counts if the answer still works.
The pilot gives a map. The paper has to test the map. For a subset of tasks, we should actually run the no-LLM option, the small-model option, the standard frontier option, and the reasoning option, then evaluate which answers survive blind quality checks.
Annotation Upgrade
Stop asking annotators to jump straight to a tier
A tier by itself is too compressed. The adjudication sheet should expose the reasons: freshness, context, risk, deterministic solvability, language work, reasoning depth, and tool need. The final recommendation should fall out of those facts.
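A minimal sketch of a recommendation that falls out of the adjudicated facts; the guard conditions, thresholds, and ordering are illustrative, not the study's adjudication rule.

```python
def recommend_tier(facts: dict) -> str:
    """Derive the recommended tier from adjudication-sheet facts (illustrative)."""
    if facts["high_stakes"] or facts["uncertain"]:
        return "expert_review"            # conservative rule: never downroute
    if facts["deterministically_solvable"] or facts["needs_freshness_only"]:
        return "T0_no_llm"                # search, calculator, local software
    if facts["needs_tools"]:
        return "T5_agent"
    if facts["reasoning_depth"] == "deep":
        return "T4_reasoning"
    if facts["long_context"]:
        return "T3_long_context"
    if facts["language_work"] and not facts["frontier_quality_needed"]:
        return "T1_small_llm"
    return "T2_standard_frontier"
```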
Output Validation
How we prove lower compute is sufficient
Candidate options
No-LLM tool/search, local or small open model, standard frontier, frontier + reasoning, and agentic AI.
Generate answers
Run a stratified subset through multiple options with logged tokens, latency, cost, and estimated or measured Wh.
Blind evaluation
Use tests for objective tasks, pairwise preference for writing, factual checks for lookup, and expert exclusion for high-risk tasks.
Non-inferiority
Declare a task frontier-avoidable only if lower compute preserves usefulness, correctness, and rework within the pre-registered margin.
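A hedged sketch of one possible non-inferiority rule for blind pairwise preference data; the 10-point margin and the normal approximation are illustrative choices, not the pre-registered test.

```python
import math

def non_inferior(wins_low: int, wins_frontier: int, ties: int,
                 margin: float = 0.10, z: float = 1.96) -> bool:
    """Is the lower-compute answer not worse than frontier by more than the margin?
    Ties are split; preference share uses a normal-approximation lower bound."""
    n = wins_low + wins_frontier + ties
    if n == 0:
        return False
    p_low = (wins_low + 0.5 * ties) / n          # preference share for lower compute
    se = math.sqrt(p_low * (1 - p_low) / n)
    lower_bound = p_low - z * se
    return lower_bound >= 0.5 - margin           # not worse than parity by > margin
```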
Recommendation Metrics
The recommendation rule should be evaluated like a safety-critical classifier
Field experiment
A carbon label alone is not enough. The alternative has to be one click away.
The experiment should not simply tell people that AI has a footprint. It should say: this looks like search, this looks like a small-model rewrite, this one needs reasoning, and this one should not be assigned lower-intensity AI. Then we measure whether people accept, override, and still get the job done.
Experimental Arms
From resource awareness to actionable right-sizing
Control
Participants use their normal LLM workflow. We log task type, chosen model, tokens, time, and satisfaction.
Energy label only
Participants see estimated Wh, CO2e, dollar cost, and latency, but no recommended alternative.
Label + recommendation
The interface recommends search, local tool, small model, standard frontier, reasoning, agent, or abstain.
Default right-sized option
The lower-compute recommendation is preselected when confidence is high; users retain explicit override.
Outcome Dashboard
The frontier is time, quality, and carbon together
Behavioral Logic
Why recommendations may work when labels alone do not
User submits or describes the intended task.
The system estimates risk, freshness, reasoning, context, and tool need.
The interface shows an AI-use level and the reason for it.
The override becomes a revealed-preference measure.
Savings count only if the task is successful.
First pass manuscript
The Electricity Cost of Everyday AI
The paper starts from the pilot finding, not from the benchmark machinery: in 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. The sustainability question is why everyday users still reach for high-compute AI when a lower-energy option would work.
Major Framing
What we want the paper to do
The strongest version of the paper leads with empirical magnitudes. Some observed AI use is strict overuse: search, calculation, local software, or a specialized tool would have been enough. A larger share needs language capability but not frontier AI. The behavioral question is whether users know this energy gradient and whether information makes them choose differently.
Strict overuse
6.4% of pilot conversations appear answerable without an LLM: search, tools, or local software.
Simpler model enough
35.4% need language capability but appear small-model sufficient, not frontier necessary.
Reasoning overuse
If ordinary tasks default to reasoning, many become 2-4x or much more energy-intensive.
Subject response
Subjects see the task type and energy/cost comparison, then choose whether to use a lower-intensity option.
Abstract And Literature
The abstract should make one clean move: sustainable AI use is an electricity problem
Existing energy work shows that inference cost varies with model size, token length, reasoning, and serving efficiency. Our abstract should lead with what the data show: the share of strict overuse, the share that can use simpler models, the electricity multiples from reasoning overuse, and then the subject experiment.
Generative AI is turning electricity-intensive computation into an everyday consumer habit. In 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. If these ordinary tasks are handled by reasoning-heavy AI, many become 2-4x or much more energy-intensive. We then test whether subjects understand this energy ladder and whether a simple task-specific energy and cost intervention shifts them toward lower-intensity AI use.
Possible Results
The results should be stated as tradeoffs, not slogans
The paper should make the two behavioral results front and center, then connect them to the two benchmark worlds. If all tasks were standard frontier, savings are modest. If ordinary tasks default to reasoning, savings are large. The information intervention tells us whether users can be guided toward careful AI use.
Subjects do not correctly perceive the resource gradient between search, small models, frontier, reasoning, and agents.
A simple task-specific recommendation changes model choice toward the right level of AI intensity.
Avoiding standard-frontier overuse yields modest savings; avoiding reasoning-frontier overuse yields much larger savings.
The welfare claim survives only if lower-compute options preserve task success and avoid rework.
Research roadmap
The referee response is clear: finish the empirical core.
The question is strong, but the current draft reads like a research design memo. The next version needs two audited empirical modules: computational validation of task sufficiency, and a randomized experiment that separates information from recommendation authority and defaults.
Referee Diagnosis
The revised paper needs two completed studies, not better prose
The referee consensus is not that the framing is weak. It is that the causal and measurement claims are premature. The plan below converts each critique into a concrete empirical requirement.
Self-Critique And ROI
The next dollar should buy validation, not more labels
The current 10,000-row result is a strong pilot, but it is still classifier evidence. The highest-ROI next step is to prove that the lower-energy option actually works: no-LLM cases must be executable without an LLM, small-model cases must survive output validation, and standard-frontier cases must be shown not to need reasoning. Expanding to more raw chats before this validation would increase precision around an unverified construct.
In the central 10,000-row accounting, task matching saves 2.84 kWh versus all-standard-frontier use and 87.42 kWh versus all-reasoning-frontier use. Per one million chats, that scales to 0.28 MWh and 8.74 MWh. The best research ROI is therefore not another larger pilot. It is validating the two margins that create the headline: T1 small-model sufficiency and T2 reasoning avoidance.
Computational Plan
Prove which lower-energy answers actually work
This module turns the 10,000-row classifier pilot into auditable evidence. The unit is no longer a label; it is a task with generated alternatives, measured energy, quality ratings, and rework.
Gold-label 1,500 tasks
Stratify by tier and uncertainty: 250 T0, 500 T1, 350 T2, 200 T4, 100 T5/T6, and 100 low-confidence or disagreement-prone cases. Two annotators label task, risk, no-LLM feasibility, small-model sufficiency, frontier need, and reasoning need; disagreements are adjudicated.
Output-validate 600-800 tasks
Run no-LLM paths where applicable, Gemma-class local small models through Ollama for T1, standard frontier for T2, and reasoning frontier for T4. Use blind preference for writing, exact checks for objective tasks, source checks for lookup, and rework as a penalty.
Measure energy per successful task
For small-model candidates, record idle-subtracted Wh, latency, tokens/sec, and failure/retry rate on the actual host machine. Compare against standard-frontier and reasoning-frontier accounting as Wh per successful task, not Wh per request.
Recompute the headline with uncertainty
Report frontier-avoidable share with confidence intervals and false-positive adjustment. The abstract should use validated rates, not raw classifier rates, once this module is complete.
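A sketch of how steps two and three could be instrumented, assuming the ollama Python client, a Gemma-class local model tag, and an external wall-power reading; the model tag, wattages, and field names are placeholders, not the study's instrumentation.

```python
import time
import ollama  # pip install ollama; requires a local Ollama server

IDLE_WATTS = 8.0  # placeholder idle draw of the host, measured separately

def run_small_model(prompt: str, model: str = "gemma2:2b",
                    avg_active_watts: float = 35.0) -> dict:
    """One T1 generation with tokens, latency, and idle-subtracted Wh."""
    t0 = time.time()
    resp = ollama.generate(model=model, prompt=prompt)
    seconds = time.time() - t0
    active_wh = (avg_active_watts - IDLE_WATTS) * seconds / 3600
    return {
        "answer": resp["response"],
        "output_tokens": resp.get("eval_count"),
        "latency_s": seconds,
        "wh": active_wh,
    }

def wh_per_successful_task(runs: list[dict], successes: list[bool]) -> float:
    """Charge energy to successes only: Wh per successful task, not per request."""
    total_wh = sum(r["wh"] for r in runs)
    n_success = sum(successes)
    return total_wh / n_success if n_success else float("inf")
```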
Experimental Plan
Test whether information changes choices without degrading quality
This module should be pre-registered after computational validation. Its job is not to prove small models work; that is the computational module. Its job is to test whether users understand and act on the energy ladder when the lower-intensity option has already been validated.
Five arms
Randomize subjects to control, energy label only, recommendation only, label plus recommendation, and lower-intensity default with override. This separates information from authority and default effects.
High-quality subject pool
Recruit a documented subject pool, target roughly 800 subjects with 160 per arm, and record AI experience, baseline energy knowledge, domain familiarity, and environmental attitudes.
Two primary outcomes
Primary outcomes are unnecessary high-compute choice and Wh per successful task. Secondary outcomes are model choice, reasoning share, override, time, satisfaction, perceived quality, tokens, latency, and rework with multiple-testing correction.
Include high-intensity-needed tasks
The intervention should lower high-compute use on validated-equivalent tasks, but not on tasks where frontier or reasoning outputs are demonstrably better. This separates conservation from anti-compute nudging.
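A small planning sketch for the five-arm assignment and a power sanity check; n = 160 per arm comes from the plan above, while the Cohen's d = 0.3 effect size is a placeholder assumption (uses statsmodels).

```python
import random
from statsmodels.stats.power import TTestIndPower

ARMS = ["control", "label_only", "recommendation_only",
        "label_plus_recommendation", "default_with_override"]

def assign(subject_ids: list[str], seed: int = 7) -> dict:
    """Balanced random assignment across the five arms."""
    rng = random.Random(seed)
    ids = subject_ids[:]
    rng.shuffle(ids)
    return {sid: ARMS[i % len(ARMS)] for i, sid in enumerate(ids)}

# Power for one pairwise arm contrast at n = 160 per arm (assumed d = 0.3).
power = TTestIndPower().power(effect_size=0.3, nobs1=160, alpha=0.05, ratio=1.0)
```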