10,000 real chat episodes
The same AI box now handles lookup, writing, code, reasoning, and tools.
That is the sustainability problem. People are not choosing "compute"; they are asking everyday questions. We take 10,000 WildChat-4.8M episodes, recover what the user was trying to do, and ask a sharper question: what execution path was actually needed for that task?
Validation sits inside the first page
The pilot label is not the proof.
- Gold labels: Two human annotators read the actual task and adjudicate the AI-use level.
- Run outputs: Search/tool, small model, standard frontier, and reasoning paths answer matched cases.
- Blind quality: Only non-inferior answers count as lower-compute sufficient.
- Conservative rule: Uncertain or high-stakes tasks stay high-compute or move to expert review.
Observational Layer
The first pass turns chat logs into work episodes
The pilot is intentionally simple: one row, one observed episode, one best reading of the user's task. The useful part is not the label itself; it is the movement from a generic "LLM query" to a task that may need search, a deterministic tool, a small model, a standard frontier model, reasoning, an agent, or expert care.
WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
What the user is really trying to accomplish, not just the words in the prompt.
Ask what level of AI would have been enough if the user had chosen deliberately.
Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
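The four steps above can be sketched as a minimal data model. This is an illustrative layout only: the field and class names are assumptions for the sketch, not the WildChat schema or the pilot's actual annotation code.

```python
from dataclasses import dataclass

# Hypothetical record for one WildChat episode, mirroring the fields
# listed above (prompt, answer, model name, timestamp, language, turns).
@dataclass
class Episode:
    episode_id: str
    prompt: str       # user prompt; may be long or multi-turn
    answer: str       # assistant answer
    model: str        # model name recorded in the log
    timestamp: str
    language: str
    turns: int

# The annotation output: recovered intent plus the minimum sufficient
# execution tier (search/tool, small model, frontier, reasoning, ...).
@dataclass
class TaskLabel:
    intent: str
    sufficient_tier: str

def label_episode(ep: Episode) -> TaskLabel:
    """Placeholder for the human adjudication step: recover what the
    user was trying to do, then ask what level of AI was enough."""
    raise NotImplementedError
```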
Example Display
Prompt examples are original excerpts.
WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.
Core Message
The problem is not using AI. The problem is using too much AI for the task.
In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: AI is an intensity choice, and users need information that helps them conserve high-compute AI when it is not needed.
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
Counterfactual baseline: every task is answered through a standard frontier path.
Stress baseline: every task is pushed through reasoning/test-time compute.
Careful-use benchmark: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion.
2,838.6Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.
87,424.7Wh saved when ordinary tasks avoid reasoning/test-time compute.
3,537 / 10,000 tasks need language capability, but not frontier execution.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?
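The savings share in both questions is the same ratio: avoidable cloud execution over baseline cloud execution. A minimal sketch, using the pilot totals reported on this page (Wh summed across the 10,000 rows):

```python
def savings_share(baseline_wh: float, matched_wh: float) -> float:
    """Fraction of baseline cloud inference energy avoided when every
    task runs at its minimum sufficient execution tier."""
    return (baseline_wh - matched_wh) / baseline_wh

# Standard-frontier replay vs task-matched execution:
standard = savings_share(28_259.3, 25_420.7)
# Reasoning replay vs task-matched execution:
reasoning = savings_share(115_000.2, 27_575.5)

print(f"{standard:.1%}")   # ~10.0%
print(f"{reasoning:.1%}")  # ~76.0%
```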
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed as base visible inference, multiple model responses, reasoning add-on, search add-on, and tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.
(28,259.3Wh standard replay - 25,420.7Wh task-matched execution) / 28,259.3Wh = 10.0%.
(115,000.2Wh reasoning replay - 27,575.5Wh task-matched execution) / 115,000.2Wh = 76.0%.
Uses EPA 0.394 kgCO2/kWh average electricity factor.
Heavy frontier reasoning replay minus task-matched execution.
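Converting avoided electricity into carbon is a straight multiplication by the EPA factors cited on this page (0.394 kgCO2/kWh grid average; 0.77 tCO2 per acre of U.S. forest per year). A sketch applied to the reasoning-replay saving:

```python
# EPA factors as stated on this page.
EPA_KG_CO2_PER_KWH = 0.394
FOREST_T_CO2_PER_ACRE_YEAR = 0.77

def wh_to_kg_co2(wh: float) -> float:
    """Cloud electricity (Wh) to kg CO2 via the U.S. average grid factor."""
    return (wh / 1000.0) * EPA_KG_CO2_PER_KWH

def kg_co2_to_forest_acre_years(kg_co2: float) -> float:
    """kg CO2 expressed as acre-years of U.S. forest carbon storage."""
    return (kg_co2 / 1000.0) / FOREST_T_CO2_PER_ACRE_YEAR

# Pilot-scale saving under the reasoning baseline: 87,424.7 Wh.
saved_kg = wh_to_kg_co2(87_424.7)
print(round(saved_kg, 2))  # ~34.45 kg CO2 across the 10,000 rows
```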
Interpretation
Search is one lower-intensity option, not the whole story.
The biggest sustainability mistake is not frontier-model use per se; it is routing ordinary tasks through reasoning compute. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.
Scale-Up
We need two benchmarks, not one headline number
The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, careful use saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, careful use saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.
Approximate 2026 world population.
OpenAI public usage anchor.
Lower end is standard-frontier replay; upper end is reasoning-frontier replay.
EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.
All standard frontier
2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.
All reasoning frontier
11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.
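The central per-row figures in both benchmark worlds follow directly from the pilot totals divided by the 10,000 rows. A quick arithmetic check:

```python
# Central-scenario pilot totals (Wh) over 10,000 rows, as reported above.
ROWS = 10_000

standard_baseline_per_row = 28_259.3 / ROWS    # ~2.83 Wh/row
reasoning_baseline_per_row = 115_000.2 / ROWS  # ~11.50 Wh/row

saved_standard_per_row = 2_838.6 / ROWS        # ~0.28 Wh/row
saved_reasoning_per_row = 87_424.7 / ROWS      # ~8.74 Wh/row
```

The upper ends of the quoted ranges (4.99 and 17.83 Wh/row; 0.68 and 13.20 Wh/row saved) come from the heavy sensitivity scenario, not from this central arithmetic.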
Research Rule
Identify the task first; decompose it second; decide whether reasoning is justified third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.
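The additive model can be written down directly with the central coefficients above. This is a sketch of the accounting identity, not the pilot's actual implementation; the function signature is an assumption for illustration.

```python
# Central-scenario coefficients, as stated in the Method section.
FRONTIER_BASE_WH = 0.85       # frontier base work unit per visible response
REASONING_RESPONSE_WH = 6.5   # add-on per reasoning response
SMALL_WH_PER_1K_TOKENS = 0.04 # Gemma-class small model, per 1,000 visible tokens
SEARCH_WH_PER_QUERY = 0.30    # add-on per search query
EPA_KG_CO2_PER_KWH = 0.394    # carbon = cloud electricity x grid factor

def task_energy_wh(responses: int = 1, reasoning_responses: int = 0,
                   searches: int = 0, tool_wh: float = 0.0) -> float:
    """Additive per-row energy: base inference + reasoning + search + tool."""
    return (responses * FRONTIER_BASE_WH
            + reasoning_responses * REASONING_RESPONSE_WH
            + searches * SEARCH_WH_PER_QUERY
            + tool_wh)

def carbon_kg(wh: float) -> float:
    return (wh / 1000.0) * EPA_KG_CO2_PER_KWH

# One frontier answer plus one reasoning pass and one search:
print(round(task_energy_wh(1, 1, 1), 2))  # 7.65 Wh
```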
115,000.2Wh across 10,000 pilot rows.
27,575.5Wh after matching each task to its minimum sufficient execution tier.
Pilot-scale value; platform-scale value is shown in the scale-up section.
Appendix
Data coverage and frontier coefficient derivation
The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.
Public standard-query anchor from Epoch/OpenAI-era estimates.
0.34Wh × 2.5 active-compute multiplier.
Central estimate when reasoning/test-time compute is invoked.
Heavy standard base / reasoning total for Pro-style paths.
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- EPA equivalencies: 0.77 metric ton CO2 per acre of U.S. forest storing carbon for one year
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Google Gemma 4: E2B/E4B small open models and a 26B MoE model with 3.8B active parameters
- Google AI for Developers: Gemma 4 model sizes, context windows, and memory requirements
- FrugalGPT: cascaded model selection for lower cost
- RouteLLM: routing queries between cheaper and stronger LLMs
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M
Experimental study
The experiment tests whether people use AI more carefully when the cost gradient is visible.
The study puts users at the moment where over-compute happens: they have a task, a powerful default model, and incomplete knowledge of cheaper alternatives. We then show the task type, the recommended AI intensity, and the resource comparison, and measure whether the user conserves high-compute AI without losing quality.
Low-risk writing task. Frontier reasoning is unlikely to improve the outcome enough to justify the extra compute.
Design
One experiment, four interface conditions
The intervention and the experimental design should be shown together. Each condition changes the choice architecture, not the underlying user task.
No energy label, no recommended alternative.
Users see the footprint but must decide what to do.
The interface suggests search, local tool, small model, frontier, reasoning, agent, or expert review.
The lower-intensity option is selected when confidence is high; override remains explicit.
Measurement
The outcome is a frontier, not one number
Validation In The Experiment
Every claimed saving has to survive output checks
A lower-compute recommendation counts only when it produces a usable result. Objective tasks use tests or exact checks. Writing uses blind preference. Factual tasks require source support. High-stakes or uncertain tasks are abstentions, not savings.
First-pass research story
A user sees one button. The system sees many possible energy paths.
The paper should begin from that mismatch. Modern AI products make model choice feel frictionless, but under the surface they can invoke search, small language models, frontier inference, long context, test-time reasoning, or tool loops. The experiment asks whether users will choose differently when the path becomes visible.
Research Flow
From an observed mismatch to a behavioral experiment
The pilot shows what the world looks like when we read real chats as tasks. The paper becomes publishable when it proves two more things: lower-compute answers can be good enough, and users actually change choices when the interface gives them a usable alternative.
Observed task prevalence
Public WildChat episodes plus participant task logs. Estimate which real tasks are no-LLM, small-model, standard frontier, long-context, reasoning, agentic, or expert-only.
Output: frontier-avoidable task share with confidence intervals.
User perception and model choice
Survey and vignette experiment measuring whether users understand cost, energy, reasoning, long-context, agent loops, and the quality tradeoffs of small/open models.
Output: frontier preference scale and immediate recommendation effect.
Field intervention
A right-sizing interface compares normal use against energy labels plus task-matched recommendations and an easy lower-compute option with override.
Output: behavior, quality, latency, rework, and satisfaction effects.
What Makes The Claim Strong
Three things cannot be blurred together
Task share is not energy share
A large fraction of simple tasks does not automatically imply a large energy saving. Savings depend on the baseline: standard frontier replay, reasoning replay, long-context replay, or agentic replay.
Local is not automatically green
Local inference must be measured. Low-throughput personal hardware can be worse than optimized cloud serving. The correct claim is least sufficient execution under measured conditions.
Awareness is not behavior change
Showing watts alone is weak. The intervention must pair resource feedback with a recommended alternative and a one-click path that preserves task quality.
Questions
The paper should answer four questions in order
From labels to proof
A cheaper option only counts if the answer still works.
The pilot gives a map. The paper has to test the map. For a subset of tasks, we should actually run the no-LLM option, the small-model option, the standard frontier option, and the reasoning option, then evaluate which answers survive blind quality checks.
Annotation Upgrade
Stop asking annotators to jump straight to a tier
A tier by itself is too compressed. The adjudication sheet should expose the reasons: freshness, context, risk, deterministic solvability, language work, reasoning depth, and tool need. The final recommendation should fall out of those facts.
Output Validation
How we prove lower compute is sufficient
Candidate options
No-LLM tool/search, local or small open model, standard frontier, frontier + reasoning, and agentic AI.
Generate answers
Run a stratified subset through multiple options with logged tokens, latency, cost, and estimated or measured Wh.
Blind evaluation
Use tests for objective tasks, pairwise preference for writing, factual checks for lookup, and expert exclusion for high-risk tasks.
Non-inferiority
Declare a task frontier-avoidable only if lower compute preserves usefulness, correctness, and rework within the pre-registered margin.
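The non-inferiority step can be sketched as a simple decision rule. The margin value and argument names below are illustrative placeholders, not the pre-registered margin or the study's scoring code; the point is only that both quality and rework must stay within the margin for a saving to count.

```python
def frontier_avoidable(quality_low: float, quality_frontier: float,
                       rework_low: float, rework_frontier: float,
                       margin: float = 0.05) -> bool:
    """A task counts as frontier-avoidable only if the lower-compute
    answer is non-inferior on quality AND does not add rework beyond
    the margin. Scores are assumed to be on a common 0-1 scale."""
    quality_ok = quality_low >= quality_frontier - margin
    rework_ok = rework_low <= rework_frontier + margin
    return quality_ok and rework_ok

# Slightly worse quality and slightly more rework, both within margin:
print(frontier_avoidable(0.88, 0.90, 0.10, 0.08))  # True
```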
Recommendation Metrics
The recommendation rule should be evaluated like a safety-critical classifier
Field experiment
A carbon label alone is not enough. The alternative has to be one click away.
The experiment should not simply tell people that AI has a footprint. It should say: this looks like search, this looks like a small-model rewrite, this one needs reasoning, and this one should not be assigned lower-intensity AI. Then we measure whether people accept, override, and still get the job done.
Experimental Arms
From resource awareness to actionable right-sizing
Control
Participants use their normal LLM workflow. We log task type, chosen model, tokens, time, and satisfaction.
Energy label only
Participants see estimated Wh, CO2e, dollar cost, and latency, but no recommended alternative.
Label + recommendation
The interface recommends search, local tool, small model, standard frontier, reasoning, agent, or abstain.
Default right-sized option
The lower-compute recommendation is preselected when confidence is high; users retain explicit override.
Outcome Dashboard
The frontier is time, quality, and carbon together
Behavioral Logic
Why recommendations may work when labels alone do not
User submits or describes the intended task.
The system estimates risk, freshness, reasoning, context, and tool need.
The interface shows an AI-use level and the reason for it.
The override becomes a revealed-preference measure.
Savings count only if the task is successful.
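The flow above amounts to a routing rule: estimate the task facets, then map them to an AI-use level, with high-stakes tasks never down-routed. The thresholds, facet names, and tier labels below are assumptions for the sketch, not the deployed recommender.

```python
def recommend(risk: str, needs_freshness: bool, deterministic: bool,
              needs_reasoning: bool, needs_tools: bool) -> str:
    """Illustrative mapping from estimated task facets to an AI-use level."""
    if risk == "high":
        # Conservative rule: uncertain or high-stakes tasks are abstentions,
        # not savings.
        return "expert review / abstain"
    if deterministic:
        return "no-LLM tool"       # calculator, local software, exact lookup
    if needs_freshness:
        return "search"
    if needs_tools:
        return "agent"
    if needs_reasoning:
        return "reasoning frontier"
    return "small model"           # language capability, no frontier needed

print(recommend("low", False, False, False, False))  # small model
```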
First pass manuscript
The Electricity Cost of Everyday AI
The paper starts from the pilot finding, not from the benchmark machinery: in 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. The sustainability question is why everyday users still reach for high-compute AI when a lower-energy option would work.
Major Framing
What we want the paper to do
The strongest version of the paper leads with empirical magnitudes. Some observed AI use is strict overuse: search, calculation, local software, or a specialized tool would have been enough. A larger share needs language capability but not frontier AI. The behavioral question is whether users know this energy gradient and whether information makes them choose differently.
Strict overuse
6.4% of pilot conversations appear answerable without an LLM: search, tools, or local software.
Simpler model enough
35.4% need language capability but appear small-model sufficient, not frontier necessary.
Reasoning overuse
If ordinary tasks default to reasoning, many become 2-4x or much more energy-intensive.
Subject response
Subjects see the task type and energy/cost comparison, then choose whether to use a lower-intensity option.
Abstract And Literature
The abstract should make one clean move: sustainable AI use is an electricity problem
Existing energy work shows that inference cost varies with model size, token length, reasoning, and serving efficiency. Our abstract should lead with what the data show: the share of strict overuse, the share that can use simpler models, the electricity multiples from reasoning overuse, and then the subject experiment.
Generative AI is turning electricity-intensive computation into an everyday consumer habit. In 10,000 real chat episodes, 6.4% appear to need no LLM and another 35.4% appear small-model sufficient. If these ordinary tasks are handled by reasoning-heavy AI, many become 2-4x or much more energy-intensive. We then test whether subjects understand this energy ladder and whether a simple task-specific energy and cost intervention shifts them toward lower-intensity AI use.
Possible Results
The results should be stated as tradeoffs, not slogans
The paper should make the two behavioral results front and center, then connect them to the two benchmark worlds. If all tasks were standard frontier, savings are modest. If ordinary tasks default to reasoning, savings are large. The information intervention tells us whether users can be guided toward careful AI use.
Subjects do not correctly perceive the resource gradient between search, small models, frontier, reasoning, and agents.
A simple task-specific recommendation changes model choice toward the right level of AI intensity.
Standard-frontier overuse yields modest savings; reasoning-frontier overuse yields much larger savings.
The welfare claim survives only if lower-compute options preserve task success and avoid rework.
Research roadmap
The next stage is validation, not more raw labels.
The pilot gives a strong direction, but the paper becomes publishable only when lower-energy sufficiency is verified. The next plan focuses on ROI: validate the margins that drive the headline, measure actual local energy, and then run the subject experiment using validated tasks.
Self-Critique And ROI
The next dollar should buy validation, not more labels
The current 10,000-row result is a strong pilot, but it is still classifier evidence. The highest-ROI next step is to prove that the lower-energy option actually works: no-LLM cases must be executable without an LLM, small-model cases must survive output validation, and standard-frontier cases must be shown not to need reasoning. Expanding to more raw chats before this validation would increase precision around an unverified construct.
In the central 10,000-row accounting, task matching saves 2.84 kWh versus all-standard-frontier use and 87.42 kWh versus all-reasoning-frontier use. Per one million chats, that scales to 0.28 MWh and 8.74 MWh. The best research ROI is therefore not another larger pilot. It is validating the two margins that create the headline: T1 small-model sufficiency and T2 reasoning avoidance.
Best Next Plan
Validate the conservation margin, then run the experiment
The plan below turns the pilot into publishable evidence. The sequence matters: first validate the task labels, then validate outputs and local energy, then test whether subjects change choices.
Gold-label 1,200 tasks
Stratify by tier and uncertainty: 250 T0, 400 T1, 300 T2, 150 T4, and 100 T5/T6/borderline cases. Two human annotators label actual task, risk, no-LLM feasibility, small-model sufficiency, frontier need, and reasoning need; disagreements are adjudicated.
Output-validate 600 tasks
Run no-LLM paths where applicable, Gemma-class local small models through Ollama for T1, standard frontier for T2, and reasoning frontier for T4. Use blind preference for writing, exact checks for objective tasks, source checks for lookup, and rework as a penalty.
Measure local and cloud energy
For small-model candidates, record idle-subtracted Wh, latency, tokens/sec, and failure/retry rate on the actual host machine. Compare against standard-frontier and reasoning-frontier accounting as Wh per successful task, not Wh per request.
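Idle-subtracted Wh per successful task can be computed as follows. The measurement source (wall-plug meter, RAPL, etc.) is left abstract, and the names and example numbers are illustrative; the key design choice is that retries are charged to the successes, so the unit is Wh per successful task, not Wh per request.

```python
def wh_per_successful_task(total_wh: float, idle_w: float,
                           duration_s: float, successes: int) -> float:
    """Idle-subtracted energy per successful task on the host machine.
    total_wh includes all attempts and retries over duration_s."""
    active_wh = total_wh - idle_w * duration_s / 3600.0  # subtract idle draw
    if successes == 0:
        return float("inf")  # all-failure runs make the option non-viable
    return active_wh / successes

# 40 Wh drawn over 600 s on a 60 W-idle host, 4 of 5 attempts succeed:
print(wh_per_successful_task(40.0, 60.0, 600.0, 4))  # 7.5
```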
Run the subject experiment
Use a high-quality subject pool and validated vignettes. Randomize control, energy label, label plus recommendation, and default lower-intensity option with override. Main outcomes are model choice, reasoning share, acceptance, task success, time, satisfaction, and rework.