10,000 real chat episodes
The same AI box now handles lookup, writing, code, reasoning, and tools.
That is the sustainability problem. People are not choosing "compute"; they are asking everyday questions. We take 10,000 WildChat-4.8M episodes, recover what the user was trying to do, and ask a sharper question: what execution path was actually needed for that task?
Validation sits inside the first page
The pilot label is not the proof.
- Gold labels: Two human annotators read the actual task and adjudicate the route.
- Run outputs: Search/tool, small-model, standard frontier, and reasoning paths answer matched cases.
- Blind quality: Only non-inferior answers count as lower-compute sufficient.
- Selective router: Uncertain or high-stakes tasks abstain instead of being down-routed.
Observational Layer
The first pass turns chat logs into work episodes
The pilot is intentionally simple: one row, one observed episode, one best reading of the user's task. The useful part is not the label itself; it is the movement from a generic "LLM query" to a task that may need search, a deterministic tool, a small model, a standard frontier model, reasoning, an agent, or expert care.
WildChat user prompt, assistant answer, model name, timestamp, language, and turn count.
What the user is really trying to accomplish, not just the words in the prompt.
Ask what kind of execution would have been enough if the user had been routed deliberately.
Compare standard frontier replay and reasoning replay against the minimum sufficient execution tier.
Example Display
Prompt examples are original excerpts.
WildChat episodes are conversation-level records, so one row can contain multiple user turns or a very long pasted prompt. The example cards keep the original language and wording, but clip long prompts to the first 1,500 characters for readability and show the raw prompt length. The episode id points back to the full local record.
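The clipping rule described above is mechanical and easy to reproduce; `clip_prompt` and `MAX_CHARS` are illustrative names, not the pilot's actual code:

```python
MAX_CHARS = 1_500  # readability limit for example cards

def clip_prompt(prompt: str) -> tuple[str, int]:
    """Return the display excerpt and the raw prompt length in characters.

    The excerpt keeps the original language and wording; only length is cut."""
    return prompt[:MAX_CHARS], len(prompt)
```

For a 4,000-character pasted prompt, the card shows the first 1,500 characters alongside the raw length of 4,000.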
Core Message
The waste is not "using AI." The waste is using the wrong AI path.
In this pilot, the savings are modest if the alternative is ordinary frontier inference for every task, but much larger if ordinary tasks are being pushed through reasoning-heavy paths. That is the paper's opening: the environmental stake is created by hidden mode choice, not by a simple local-versus-cloud slogan.
Counterfactual Accounting
Decompose first, then compare the same tasks two ways
Decomposition is necessary because a chat is not a single generic "AI request." One conversation may be simple lookup, another may be writing, another may be calculation, and another may need reasoning or tools. The accounting compares the same 10,000 actual tasks under two execution plans.
Counterfactual baseline: every task is answered through a standard frontier path.
Stress baseline: every task is pushed through reasoning/test-time compute.
Minimum sufficient tiers: no-LLM, small LLM, standard frontier, context, reasoning, agent, or exclusion.
2,838.6Wh saved; T4/T5 quality upgrades offset part of the T0/T1 savings.
87,424.7Wh saved when ordinary tasks avoid reasoning/test-time compute.
3,537 / 10,000 tasks need language capability, but not frontier execution.
Why This Equation
The numerator is avoidable cloud execution, not total AI energy.
The equation isolates the cloud inference energy that changes when the execution plan changes. It does not convert human time into carbon, and it does not count model training. It asks two narrow questions: what if every task uses standard frontier execution, and what if every task uses reasoning/test-time compute?
Replay Definition
This is not asking GPT-4 to answer again.
"Replay" means applying the same accounting model to the same observed WildChat tasks under two execution plans. The baselines replay standard frontier and reasoning-frontier execution. The alternative uses the task decomposition to choose the minimum sufficient tier. No GPT-4, GPT-4o, GPT-4.5, GPT-5.5, or Pro model was re-run on the 10,000 conversations for this result.
Energy Multipliers
Use one additive task-energy model before comparing alternatives
Every pilot row is decomposed into base visible inference, multiple model responses, a reasoning add-on, a search add-on, and a tool add-on. Then each actual task is compared with the lowest sufficient execution tier: no-LLM tool/search/API, small LLM, standard frontier, long-context frontier, reasoning frontier, LLM agent with tools, or expert/not comparable.
(28,259.3Wh standard replay - 25,420.7Wh task-matched execution) / 28,259.3Wh = 10.0%.
(115,000.2Wh reasoning replay - 27,575.5Wh task-matched execution) / 115,000.2Wh = 76.0%.
Uses EPA 0.394 kgCO2/kWh average electricity factor.
Heavy frontier reasoning replay minus task-matched execution.
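The two percentages above follow directly from the replay totals; a quick check (the totals are this page's own figures, the function name is illustrative):

```python
def savings_fraction(baseline_wh: float, matched_wh: float) -> float:
    """Share of baseline cloud-inference energy avoided by task matching."""
    return (baseline_wh - matched_wh) / baseline_wh

# Standard frontier replay vs. task-matched execution (10,000 pilot rows).
print(f"{savings_fraction(28_259.3, 25_420.7):.1%}")   # 10.0%
# Reasoning replay vs. task-matched execution.
print(f"{savings_fraction(115_000.2, 27_575.5):.1%}")  # 76.0%
```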
Interpretation
Search is one route, not the whole story.
The biggest sustainability mistake is not that a frontier model answers every task; it is that reasoning compute is spent on ordinary tasks. T4 keeps reasoning where it is justified. T1 shows the largest model-rightsizing opportunity: 35.4% of chats need language capability, but only a small model.
Scale-Up
We need two benchmarks, not one headline number
The pilot should be benchmarked twice. If every task is replayed through a standard frontier model, routing saves 0.28-0.68 Wh per row. If every task is replayed through a reasoning-frontier path, routing saves 8.74-13.20 Wh per row. The second number is not total LLM electricity; it is the avoided cost of reasoning overuse.
Approximate 2026 world population.
OpenAI public usage anchor.
Lower end is standard-frontier replay; upper end is reasoning-frontier replay.
EPA U.S. average electricity factor; forest storage uses 0.77 tCO2/acre/year.
All standard frontier
2.83-4.99 Wh/row baseline; task matching saves 10.0-13.7%.
All reasoning frontier
11.50-17.83 Wh/row baseline; task matching saves 74.0-76.0%.
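To make the two benchmark worlds concrete, the per-row savings ranges can be converted to avoided CO2 with the EPA factor cited on this page; the per-million-rows framing is an illustration, not a platform-scale claim:

```python
EPA_KG_CO2_PER_KWH = 0.394  # EPA eGRID U.S. average electricity factor

def co2_kg_avoided(savings_wh_per_row: float, n_rows: int = 1_000_000) -> float:
    """CO2 (kg) avoided by routing n_rows tasks, given per-row Wh savings."""
    kwh = savings_wh_per_row * n_rows / 1_000  # Wh -> kWh
    return kwh * EPA_KG_CO2_PER_KWH

co2_kg_avoided(0.28)   # standard-frontier baseline, lower end: ~110 kg per 1M rows
co2_kg_avoided(13.20)  # reasoning-frontier baseline, upper end: ~5,201 kg per 1M rows
```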
Research Rule
Identify the task first; decompose it second; decide whether reasoning is justified third.
Human time is not converted into carbon. This page estimates cloud energy from the task decomposition; the next research layer should plot the time-carbon frontier rather than collapse time into emissions.
Method
The task-energy model is additive
The central scenario uses a 0.85 Wh frontier base work unit and a 6.5 Wh reasoning response. T1 uses 0.04 Wh per 1,000 visible tokens for Gemma-class small models. Search adds 0.30 Wh per query. Carbon is cloud electricity multiplied by the EPA U.S. average grid factor.
115,000.2Wh across 10,000 pilot rows.
27,575.5Wh after matching each task to its minimum sufficient execution tier.
Pilot-scale value; platform-scale value is shown in the scale-up section.
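A minimal sketch of the additive model using the central coefficients stated above; the parameter names are illustrative, and the model is a calibrated counterfactual, not vendor telemetry:

```python
# Central-scenario coefficients (Wh), as stated in the method description.
FRONTIER_BASE_WH = 0.85        # one standard frontier work unit
REASONING_RESPONSE_WH = 6.5    # one reasoning/test-time-compute response
SMALL_PER_1K_TOKENS_WH = 0.04  # Gemma-class small model, per 1,000 visible tokens
SEARCH_QUERY_WH = 0.30         # one search query add-on

def row_energy_wh(frontier_responses=0, reasoning_responses=0,
                  small_visible_tokens=0, search_queries=0):
    """Additive task-energy model: each component contributes independently."""
    return (frontier_responses * FRONTIER_BASE_WH
            + reasoning_responses * REASONING_RESPONSE_WH
            + small_visible_tokens / 1_000 * SMALL_PER_1K_TOKENS_WH
            + search_queries * SEARCH_QUERY_WH)

row_energy_wh(reasoning_responses=1, search_queries=1)  # 6.8 Wh
row_energy_wh(small_visible_tokens=500)                 # 0.02 Wh: T1 rightsizing
```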
Appendix
Data coverage and frontier coefficient derivation
The main story uses frontier standard and frontier reasoning replay. The details below show where the pilot conversations came from and how the energy assumptions are anchored.
Public standard-query anchor from Epoch/OpenAI-era estimates.
0.34Wh × 2.5 active-compute multiplier = 0.85Wh frontier base work unit.
Central estimate when reasoning/test-time compute is invoked.
Heavy standard base / reasoning total for Pro-style paths.
Replay Assumption
Estimate active compute; do not pretend we have vendor telemetry.
GPT-5.5 is treated as a larger product surface than GPT-4o: 1,050,000-token context, 128,000 max output, reasoning-token support, and $5/$30 per million input/output tokens. The replay model uses active compute, token length, reasoning steps, retrieval, and tool loops. T1 uses Gemma-class small-model inference anchored to Gemma 4 E2B/E4B effective-parameter models. This is a calibrated counterfactual, not a measurement of any vendor's internal serving stack.
Why The Estimate Holds
The claim is comparative before it is absolute.
The exact Wh number can move with model architecture, batching, cache hits, quantization, and hardware. The ranking is more stable: longer generations use more compute than shorter ones; reasoning/test-time scaling uses more compute than ordinary answering; small sufficient models use less than frontier models; local deterministic tools avoid cloud inference. That is why the page reports central and heavy sensitivity scenarios rather than one alleged exact footprint.
Sources
Sources and anchors for the calculation
- de Vries, Joule 2023: 0.3 Wh search and up to 2.9 Wh LLM interaction
- Epoch AI: about 0.3 Wh for a typical GPT-4o-style query
- Oviedo et al., Joule 2026: 0.31 Wh frontier inference and order-of-magnitude higher long reasoning
- Oviedo et al. preprint: 0.34 Wh standard and 4.32 Wh test-time scaling scenario
- EPA eGRID: 0.394 kgCO2/kWh U.S. average electricity factor
- EPA equivalencies: 0.77 metric ton CO2 per acre of U.S. forest storing carbon for one year
- OpenAI API model page: GPT-5.5 pricing, context, and reasoning support
- OpenAI API model page: GPT-5.5 Pro pricing and long-running hard-task behavior
- OpenAI Help: GPT-5.5 Instant, Thinking, and Pro modes in ChatGPT
- OpenAI GPT-5.5 release and product notes
- OpenAI ChatGPT Pro: Pro/reasoning modes use more compute for harder problems
- OpenAI GPT-4.5: large compute-intensive model, not a GPT-4o replacement
- Google Gemma 4: E2B/E4B small open models and a 26B MoE with 3.8B active parameters
- Google AI for Developers: Gemma 4 model sizes, context windows, and memory requirements
- FrugalGPT: cascaded model selection for lower cost
- RouteLLM: routing queries between cheaper and stronger LLMs
- Worldometer / UN WPP 2024: 2026 world population around 8.3B
- OpenAI: ChatGPT serves more than 800M weekly users
- WildChat-4.8M
Experimental study
The experiment is not an awareness label. It is a routing choice.
The study should put users at the exact moment where over-compute happens: they have a task, a default model, and incomplete knowledge of the resource path. We then change what the interface makes visible and measure whether the user accepts a lower-compute route without losing quality or time.
Low-risk writing task. Frontier reasoning is unlikely to improve the outcome enough to justify the extra compute.
Design
One experiment, four interface conditions
The intervention and the experimental design should be shown together. Each condition changes the choice architecture, not the underlying user task.
No energy label, no recommended alternative.
Users see the footprint but must decide what to do.
The system suggests search, local tool, small model, frontier, reasoning, agent, or abstain.
Low-compute route is selected when confidence is high; override remains explicit.
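The fourth condition's gating logic can be sketched as a selective default; the threshold value, route names, and dataclass fields are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    route: str          # e.g. "search", "small_model", "frontier", "reasoning"
    confidence: float   # router confidence in the recommended route
    high_stakes: bool   # adjudicated task risk

def preselected_route(rec, threshold=0.9):
    """Default right-sized route: preselect the low-compute recommendation
    only when confidence is high and stakes are low; otherwise set no
    default, so uncertain or high-stakes tasks are never down-routed."""
    if rec.high_stakes or rec.confidence < threshold:
        return None  # user decides; the override path stays explicit
    return rec.route

preselected_route(Recommendation("small_model", 0.95, False))  # "small_model"
preselected_route(Recommendation("small_model", 0.60, False))  # None
```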
Measurement
The outcome is a frontier, not one number
Validation In The Experiment
Every claimed saving has to survive output checks
A lower-compute recommendation counts only when it produces a usable result. Objective tasks use tests or exact checks. Writing uses blind preference. Factual tasks require source support. High-stakes or uncertain tasks are abstentions, not savings.
First-pass research story
A user sees one button. The system sees many possible energy paths.
The paper should begin from that mismatch. Modern AI products make model choice feel frictionless, but under the surface they can invoke search, small language models, frontier inference, long context, test-time reasoning, or tool loops. The experiment asks whether users will choose differently when the path becomes visible.
Research Flow
From an observed mismatch to a behavioral experiment
The pilot shows what the world looks like when we read real chats as tasks. The paper becomes publishable when it proves two more things: lower-compute answers can be good enough, and users actually change choices when the interface gives them a usable alternative.
Observed task prevalence
Public WildChat episodes plus participant task logs. Estimate which real tasks are no-LLM, small-model, standard frontier, long-context, reasoning, agentic, or expert-only.
Output: frontier-avoidable task share with confidence intervals.
User perception and model choice
Survey and vignette experiment measuring whether users understand cost, energy, reasoning, long-context, agent loops, and the quality tradeoffs of small/open models.
Output: frontier preference scale and immediate recommendation effect.
Field intervention
A right-sizing interface compares normal use against energy labels plus task-matched recommendations and an easy lower-compute route with override.
Output: behavior, quality, latency, rework, and satisfaction effects.
What Makes The Claim Strong
Three things cannot be blurred together
Task share is not energy share
A large fraction of simple tasks does not automatically imply a large energy saving. Savings depend on the baseline: standard frontier replay, reasoning replay, long-context replay, or agentic replay.
Local is not automatically green
Local inference must be measured. Low-throughput personal hardware can be worse than optimized cloud serving. The correct claim is least sufficient execution under measured conditions.
Awareness is not behavior change
Showing watts alone is weak. The intervention must pair resource feedback with a recommended alternative and a one-click path that preserves task quality.
Questions
The paper should answer four questions in order
From labels to proof
A cheaper route only counts if the answer still works.
The pilot gives a map. The paper has to test the map. For a subset of tasks, we should actually run the no-LLM route, the small-model route, the standard frontier route, and the reasoning route, then evaluate which answers survive blind quality checks.
Annotation Upgrade
Stop asking annotators to jump straight to a tier
A tier by itself is too compressed. The adjudication sheet should expose the reasons: freshness, context, risk, deterministic solvability, language work, reasoning depth, and tool need. The final route should fall out of those facts.
Output Validation
How we prove lower compute is sufficient
Route candidates
No-LLM tool/search, local or small open model, standard frontier, frontier + reasoning, and agentic route.
Generate answers
Run a stratified subset through multiple routes with logged tokens, latency, cost, and estimated or measured Wh.
Blind evaluation
Use tests for objective tasks, pairwise preference for writing, factual checks for lookup, and expert exclusion for high-risk tasks.
Non-inferiority
Declare a task frontier-avoidable only if lower compute preserves usefulness, correctness, and rework within the pre-registered margin.
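The non-inferiority declaration in the last step can be written as a margin check; the margin value and the 0-1 score scale are placeholders for the pre-registered values:

```python
def frontier_avoidable(frontier_score, low_compute_score, margin=0.05):
    """A task counts as frontier-avoidable only if the lower-compute route
    scores within the pre-registered non-inferiority margin of the
    frontier route (scores on a 0-1 quality scale)."""
    return low_compute_score >= frontier_score - margin

frontier_avoidable(0.90, 0.88)  # True: quality loss within margin
frontier_avoidable(0.90, 0.80)  # False: the saving does not count
```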
Router Metrics
The router should be evaluated like a safety-critical classifier
Field experiment
A carbon label alone is not enough. The alternative has to be one click away.
The experiment should not simply tell people that AI has a footprint. It should say: this looks like search, this looks like a small-model rewrite, this one needs reasoning, and this one should not be down-routed. Then we measure whether people accept, override, and still get the job done.
Experimental Arms
From resource awareness to actionable right-sizing
Control
Participants use their normal LLM workflow. We log task type, chosen model, tokens, time, and satisfaction.
Energy label only
Participants see estimated Wh, CO2e, dollar cost, and latency, but no recommended alternative.
Label + recommendation
The interface recommends search, local tool, small model, standard frontier, reasoning, agent, or abstain.
Default right-sized route
The lower-compute recommendation is preselected when confidence is high; users retain explicit override.
Outcome Dashboard
The frontier is time, quality, and carbon together
Behavioral Logic
Why recommendations may work when labels alone do not
User submits or describes the intended task.
The system estimates risk, freshness, reasoning, context, and tool need.
The interface shows a route and the reason for it.
The override becomes a revealed-preference measure.
Savings count only if the task is successful.
First pass manuscript
One Prompt, Many Costs
The version of the paper aimed at a top economics journal starts from a simple fact: a prompt is not a production function. The user sees one box, while the platform chooses among inputs with different social marginal costs. Our first results are simple: users do not know this margin, and information helps them choose better.
Major Framing
What we want the paper to do
The strongest version of the paper treats AI model choice as an economic decision under hidden marginal costs. A person wants work done; the platform chooses a resource path; the user sees neither the full price nor the lower-compute substitutes. The research asks whether revealing that margin changes behavior.
Hidden margin
One prompt box hides search, tools, small models, frontier models, reasoning, and agents.
User misperception
Student subjects do not correctly rank the resource cost of different AI routes.
Information response
A short task recommendation plus energy/cost comparison moves users toward the right route.
Welfare test
Savings count only when task success, time, and satisfaction are preserved.
Abstract And Literature
The abstract should make one clean move
Existing routing work shows that cheaper and stronger models can be selected dynamically. Existing energy work shows that inference cost varies with model, tokens, and reasoning. Existing awareness work shows that users can be told about footprint. Our move is economic: estimate whether hidden prices distort model choice, then test whether information changes the choice.
AI chat systems make a prompt appear to be one economic action. It is not: behind the same text box, the platform chooses among routes with different compute, energy, latency, and dollar costs. We show that users are poorly informed about this margin and that a simple task-specific information intervention shifts choices toward the task-matched route.
Possible Results
The results should be stated as tradeoffs, not slogans
The paper should make the two behavioral results front and center, then connect them to the two benchmark worlds. If all tasks were standard frontier, savings are modest. If ordinary tasks are over-routed to reasoning, savings are large. The information intervention tells us whether that hidden margin is actionable.
Subjects do not correctly perceive the resource gradient between search, small models, frontier, reasoning, and agents.
A simple task-specific recommendation changes model choice toward the right route.
Standard-frontier overuse yields modest savings; reasoning-frontier overuse yields much larger savings.
The welfare claim survives only if lower-compute routes preserve task success and avoid rework.