AI Sustainability Research

AI Over-Compute

Research design, not final estimates

How much everyday cognitive work is routed to high-compute LLMs when lower-compute paths would be enough?

We classify real LLM conversations by the cheapest sufficient path: search, deterministic tools, small models, standard LLMs, tool agents, reasoning models, or human experts.

- Pilot labels: 3,000 WildChat episodes
- Validation: 800 ShareChat episodes
- Raw data ready: 19.3 GB (WildChat public + ShareChat)
- Headline metric: OCR (Over-Compute Rate)

Publication Claim

Not “AI is bad.” The question is whether the task deserved that much machine thinking.

The public message is routing discipline: use search when search is enough, tools when tools are enough, small models when small models are enough, and frontier reasoning models only when the task requires them.

Classification Core

Every chat becomes a route decision

Each user episode is assigned to one of six routes:

- Tool: Calculator, spreadsheet, dictionary, calendar
- Small model: Rewrite, translate, short drafting
- Standard LLM: Tutoring, writing, synthesis, explanation
- Tool agent: Browsing, code, files, APIs
- Reasoning model: Hard multi-step tasks
- Expert: Medical, legal, financial, safety-critical

Four labels per episode

  1. Intent: What the user is trying to do
  2. Cheapest sufficient path: The lowest-compute path likely good enough
  3. Offload class: Public-facing grouping for the headline metric
  4. Risk features: Current info, citations, long output, retry, exact math, high stakes
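The four labels above can be sketched as a per-episode record. This is a minimal sketch; the field names and example values are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical per-episode label record; names are illustrative,
# not the project's actual schema.
@dataclass
class EpisodeLabels:
    intent: str                  # what the user is trying to do
    cheapest_path: str           # e.g. "search", "tool", "small_model", "llm", "expert"
    offload_class: str           # public-facing grouping for the headline metric
    risk_features: list = field(default_factory=list)  # e.g. ["current_info", "exact_math"]

labels = EpisodeLabels(
    intent="unit conversion",
    cheapest_path="tool",
    offload_class="avoidable",
    risk_features=["exact_math"],
)
```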

Main metric

AI Over-Compute Rate (OCR) = avoidable LLM episodes / comparable episodes

Avoidable means the task appears solvable by normal search or a deterministic tool without losing the user's likely goal. Small-model-sufficient is reported separately.

More Granular Routing

Separate capability from where the cost happens

A local calculator and a hosted reasoning model may both answer a question, but they do not have the same sustainability profile. The pilot now records cloud inference energy, direct API cost, and human time as separate quantities.

0. Local deterministic: Calculator, spreadsheet, calendar, local script, dictionary. Cloud cost: 0 marginal.
1. Lookup: Reference page, documentation, FAQ, StackOverflow-style lookup. Cost: pages + time.
2. Web search: Simple query or multi-source reading and synthesis. Cost: query + reading.
3. Small model: Short rewrite, translation, tone change, simple generation. Cost: small tokens.
4. Standard LLM: Tutoring, synthesis, open writing, contextual explanation. Cost: input + output.
5. Reasoning / agent: Long reasoning, browsing, code execution, files, APIs. Cost: reasoning + tools.

Two labels matter most

  1. Actual route: What the user used (standard LLM, reasoning LLM, or LLM with tools)
  2. Lowest sufficient route: What likely would have been enough (local tool, lookup, search, small model, LLM, or expert)

We calculate three costs

- Cloud inference energy
- Direct API cost
- Human minutes

This avoids the misleading claim that every non-LLM path is free. Local deterministic tools are zero only for marginal cloud inference; search and reading still have time costs.
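The three quantities can be kept as a single per-route cost record. This is a minimal sketch; the field names, units, and example values are assumptions, not project definitions.

```python
from dataclasses import dataclass

# Sketch of the three-quantity cost record described above;
# names, units, and example values are assumptions.
@dataclass
class RouteCost:
    cloud_wh: float    # cloud inference energy (Wh); 0 for local deterministic tools
    api_usd: float     # direct API cost in dollars
    human_min: float   # human minutes (search, reading, verification time)

local_calculator = RouteCost(cloud_wh=0.0, api_usd=0.0, human_min=0.5)
web_search       = RouteCost(cloud_wh=0.3, api_usd=0.0, human_min=2.0)
standard_llm     = RouteCost(cloud_wh=0.3, api_usd=0.002, human_min=0.5)
```

Keeping the three quantities separate avoids collapsing "free for the cloud" and "free for the user" into one number, which is the point the section is making.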

Sampling Design

3,000 labels, but not a naive random draw

Stratification makes the pilot useful for sustainability analysis: long outputs, multi-turn retries, non-English chats, and reasoning-model chats are exactly where over-compute may hide.

WildChat pilot allocation

ShareChat validation panel
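A stratified draw along these lines can be sketched as follows; the strata names and quotas are illustrative, not the pilot's actual allocation.

```python
import random

# Minimal stratified sample; strata names and quotas are illustrative,
# not the pilot's actual allocation.
def stratified_sample(episodes, quotas, seed=0):
    """episodes: dicts with a 'stratum' key; quotas: stratum -> sample count."""
    rng = random.Random(seed)
    picked = []
    for stratum, n in quotas.items():
        pool = [e for e in episodes if e["stratum"] == stratum]
        picked.extend(rng.sample(pool, min(n, len(pool))))
    return picked

episodes = [{"id": i, "stratum": s}
            for i, s in enumerate(["long_output"] * 5
                                  + ["multi_turn_retry"] * 5
                                  + ["non_english"] * 5)]
quotas = {"long_output": 2, "multi_turn_retry": 1, "non_english": 1}
picked = stratified_sample(episodes, quotas)
print(len(picked))  # 4
```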

Sustainability Model

We estimate routing cost, not a single magic Wh number

| Route | Energy per query | Basis |
| --- | --- | --- |
| Search baseline | ~0.3 Wh | Historical reported Google search figure cited in de Vries 2023 |
| Standard LLM prompt | ~0.24-0.31 Wh | Recent production-scale estimates and disclosures for ordinary text prompts |
| Long reasoning query | ~3.91 Wh | Illustrative test-time-scaling estimate for long reasoning-style queries |

- Actual route: input tokens + output tokens + reasoning tokens + retries + search/tool calls
- Alternative route: local tools + search queries + pages read + small-model tokens + verification time
- Potential reduction: actual route energy minus lowest sufficient route energy, reported as sensitivity bands
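Putting the per-route figures and the reduction formula together: the Wh values below echo the baseline table, while the low/high band multipliers are hypothetical placeholders for sensitivity analysis, not measured uncertainty.

```python
# Per-route energy figures from the baseline table; band multipliers
# are hypothetical, not measured uncertainty.
ROUTE_WH = {"local_tool": 0.0, "search": 0.3, "standard_llm": 0.27, "reasoning": 3.91}

def potential_reduction(actual, alternative, band=(0.5, 2.0)):
    """Return (low, central, high) Wh saved by rerouting one episode."""
    central = ROUTE_WH[actual] - ROUTE_WH[alternative]
    return (central * band[0], central, central * band[1])

low, mid, high = potential_reduction("reasoning", "search")
print(mid)  # central estimate ≈ 3.61 Wh
```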

Decision Matrix

Examples of what each route means

| User task | Likely lower path | When LLM is justified |
| --- | --- | --- |
| Simple fact lookup | Search | When explanation, context, or citation synthesis matters |
| Arithmetic or date conversion | Deterministic tool | When embedded in a broader reasoning or planning task |
| Email rewrite or translation | Small model | When style, context, and iteration carry high value |
| Multi-source comparison | Search plus reading | When synthesis and tradeoff reasoning are the main work |
| Code debugging | Docs, search, execution | When repo context or test execution is needed |
| Medical, legal, financial advice | Human expert | LLM can support preparation, not replace judgment |
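The matrix can be read as a toy routing function. The categories and rules below are illustrative simplifications of the human labeling guidance, not the project's actual classifier.

```python
# Toy router over the decision matrix; categories and rules are
# illustrative simplifications, not the project's classifier.
def route(task_type, needs_synthesis=False, high_stakes=False):
    if high_stakes:
        return "human_expert"          # medical, legal, financial, safety-critical
    if task_type == "fact_lookup":
        return "standard_llm" if needs_synthesis else "search"
    if task_type == "arithmetic":
        return "deterministic_tool"    # calculator, spreadsheet, date math
    if task_type == "rewrite":
        return "small_model"           # email rewrite, translation
    return "standard_llm"              # default for open tutoring/writing tasks

print(route("arithmetic"))                          # deterministic_tool
print(route("fact_lookup", needs_synthesis=True))   # standard_llm
```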

Research Pipeline

From raw chats to publishable evidence

  1. Stratify: Construct the 3,000-episode WildChat pilot and 800-episode ShareChat validation panel.
  2. Seed Labels: Human-label 300-500 episodes and measure agreement on intent, route, and offload class.
  3. LLM Classifier: Run strict JSON classification with short evidence fields and confidence scores.
  4. Audit: Review low-confidence, high-stakes, reasoning-justified, disagreement, and random high-confidence cases.
  5. Calibrate: Compare aggregate patterns against OpenAI/NBER, Microsoft, Anthropic, BIDD, and search baselines.
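The strict-JSON contract in the classifier step can be enforced with a minimal validator. The required fields below are an assumed schema based on the labels described earlier, not the project's actual contract.

```python
import json

# Minimal validator for strict-JSON classifier output; the required
# fields are an assumed schema, not the project's actual contract.
REQUIRED = {"intent", "cheapest_path", "offload_class", "evidence", "confidence"}

def parse_label(raw):
    rec = json.loads(raw)
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not (0.0 <= rec["confidence"] <= 1.0):
        raise ValueError("confidence must be in [0, 1]")
    return rec

raw = ('{"intent": "translate", "cheapest_path": "small_model", '
       '"offload_class": "small-model-sufficient", '
       '"evidence": "single short sentence", "confidence": 0.82}')
rec = parse_label(raw)
print(rec["cheapest_path"])  # small_model
```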

Evidence Base

Primary sources used for this design