AI Sustainability Research

AI Over-Compute

Research design, not final estimates

How much everyday cognitive work is routed to high-compute LLMs when lower-compute paths would be enough?

We classify real LLM conversations by the cheapest sufficient path: search, deterministic tools, small models, standard LLMs, tool agents, reasoning models, or human experts.

- Pilot labels: 3,000 WildChat episodes
- Validation: 800 ShareChat episodes
- Raw data ready: 19.3 GB (WildChat public + ShareChat)
- Headline metric: OCR (Over-Compute Rate)

Publication Claim

Not “AI is bad.” The question is whether the task deserved that much machine thinking.

The public message is routing discipline: use search when search is enough, tools when tools are enough, small models when small models are enough, and frontier reasoning models only when the task requires them.

Classification Core

Every chat becomes a route decision

Each user episode is assigned to one of six routes:

- Tool: Calculator, spreadsheet, dictionary, calendar
- Small model: Rewrite, translate, short drafting
- Standard LLM: Tutoring, writing, synthesis, explanation
- Tool agent: Browsing, code, files, APIs
- Reasoning model: Hard multi-step tasks
- Expert: Medical, legal, financial, safety-critical

Four labels per episode

  1. Intent: What the user is trying to do
  2. Cheapest sufficient path: The lowest-compute path likely good enough
  3. Offload class: Public-facing grouping for the headline metric
  4. Risk features: Current info, citations, long output, retry, exact math, high stakes
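The four labels above can be sketched as a per-episode record. This is a minimal sketch; the field names and example values are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical per-episode label record; names are illustrative,
# not the project's actual schema.
@dataclass
class EpisodeLabels:
    intent: str                  # what the user is trying to do
    cheapest_path: str           # e.g. "search", "tool", "small_model", "llm", "expert"
    offload_class: str           # public-facing grouping for the headline metric
    risk_features: list = field(default_factory=list)  # e.g. ["current_info", "exact_math"]

labels = EpisodeLabels(
    intent="unit conversion",
    cheapest_path="tool",
    offload_class="avoidable",
    risk_features=["exact_math"],
)
```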

Main metric

AI Over-Compute Rate (OCR) = avoidable LLM episodes / comparable episodes

Avoidable means the task appears solvable by normal search or a deterministic tool without losing the user's likely goal. Small-model-sufficient is reported separately.

More Granular Routing

Separate capability from where the cost happens

A local calculator and a hosted reasoning model may both answer a question, but they do not have the same sustainability profile. The pilot now records cloud inference energy, direct API cost, and human time as separate quantities.

0. Local deterministic: Calculator, spreadsheet, calendar, local script, dictionary. Cloud cost: 0 marginal.
1. Lookup: Reference page, documentation, FAQ, StackOverflow-style lookup. Cost: pages + time.
2. Web search: Simple query or multi-source reading and synthesis. Cost: query + reading.
3. Small model: Short rewrite, translation, tone change, simple generation. Cost: small tokens.
4. Standard LLM: Tutoring, synthesis, open writing, contextual explanation. Cost: input + output.
5. Reasoning / agent: Long reasoning, browsing, code execution, files, APIs. Cost: reasoning + tools.

Two labels matter most

  1. Actual route: What the user used (standard LLM, reasoning LLM, or LLM with tools)
  2. Lowest sufficient route: What likely would have been enough (local tool, lookup, search, small model, LLM, or expert)

We calculate three costs

- Cloud inference energy
- Direct API cost
- Human minutes

This avoids the misleading claim that every non-LLM path is free. Local deterministic tools are zero only for marginal cloud inference; search and reading still have time costs.
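The three quantities can be kept as a single per-route cost record. This is a minimal sketch; the field names, units, and example values are assumptions, not project definitions.

```python
from dataclasses import dataclass

# Sketch of the three-quantity cost record described above;
# names, units, and example values are assumptions.
@dataclass
class RouteCost:
    cloud_wh: float    # cloud inference energy (Wh); 0 for local deterministic tools
    api_usd: float     # direct API cost in dollars
    human_min: float   # human minutes (search, reading, verification time)

local_calculator = RouteCost(cloud_wh=0.0, api_usd=0.0, human_min=0.5)
web_search       = RouteCost(cloud_wh=0.3, api_usd=0.0, human_min=2.0)
standard_llm     = RouteCost(cloud_wh=0.3, api_usd=0.002, human_min=0.5)
```

Keeping the three quantities separate avoids collapsing "free for the cloud" and "free for the user" into one number, which is the point the section is making.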

Sampling Design

3,000 labels, but not a naive random draw

Stratification makes the pilot useful for sustainability analysis: long outputs, multi-turn retries, non-English chats, and reasoning-model chats are exactly where over-compute may hide.

WildChat pilot allocation

ShareChat validation panel
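A stratified draw along these lines can be sketched as follows; the strata names and quotas are illustrative, not the pilot's actual allocation.

```python
import random

# Minimal stratified sample; strata names and quotas are illustrative,
# not the pilot's actual allocation.
def stratified_sample(episodes, quotas, seed=0):
    """episodes: dicts with a 'stratum' key; quotas: stratum -> sample count."""
    rng = random.Random(seed)
    picked = []
    for stratum, n in quotas.items():
        pool = [e for e in episodes if e["stratum"] == stratum]
        picked.extend(rng.sample(pool, min(n, len(pool))))
    return picked

episodes = [{"id": i, "stratum": s}
            for i, s in enumerate(["long_output"] * 5
                                  + ["multi_turn_retry"] * 5
                                  + ["non_english"] * 5)]
quotas = {"long_output": 2, "multi_turn_retry": 1, "non_english": 1}
picked = stratified_sample(episodes, quotas)
print(len(picked))  # 4
```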

Sustainability Model

We estimate routing cost, not a single magic Wh number

| Route | Energy per query | Basis |
| --- | --- | --- |
| Search baseline | ~0.3 Wh | Historical reported Google search figure cited in de Vries 2023 |
| Standard LLM prompt | ~0.24-0.31 Wh | Recent production-scale estimates and disclosures for ordinary text prompts |
| Long reasoning query | ~3.91 Wh | Illustrative test-time-scaling estimate for long reasoning-style queries |

- Actual route: input tokens + output tokens + reasoning tokens + retries + search/tool calls
- Alternative route: local tools + search queries + pages read + small-model tokens + verification time
- Potential reduction: actual route energy minus lowest sufficient route energy, reported as sensitivity bands
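Putting the per-route figures and the reduction formula together: the Wh values below echo the baseline table, while the low/high band multipliers are hypothetical placeholders for sensitivity analysis, not measured uncertainty.

```python
# Per-route energy figures from the baseline table; band multipliers
# are hypothetical, not measured uncertainty.
ROUTE_WH = {"local_tool": 0.0, "search": 0.3, "standard_llm": 0.27, "reasoning": 3.91}

def potential_reduction(actual, alternative, band=(0.5, 2.0)):
    """Return (low, central, high) Wh saved by rerouting one episode."""
    central = ROUTE_WH[actual] - ROUTE_WH[alternative]
    return (central * band[0], central, central * band[1])

low, mid, high = potential_reduction("reasoning", "search")
print(mid)  # central estimate ≈ 3.61 Wh
```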

Decision Matrix

Examples of what each route means

| User task | Likely lower path | When LLM is justified |
| --- | --- | --- |
| Simple fact lookup | Search | When explanation, context, or citation synthesis matters |
| Arithmetic or date conversion | Deterministic tool | When embedded in a broader reasoning or planning task |
| Email rewrite or translation | Small model | When style, context, and iteration carry high value |
| Multi-source comparison | Search plus reading | When synthesis and tradeoff reasoning are the main work |
| Code debugging | Docs, search, execution | When repo context or test execution is needed |
| Medical, legal, financial advice | Human expert | LLM can support preparation, not replace judgment |
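The matrix can be read as a toy routing function. The categories and rules below are illustrative simplifications of the human labeling guidance, not the project's actual classifier.

```python
# Toy router over the decision matrix; categories and rules are
# illustrative simplifications, not the project's classifier.
def route(task_type, needs_synthesis=False, high_stakes=False):
    if high_stakes:
        return "human_expert"          # medical, legal, financial, safety-critical
    if task_type == "fact_lookup":
        return "standard_llm" if needs_synthesis else "search"
    if task_type == "arithmetic":
        return "deterministic_tool"    # calculator, spreadsheet, date math
    if task_type == "rewrite":
        return "small_model"           # email rewrite, translation
    return "standard_llm"              # default for open tutoring/writing tasks

print(route("arithmetic"))                          # deterministic_tool
print(route("fact_lookup", needs_synthesis=True))   # standard_llm
```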

Research Pipeline

From raw chats to publishable evidence

  1. Stratify: Construct the 3,000-episode WildChat pilot and 800-episode ShareChat validation panel.
  2. Seed Labels: Human-label 300-500 episodes and measure agreement on intent, route, and offload class.
  3. LLM Classifier: Run strict JSON classification with short evidence fields and confidence scores.
  4. Audit: Review low-confidence, high-stakes, reasoning-justified, disagreement, and random high-confidence cases.
  5. Calibrate: Compare aggregate patterns against OpenAI/NBER, Microsoft, Anthropic, BIDD, and search baselines.
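The strict-JSON contract in the classifier step can be enforced with a minimal validator. The required fields below are an assumed schema based on the labels described earlier, not the project's actual contract.

```python
import json

# Minimal validator for strict-JSON classifier output; the required
# fields are an assumed schema, not the project's actual contract.
REQUIRED = {"intent", "cheapest_path", "offload_class", "evidence", "confidence"}

def parse_label(raw):
    rec = json.loads(raw)
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not (0.0 <= rec["confidence"] <= 1.0):
        raise ValueError("confidence must be in [0, 1]")
    return rec

raw = ('{"intent": "translate", "cheapest_path": "small_model", '
       '"offload_class": "small-model-sufficient", '
       '"evidence": "single short sentence", "confidence": 0.82}')
rec = parse_label(raw)
print(rec["cheapest_path"])  # small_model
```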

Evidence Base

Primary sources used for this design