Research design, not final estimates
How much everyday cognitive work is routed to high-compute LLMs when lower-compute paths would be enough?
We classify real LLM conversations by the cheapest sufficient path: search, deterministic tools, small models, standard LLMs, tool agents, reasoning models, or human experts.
Publication Claim
Not “AI is bad.” The question is whether the task deserved that much machine thinking.
The public message is routing discipline: use search when search is enough, tools when tools are enough, small models when small models are enough, and frontier reasoning models only when the task requires them.
Classification Core
Every chat becomes a route decision
Four labels per episode
- Intent: what the user is trying to do
- Cheapest sufficient path: the lowest-compute path likely good enough
- Offload class: public-facing grouping for the headline metric
- Risk features: current info, citations, long output, retry, exact math, high stakes
Main metric
Avoidable means the task appears solvable by normal search or a deterministic tool without losing the user's likely goal. Small-model-sufficient episodes are reported separately.
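As a sketch, the headline metric reduces to a share computation over labeled episodes. The route names and dictionary keys below are illustrative assumptions, not the project's fixed schema:

```python
from collections import Counter

# Assumed route labels: routes counted as "avoidable" under the definition above,
# plus the separately reported small-model route.
AVOIDABLE_ROUTES = {"local_deterministic", "lookup", "web_search"}
SMALL_MODEL_ROUTE = "small_model"

def headline_metrics(episodes):
    """Share of episodes whose lowest sufficient route is a non-LLM path,
    with small-model-sufficient episodes reported separately."""
    n = len(episodes)
    routes = Counter(e["lowest_sufficient_route"] for e in episodes)
    return {
        "avoidable_share": sum(routes[r] for r in AVOIDABLE_ROUTES) / n,
        "small_model_share": routes[SMALL_MODEL_ROUTE] / n,
    }

episodes = [
    {"lowest_sufficient_route": "web_search"},
    {"lowest_sufficient_route": "standard_llm"},
    {"lowest_sufficient_route": "small_model"},
    {"lowest_sufficient_route": "local_deterministic"},
]
print(headline_metrics(episodes))  # → avoidable_share 0.5, small_model_share 0.25
```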
More Granular Routing
Separate capability from where the cost happens
A local calculator and a hosted reasoning model may both answer a question, but they do not have the same sustainability profile. The pilot now records cloud inference, direct API cost, and human time as separate quantities.
Local deterministic
Calculator, spreadsheet, calendar, local script, dictionary. Cloud cost: 0 marginal.
Lookup
Reference page, documentation, FAQ, StackOverflow-style lookup. Cost: pages + time.
Web search
Simple query or multi-source reading and synthesis. Cost: query + reading.
Small model
Short rewrite, translation, tone change, simple generation. Cost: small tokens.
Standard LLM
Tutoring, synthesis, open writing, contextual explanation. Cost: input + output.
Reasoning / agent
Long reasoning, browsing, code execution, files, APIs. Cost: reasoning + tools.
Two labels matter most
- Actual route: what the user actually used (standard LLM, reasoning LLM, or LLM with tools)
- Lowest sufficient route: what likely would have been enough (local tool, lookup, search, small model, LLM, or expert)
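Once the route taxonomy is ordered from cheapest to most expensive, the two labels can be compared directly. A minimal sketch, with assumed route names and an assumed ordering:

```python
from dataclasses import dataclass, field

# Assumed ordering, cheapest to most expensive; not the project's fixed taxonomy.
ROUTES = [
    "local_deterministic", "lookup", "web_search",
    "small_model", "standard_llm", "reasoning_agent", "human_expert",
]

@dataclass
class EpisodeLabels:
    """The four labels attached to each episode (field names are assumptions)."""
    intent: str
    actual_route: str
    lowest_sufficient_route: str
    offload_class: str
    risk_features: list = field(default_factory=list)

    def over_compute(self) -> bool:
        # True when the actual route sits strictly above the lowest
        # route judged sufficient in the cost ordering.
        return ROUTES.index(self.actual_route) > ROUTES.index(self.lowest_sufficient_route)

ep = EpisodeLabels(
    intent="unit conversion",
    actual_route="standard_llm",
    lowest_sufficient_route="local_deterministic",
    offload_class="avoidable",
)
print(ep.over_compute())  # → True
```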
We calculate three costs
This avoids the misleading claim that every non-LLM path is free. Local deterministic tools are zero only for marginal cloud inference; search and reading still have time costs.
Sampling Design
3,000 labels, but not a naive random draw
Stratification makes the pilot useful for sustainability analysis: long outputs, multi-turn retries, non-English chats, and reasoning-model chats are exactly where over-compute may hide.
WildChat pilot allocation
ShareChat validation panel
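The stratified draw described above can be sketched as weighted proportional allocation of the label budget, oversampling the strata where over-compute may hide. The strata names, population counts, and weights below are illustrative assumptions:

```python
import math

# Assumed oversampling weights per stratum; the real design would tune these.
STRATA_WEIGHTS = {
    "long_output": 2.0,
    "multi_turn_retry": 2.0,
    "non_english": 1.5,
    "reasoning_model": 2.0,
    "baseline": 1.0,
}

def allocate(total, counts):
    """Split a label budget across strata by weighted population size."""
    weighted = {s: counts[s] * STRATA_WEIGHTS[s] for s in counts}
    z = sum(weighted.values())
    alloc = {s: math.floor(total * w / z) for s, w in weighted.items()}
    # Hand rounding leftovers to the largest weighted stratum.
    alloc[max(weighted, key=weighted.get)] += total - sum(alloc.values())
    return alloc

# Hypothetical stratum sizes in the source corpus.
counts = {"long_output": 300, "multi_turn_retry": 200, "non_english": 400,
          "reasoning_model": 100, "baseline": 2000}
plan = allocate(3000, counts)
print(plan)
```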
Sustainability Model
We estimate routing cost, not a single magic Wh number
Actual route
Input tokens + output tokens + reasoning tokens + retries + search/tool calls
Alternative route
Local tools + search queries + pages read + small-model tokens + verification time
Potential reduction
Actual route energy minus lowest sufficient route energy, reported as sensitivity bands
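A minimal sketch of this accounting, with placeholder per-unit energy figures; the real analysis would substitute measured or published values and report the band as a proper sensitivity analysis:

```python
# Placeholder Wh-per-unit figures; assumptions for illustration only.
WH = {
    "input_token": 1e-4, "output_token": 3e-4, "reasoning_token": 3e-4,
    "search_query": 0.3, "page_read": 0.05,
    "small_model_token": 2e-5, "tool_call": 0.1,
}

def route_energy(usage):
    """Sum Wh over whatever units a route consumed (retries multiply usage upstream)."""
    return sum(WH[k] * v for k, v in usage.items())

def potential_reduction(actual, alternative, band=0.5):
    """Actual-route energy minus lowest-sufficient-route energy,
    with a crude +/- band standing in for sensitivity bounds."""
    delta = route_energy(actual) - route_energy(alternative)
    return delta, (delta * (1 - band), delta * (1 + band))

actual = {"input_token": 500, "output_token": 800,
          "reasoning_token": 4000, "tool_call": 2}
alternative = {"search_query": 1, "page_read": 3}
delta, (lo, hi) = potential_reduction(actual, alternative)
print(delta, lo, hi)
```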
Decision Matrix
Examples of what each route means
Research Pipeline
From raw chats to publishable evidence
Stratify
Construct the 3,000-episode WildChat pilot and 800-episode ShareChat validation panel.
Seed Labels
Human-label 300-500 episodes and measure agreement on intent, route, and offload class.
LLM Classifier
Run strict JSON classification with short evidence fields and confidence scores.
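The strict-JSON contract can be enforced with a small validator that rejects malformed classifier output before it enters the dataset. The field names and the 280-character evidence cap below are assumptions for illustration:

```python
import json

# Assumed output schema for the classifier.
REQUIRED = {"intent", "actual_route", "lowest_sufficient_route",
            "offload_class", "evidence", "confidence"}

def parse_classifier_output(raw: str) -> dict:
    """Accept only a strict JSON object with exactly the expected fields,
    a [0, 1] confidence score, and a short evidence string."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("classifier output must be a JSON object")
    if set(obj) != REQUIRED:
        raise ValueError(f"unexpected fields: {set(obj) ^ REQUIRED}")
    if not (isinstance(obj["confidence"], (int, float)) and 0 <= obj["confidence"] <= 1):
        raise ValueError("confidence must be in [0, 1]")
    if len(obj["evidence"]) > 280:
        raise ValueError("evidence field must stay short")
    return obj

good = json.dumps({
    "intent": "translate a sentence",
    "actual_route": "standard_llm",
    "lowest_sufficient_route": "small_model",
    "offload_class": "small_model_sufficient",
    "evidence": "single short sentence, no context needed",
    "confidence": 0.82,
})
print(parse_classifier_output(good)["offload_class"])
```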
Audit
Review low-confidence, high-stakes, reasoning-justified, disagreement, and random high-confidence cases.
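The audit selection can be sketched as a union of the priority buckets plus a random slice of confident cases as a blind spot check; the threshold and field names are assumptions:

```python
import random

def audit_queue(episodes, rng, k_random=2):
    """Everything in the priority buckets (low-confidence, high-stakes,
    reasoning-justified, disagreement), plus a random sample of the rest."""
    flagged = [
        e for e in episodes
        if e["confidence"] < 0.6                       # assumed cutoff
        or "high_stakes" in e.get("risk_features", ())
        or e.get("reasoning_justified", False)
        or e.get("disagreement", False)
    ]
    rest = [e for e in episodes if e not in flagged]
    return flagged + rng.sample(rest, min(k_random, len(rest)))

eps = [
    {"id": 1, "confidence": 0.4},
    {"id": 2, "confidence": 0.9, "risk_features": ["high_stakes"]},
    {"id": 3, "confidence": 0.95},
    {"id": 4, "confidence": 0.9, "disagreement": True},
    {"id": 5, "confidence": 0.85},
]
queue = audit_queue(eps, random.Random(0))
print([e["id"] for e in queue])
```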
Calibrate
Compare aggregate patterns against OpenAI/NBER, Microsoft, Anthropic, BIDD, and search baselines.
Evidence Base