Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

2025 · cs.LG · arXiv 2509.20241

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.
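The headline ratios in the abstract can be cross-checked with simple arithmetic. The sketch below uses only figures quoted above; the constant and function names are illustrative and do not come from the paper. It also computes a naive "median energy times query count" product as a reference point. That product is lower than the reported 0.8 GWh/day, and the abstract does not say how the fleet figure is aggregated, so the naive value should not be read as a reconstruction of the authors' method.

```python
# Figures quoted in the abstract (names are illustrative, not from the paper).
MEDIAN_WH = 0.34        # median energy per query, frontier-scale models
LONG_WH = 4.32          # median energy per query under test-time scaling
QUERIES_PER_DAY = 1e9   # deployment size used in the abstract
BASELINE_GWH = 0.8      # reported baseline fleet energy per day
LONG_MIX_GWH = 1.8      # reported fleet energy when 10% of queries are long
EFFICIENT_GWH = 0.9     # reported fleet energy after efficiency interventions


def wh_to_gwh(wh: float) -> float:
    """Convert watt-hours to gigawatt-hours."""
    return wh / 1e9


def ratios() -> dict:
    """Return the ratios implied by the quoted figures."""
    return {
        # Per-query increase under test-time scaling (abstract rounds to 13x).
        "test_time_increase": LONG_WH / MEDIAN_WH,
        # Fleet growth when 10% of queries are long.
        "long_mix_growth": LONG_MIX_GWH / BASELINE_GWH,
        # Fleet reduction from targeted efficiency interventions.
        "intervention_reduction": LONG_MIX_GWH / EFFICIENT_GWH,
        # Naive reference point only: median per-query energy times query count.
        "naive_median_fleet_gwh": wh_to_gwh(MEDIAN_WH * QUERIES_PER_DAY),
    }


if __name__ == "__main__":
    for name, value in ratios().items():
        print(f"{name}: {value:.2f}")
```

Read this way, the long-query scenario raises fleet demand by about 2.25x over the baseline, and the interventions roughly halve it, which is how the abstract arrives back near the baseline figure.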

representative citing papers

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

Proposes EpG and OOI metrics showing agentic workflows use 4.33x more energy per successful goal than linear baselines due to orchestration structure.

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

cs.DC · 2025-11-11 · unverdicted · novelty 6.0

Local LLMs answer 88.7% of 1M real-world queries, with IPW improving 5.3x from 2023 to 2025, indicating that local inference can handle most queries efficiently on power-constrained devices.

citing papers explorer

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems
cs.AI · 2026-05-20 · unverdicted · none · ref 29 · internal anchor
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
cs.DC · 2025-11-11 · unverdicted · none · ref 3 · internal anchor
