Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

Allen Kim; Amy Luers; Esha Choukse; Felipe Oviedo; Fiodar Kazhamiaka; Juan M. Lavista Ferres; Melanie Nakagawa; Ricardo Bianchini

arxiv: 2509.20241 · v1 · pith:CG27KS5Fnew · submitted 2025-09-24 · 💻 cs.LG · cs.DC

Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

Felipe Oviedo , Fiodar Kazhamiaka , Esha Choukse , Allen Kim , Amy Luers , Melanie Nakagawa , Ricardo Bianchini , Juan M. Lavista Ferres This is my paper

Pith reviewed 2026-05-21 21:13 UTC · model grok-4.3

classification 💻 cs.LG cs.DC

keywords AI inferenceenergy consumptionLLM efficiencytest-time computeGPU utilizationPUEtoken throughput

0 comments

The pith

A bottom-up calculation shows frontier AI models use a median 0.34 Wh per query on production H100 hardware, 4-20 times below many public estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a bottom-up method to estimate energy per query for large language models by starting from token throughput rather than isolated benchmarks. Under realistic GPU utilization, power usage effectiveness, and workload conditions on H100 nodes, it arrives at a median of 0.34 Wh per query for models larger than 200 billion parameters. The authors show these figures align with production measurements while many earlier estimates overstate consumption because they ignore efficiency gains at scale. They further examine test-time scaling that uses 15 times more tokens and find energy rises 13-fold to 4.32 Wh, yet targeted improvements at the model, serving, and hardware layers can still cut energy per query by 8-20 times overall. At fleet scale, serving one billion queries per day would consume roughly 0.8 GWh without intervention and could stay near 0.9 GWh even with some longer reasoning queries once efficiencies are applied.

Core claim

The authors introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, they estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh.

What carries the argument

Bottom-up energy estimation from token throughput combined with measured GPU utilization rates, power usage effectiveness, and realistic serving throughput on H100 nodes.

Load-bearing premise

The chosen GPU utilization rates, PUE values, and token-throughput numbers must accurately represent actual large-scale production H100 deployments.

What would settle it

Direct power metering of a production H100 cluster serving real user queries at scale would show whether the median energy per query falls inside or outside the reported 0.18-0.67 Wh interquartile range.

read the original abstract

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Bottom-up energy estimates for AI inference provide useful baselines but hinge on the representativeness of utilization and PUE assumptions for production H100 systems.

read the letter

The one or two things to know: this paper gives a median energy use of 0.34 Wh per query for large models on H100 nodes, with an interquartile range of 0.18-0.67 Wh, and shows that test-time compute scaling can multiply that by 13x. It also maps out efficiency improvements that could reduce energy by up to 20x when combined across levels. What the paper does well is provide a bottom-up calculation based on token throughput instead of loose extrapolations. This leads to specific numbers for frontier models over 200 billion parameters under realistic workloads, GPU utilization, and PUE constraints. The extension to test-time scaling scenarios is particularly relevant now that reasoning models are using more tokens per query. They quantify gains at model, serving platform, and hardware levels separately and together, and illustrate the impact with daily energy use for a billion queries, dropping from 0.8 GWh to 0.9 GWh with interventions. This kind of system-level view helps with capacity planning and emissions accounting. The soft spots are in the assumptions behind the numbers. The estimates use ranges for GPU utilization rates, PUE factors, and token throughput that may or may not reflect actual production H100 fleets. The abstract claims consistency with production-scale measurements, but without detailed sourcing or sensitivity analysis in the visible parts, it's difficult to assess how robust the 4-20x overstatement claim is. If real-world averages differ, the per-query figures and the multipliers would change by the same proportion. The IQR helps show variability, but more on how the parameters were chosen or validated would make the results more convincing. This work is aimed at readers involved in AI infrastructure, data center operations, or sustainability efforts who need practical baselines as query volumes grow to billions. It offers value to those looking for efficiency pathways rather than just alarmist estimates. Given the timeliness and the attempt at reproducible methodology, it deserves a serious referee to check the methods and data sources. I would recommend engaging with it in peer review, focusing on verifying the input parameters against real telemetry.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a bottom-up methodology for estimating per-query energy consumption in large-scale LLM inference systems, focusing on token throughput for models on H100 nodes. For frontier-scale models (>200B parameters) under realistic GPU utilization and PUE constraints, it reports a median energy per query of 0.34 Wh (IQR: 0.18-0.67 Wh). The authors claim consistency with production measurements and that alternative estimates overstate energy use by 4-20x. They further analyze test-time scaling (15x more tokens leading to 13x energy increase to 4.32 Wh median), quantify efficiency gains (1.5-3.5x per category, up to 8-20x combined), and estimate system-level impacts for deployments serving 1 billion queries daily (baseline 0.8 GWh/day, reducible to 0.9 GWh/day with interventions).

Significance. If the parameter choices prove representative of production fleets, this perspective offers a timely correction to overstated public estimates of AI inference energy use and useful quantification of efficiency pathways at model, serving, and hardware levels. The system-level projections for billion-query deployments and the historical analogy to data-center efficiency gains during cloud growth could inform capacity planning. The significance hinges on whether the bottom-up inputs are grounded in verifiable production data rather than selected ranges.

major comments (3)

[Section 3] Section 3: The bottom-up derivation of the 0.34 Wh median multiplies assumed per-token energy (from H100 TDP, 60-80% utilization, PUE 1.2-1.4, and 100-300 tokens/s throughput) by query length; these specific ranges are presented without citations to production telemetry, fleet-wide averages, or a sensitivity analysis demonstrating robustness of the IQR and 4-20x overstatement multiplier to plausible deviations in utilization or PUE.
[Table 1] Table 1: The tabulated assumptions for utilization, PUE, and throughput lack explicit justification or external data provenance showing they reflect actual large-scale H100 serving conditions; if real fleet averages differ, both the headline energy figure and the consistency claim with production measurements shift proportionally.
[Abstract] Abstract and methods description: The claim that results are 'consistent with measurements using production-scale configurations' and the IQR ranges are reported without raw throughput numbers, derivation equations, exclusion criteria, or the specific external datasets used for validation, preventing assessment of whether the 0.34 Wh value is parameter-free or post-hoc tuned.

minor comments (2)

[Section 3] Add an explicit equation or step-by-step derivation for per-token energy (including units) in Section 3 to improve reproducibility.
Consider including a supplementary sensitivity figure showing how the median and IQR vary across the full plausible range of utilization (e.g., 40-90%) and PUE (1.1-1.6).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important opportunities to improve the transparency of our assumptions and methodology. We address each major comment below and will revise the manuscript accordingly to add citations, provenance details, derivation equations, and sensitivity analysis.

read point-by-point responses

Referee: [Section 3] Section 3: The bottom-up derivation of the 0.34 Wh median multiplies assumed per-token energy (from H100 TDP, 60-80% utilization, PUE 1.2-1.4, and 100-300 tokens/s throughput) by query length; these specific ranges are presented without citations to production telemetry, fleet-wide averages, or a sensitivity analysis demonstrating robustness of the IQR and 4-20x overstatement multiplier to plausible deviations in utilization or PUE.

Authors: We agree that explicit citations and a sensitivity analysis would strengthen the section. The chosen ranges reflect a synthesis of publicly available hardware specifications (NVIDIA H100 TDP and performance guides), industry reports on typical inference utilization (60-80% is commonly cited for sustained workloads), and data-center PUE values from sources such as the Uptime Institute. In revision we will add these references and include a sensitivity table or figure varying utilization (50-90%), PUE (1.1-1.5), and throughput (80-400 tokens/s) to show that the median energy and the 4-20x overstatement factor remain within the same order of magnitude across plausible deviations. revision: yes
Referee: [Table 1] Table 1: The tabulated assumptions for utilization, PUE, and throughput lack explicit justification or external data provenance showing they reflect actual large-scale H100 serving conditions; if real fleet averages differ, both the headline energy figure and the consistency claim with production measurements shift proportionally.

Authors: We will expand Table 1 with footnotes and a new methods subsection that cites the provenance for each parameter: NVIDIA documentation for TDP and peak throughput, aggregated public reports on GPU-cluster utilization in production inference, and PUE ranges drawn from hyperscale data-center benchmarks. While we cannot publish proprietary fleet telemetry, the selected midpoints are consistent with the range of values reported in open literature and vendor case studies. We will also qualify the consistency claim to emphasize order-of-magnitude agreement rather than exact numerical equivalence. revision: yes
Referee: [Abstract] Abstract and methods description: The claim that results are 'consistent with measurements using production-scale configurations' and the IQR ranges are reported without raw throughput numbers, derivation equations, exclusion criteria, or the specific external datasets used for validation, preventing assessment of whether the 0.34 Wh value is parameter-free or post-hoc tuned.

Authors: We will revise the abstract and add a dedicated methods paragraph that states the core equation (energy per query = (TDP × utilization × PUE / throughput) × tokens per query), lists the exact parameter bounds used to generate the median and IQR, and describes how the IQR is obtained by taking the interquartile range across the uniform parameter space rather than from empirical sampling. We will also reference the public sources used for cross-validation. The 0.34 Wh figure is the direct midpoint of the stated ranges and is not the result of post-hoc tuning; this will be made explicit in the revision. revision: yes

Circularity Check

0 steps flagged

Bottom-up energy estimation uses independent parameter assumptions with no reduction to inputs by construction

full rationale

The paper derives per-query energy via a bottom-up calculation from token throughput, stated GPU utilization rates, PUE values, and hardware TDP under realistic workloads. These inputs are presented as representative choices for production H100 deployments rather than fitted to or defined by the resulting 0.34 Wh median; the output is a direct multiplication of per-token energy by query length with no self-referential loop. Claims of consistency with external production-scale measurements are offered as corroboration outside the derivation itself. No self-definitional, fitted-input-as-prediction, or self-citation load-bearing patterns appear in the described chain, making the methodology self-contained against its stated assumptions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central estimates rest on domain assumptions about realistic production utilization and PUE rather than new physical constants or invented entities; no free parameters are explicitly fitted in the abstract, but the methodology implicitly selects representative values for GPU utilization and overhead.

free parameters (2)

GPU utilization rate
Chosen to reflect realistic large-scale serving workloads on H100 nodes; value not numerically stated in abstract but used to derive 0.34 Wh median.
PUE factor
Power usage effectiveness assumed for production data centers; enters the bottom-up energy calculation.

axioms (1)

domain assumption Energy consumption scales directly with token throughput under fixed hardware and utilization constraints
Core premise of the bottom-up methodology stated in the abstract.

pith-pipeline@v0.9.0 · 5903 in / 1467 out tokens · 45529 ms · 2026-05-21T21:13:43.557387+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
cs.DC 2025-11 unverdicted novelty 6.0

Local LLMs answer 88.7% of 1M real-world queries with IPW improving 5.3x from 2023-2025, indicating local inference can handle most queries efficiently on power-constrained devices.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1]

Frontier language models have become much smaller,

E. Erdil, “Frontier language models have become much smaller,” Epoch AI. Accessed: Sept. 16, 2025. [Online]. Available: https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller [24] U. Hölzle, S. V. President, and Operations, “Powering a Google search,” Official Google Blog. Accessed: Sept. 23, 2025. [Online]. Available: https:/...

work page doi:10.48550/arxiv.2504.21233 2025
[2]

Net zero needs AI — five actions to realize its promise,

A. Luers, “Net zero needs AI — five actions to realize its promise,” Nature, vol. 644, no. 8078, pp. 871–873, Aug. 2025, doi: 10.1038/d41586-025-02641-4. SUPPLEMENTAL I. TPS Benchmark Model TP Size Quantization Tokens per Second (TPS) Input Length Output Length Source Llama 3.1 405B 8 FP8 2050.00 500 2000 llamaNemotron Article Llama 3.1 405B 8 FP8 480.00 ...

work page doi:10.1038/d41586-025-02641-4 2025

[1] [1]

Frontier language models have become much smaller,

E. Erdil, “Frontier language models have become much smaller,” Epoch AI. Accessed: Sept. 16, 2025. [Online]. Available: https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller [24] U. Hölzle, S. V. President, and Operations, “Powering a Google search,” Official Google Blog. Accessed: Sept. 23, 2025. [Online]. Available: https:/...

work page doi:10.48550/arxiv.2504.21233 2025

[2] [2]

Net zero needs AI — five actions to realize its promise,

A. Luers, “Net zero needs AI — five actions to realize its promise,” Nature, vol. 644, no. 8078, pp. 871–873, Aug. 2025, doi: 10.1038/d41586-025-02641-4. SUPPLEMENTAL I. TPS Benchmark Model TP Size Quantization Tokens per Second (TPS) Input Length Output Length Source Llama 3.1 405B 8 FP8 2050.00 500 2000 llamaNemotron Article Llama 3.1 405B 8 FP8 480.00 ...

work page doi:10.1038/d41586-025-02641-4 2025