arxiv: 2509.20241 · v1 · pith:CG27KS5Fnew · submitted 2025-09-24 · 💻 cs.LG · cs.DC

Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

Pith reviewed 2026-05-21 21:13 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords AI inference, energy consumption, LLM efficiency, test-time compute, GPU utilization, PUE, token throughput

The pith

A bottom-up calculation shows frontier AI models use a median 0.34 Wh per query on production H100 hardware, 4-20 times below many public estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a bottom-up method to estimate energy per query for large language models by starting from token throughput rather than isolated benchmarks. Under realistic GPU utilization, power usage effectiveness (PUE), and workload conditions on H100 nodes, it arrives at a median of 0.34 Wh per query for models larger than 200 billion parameters. The authors show these figures align with production measurements, while many earlier estimates overstate consumption because they ignore efficiency gains at scale. They further examine test-time scaling that uses 15 times more tokens and find that median energy rises 13-fold to 4.32 Wh, yet targeted improvements at the model, serving, and hardware layers can still cut energy per query by 8-20 times overall. At fleet scale, serving one billion queries per day would consume roughly 0.8 GWh/day at baseline. If 10% of those queries are long, demand could grow to 1.8 GWh/day; with targeted efficiency interventions it falls to about 0.9 GWh/day.

Core claim

The authors introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, they estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh.

What carries the argument

Bottom-up energy estimation from token throughput combined with measured GPU utilization rates, power usage effectiveness, and realistic serving throughput on H100 nodes.

Load-bearing premise

The chosen GPU utilization rates, PUE values, and token-throughput numbers must accurately represent actual large-scale production H100 deployments.

What would settle it

Direct power metering of a production H100 cluster serving real user queries at scale would show whether the median energy per query falls inside or outside the reported 0.18-0.67 Wh interquartile range.

Original abstract

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.
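The fleet-scale figures in the abstract can be sanity-checked with simple unit arithmetic. The sketch below is illustrative and not taken from the paper: the function and variable names are hypothetical, and it only converts the abstract's daily totals into implied per-query averages and recomputes the test-time scaling ratio.

```python
# Illustrative arithmetic only; names are hypothetical, inputs are the abstract's figures.

WH_PER_GWH = 1e9  # 1 GWh = 1e9 Wh


def implied_wh_per_query(daily_gwh: float, queries_per_day: float) -> float:
    """Average energy per query (Wh) implied by a daily fleet total (GWh/day)."""
    return daily_gwh * WH_PER_GWH / queries_per_day


QUERIES_PER_DAY = 1e9  # deployment size used in the abstract

# Daily fleet totals (GWh/day) as stated in the abstract.
scenarios = {
    "baseline": 0.8,
    "10% long queries": 1.8,
    "with efficiency interventions": 0.9,
}
implied = {
    name: implied_wh_per_query(gwh, QUERIES_PER_DAY)
    for name, gwh in scenarios.items()
}

# Ratio of the test-time-scaling median to the typical-query median (both in Wh).
scaling_ratio = 4.32 / 0.34

for name, wh in implied.items():
    print(f"{name}: {wh:.2f} Wh per query on average")
print(f"test-time scaling ratio: {scaling_ratio:.1f}x")
```

At one billion queries per day, each GWh/day corresponds to 1 Wh per query on average, so the baseline implies about 0.8 Wh per query. That is a fleet average rather than the 0.34 Wh median; the abstract does not state how the two relate, so the sketch makes no attempt to reconcile them. The ratio 4.32 / 0.34 comes to roughly 12.7, which the abstract rounds to 13x.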

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a bottom-up methodology for estimating per-query energy consumption in large-scale LLM inference systems, focusing on token throughput for models on H100 nodes. For frontier-scale models (>200B parameters) under realistic GPU utilization and PUE constraints, it reports a median energy per query of 0.34 Wh (IQR: 0.18-0.67 Wh). The authors claim consistency with production measurements and that alternative estimates overstate energy use by 4-20x. They further analyze test-time scaling (15x more tokens leading to a 13x energy increase, to a 4.32 Wh median), quantify efficiency gains (1.5-3.5x per category, 8-20x combined), and estimate system-level impacts for deployments serving 1 billion queries daily (baseline 0.8 GWh/day, rising to 1.8 GWh/day if 10% are long queries, and falling to 0.9 GWh/day with targeted efficiency interventions).

Significance. If the parameter choices prove representative of production fleets, this perspective offers a timely correction to overstated public estimates of AI inference energy use and useful quantification of efficiency pathways at model, serving, and hardware levels. The system-level projections for billion-query deployments and the historical analogy to data-center efficiency gains during cloud growth could inform capacity planning. The significance hinges on whether the bottom-up inputs are grounded in verifiable production data rather than selected ranges.

major comments (3)
  1. [Section 3] Section 3: The bottom-up derivation of the 0.34 Wh median multiplies assumed per-token energy (from H100 TDP, 60-80% utilization, PUE 1.2-1.4, and 100-300 tokens/s throughput) by query length; these specific ranges are presented without citations to production telemetry, fleet-wide averages, or a sensitivity analysis demonstrating robustness of the IQR and 4-20x overstatement multiplier to plausible deviations in utilization or PUE.
  2. [Table 1] Table 1: The tabulated assumptions for utilization, PUE, and throughput lack explicit justification or external data provenance showing they reflect actual large-scale H100 serving conditions; if real fleet averages differ, both the headline energy figure and the consistency claim with production measurements shift proportionally.
  3. [Abstract] Abstract and methods description: The claim that results are 'consistent with measurements using production-scale configurations' and the IQR ranges are reported without raw throughput numbers, derivation equations, exclusion criteria, or the specific external datasets used for validation, preventing assessment of whether the 0.34 Wh value is parameter-free or post-hoc tuned.
minor comments (2)
  1. [Section 3] Add an explicit equation or step-by-step derivation for per-token energy (including units) in Section 3 to improve reproducibility.
  2. Consider including a supplementary sensitivity figure showing how the median and IQR vary across the full plausible range of utilization (e.g., 40-90%) and PUE (1.1-1.6).
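The sensitivity analysis requested in the second minor comment could start from something like the following sketch. It is an assumption-laden illustration, not the paper's method: it supposes that energy per query is proportional to utilization × PUE with throughput and query length held fixed, and it takes the midpoints of the ranges quoted in major comment 1 (70% utilization, PUE 1.3) as a hypothetical baseline. The function name and grid spacing are invented for illustration.

```python
# Hypothetical sensitivity sweep; assumes energy per query is proportional
# to utilization * PUE, with throughput and query length held fixed.

BASE_UTILIZATION = 0.7  # assumed midpoint of the quoted 60-80% range
BASE_PUE = 1.3          # assumed midpoint of the quoted 1.2-1.4 range


def relative_energy(utilization: float, pue: float) -> float:
    """Energy per query relative to the assumed baseline (1.0 means unchanged)."""
    return (utilization * pue) / (BASE_UTILIZATION * BASE_PUE)


utilizations = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # range suggested by the referee
pues = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]          # range suggested by the referee

# Evaluate every (utilization, PUE) combination on the grid.
grid = {(u, p): relative_energy(u, p) for u in utilizations for p in pues}
lowest = min(grid.values())
highest = max(grid.values())

# Print a small table: one row per utilization, one column per PUE.
print("util  " + "  ".join(f"PUE {p:.1f}" for p in pues))
for u in utilizations:
    print(f"{u:.1f}   " + "  ".join(f"{grid[(u, p)]:7.2f}" for p in pues))
print(f"relative energy spans {lowest:.2f}x to {highest:.2f}x of the baseline")
```

Because this model is a simple product, the extremes sit at the corners of the grid. A fuller analysis would also vary throughput, which may not stay fixed as utilization changes.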

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important opportunities to improve the transparency of our assumptions and methodology. We address each major comment below and will revise the manuscript accordingly to add citations, provenance details, derivation equations, and sensitivity analysis.

point-by-point responses
  1. Referee: [Section 3] Section 3: The bottom-up derivation of the 0.34 Wh median multiplies assumed per-token energy (from H100 TDP, 60-80% utilization, PUE 1.2-1.4, and 100-300 tokens/s throughput) by query length; these specific ranges are presented without citations to production telemetry, fleet-wide averages, or a sensitivity analysis demonstrating robustness of the IQR and 4-20x overstatement multiplier to plausible deviations in utilization or PUE.

    Authors: We agree that explicit citations and a sensitivity analysis would strengthen the section. The chosen ranges reflect a synthesis of publicly available hardware specifications (NVIDIA H100 TDP and performance guides), industry reports on typical inference utilization (60-80% is commonly cited for sustained workloads), and data-center PUE values from sources such as the Uptime Institute. In revision we will add these references and include a sensitivity table or figure varying utilization (50-90%), PUE (1.1-1.5), and throughput (80-400 tokens/s) to show that the median energy and the 4-20x overstatement factor remain within the same order of magnitude across plausible deviations. revision: yes

  2. Referee: [Table 1] Table 1: The tabulated assumptions for utilization, PUE, and throughput lack explicit justification or external data provenance showing they reflect actual large-scale H100 serving conditions; if real fleet averages differ, both the headline energy figure and the consistency claim with production measurements shift proportionally.

    Authors: We will expand Table 1 with footnotes and a new methods subsection that cites the provenance for each parameter: NVIDIA documentation for TDP and peak throughput, aggregated public reports on GPU-cluster utilization in production inference, and PUE ranges drawn from hyperscale data-center benchmarks. While we cannot publish proprietary fleet telemetry, the selected midpoints are consistent with the range of values reported in open literature and vendor case studies. We will also qualify the consistency claim to emphasize order-of-magnitude agreement rather than exact numerical equivalence. revision: yes

  3. Referee: [Abstract] Abstract and methods description: The claim that results are 'consistent with measurements using production-scale configurations' and the IQR ranges are reported without raw throughput numbers, derivation equations, exclusion criteria, or the specific external datasets used for validation, preventing assessment of whether the 0.34 Wh value is parameter-free or post-hoc tuned.

    Authors: We will revise the abstract and add a dedicated methods paragraph that states the core equation (energy per query = (TDP × utilization × PUE / throughput) × tokens per query), lists the exact parameter bounds used to generate the median and IQR, and describes how the IQR is obtained by taking the interquartile range across the uniform parameter space rather than from empirical sampling. We will also reference the public sources used for cross-validation. The 0.34 Wh figure is the direct midpoint of the stated ranges and is not the result of post-hoc tuning; this will be made explicit in the revision. revision: yes
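The core equation quoted in the third response can be written out as a short sketch. This is an illustration under stated assumptions, not the authors' implementation: the ranges for utilization, PUE, and throughput are the ones quoted by the referee, while `POWER_W`, `TOKENS_PER_QUERY`, the function names, and the use of seeded random sampling to approximate a uniform parameter space are placeholders chosen here. The output is therefore not expected to reproduce the 0.34 Wh median.

```python
import random
import statistics

POWER_W = 700.0           # placeholder: rated power of the serving hardware, in watts
TOKENS_PER_QUERY = 500.0  # placeholder: tokens generated per query


def energy_per_query_wh(power_w, utilization, pue, tokens_per_s, tokens_per_query):
    """(power x utilization x PUE / throughput) x tokens per query, in Wh."""
    joules_per_token = power_w * utilization * pue / tokens_per_s  # W / (tokens/s) = J/token
    return joules_per_token * tokens_per_query / 3600.0            # 3600 J = 1 Wh


def sample_energy_wh(n, seed=0):
    """Draw n parameter sets uniformly from the quoted ranges and evaluate each."""
    rng = random.Random(seed)
    return [
        energy_per_query_wh(
            power_w=POWER_W,
            utilization=rng.uniform(0.6, 0.8),     # 60-80% utilization
            pue=rng.uniform(1.2, 1.4),             # PUE 1.2-1.4
            tokens_per_s=rng.uniform(100, 300),    # 100-300 tokens/s
            tokens_per_query=TOKENS_PER_QUERY,
        )
        for _ in range(n)
    ]


samples = sample_energy_wh(10_000)
q1, median, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
print(f"median {median:.3f} Wh, IQR {q1:.3f}-{q3:.3f} Wh (placeholder inputs)")
```

Dividing by 3600 converts joules to watt-hours, which keeps the units explicit as the first minor comment requests. Swapping in the paper's actual parameter bounds and query lengths would be needed before comparing the result against the reported median and IQR.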

Circularity Check

0 steps flagged

Bottom-up energy estimation uses independent parameter assumptions with no reduction to inputs by construction

full rationale

The paper derives per-query energy via a bottom-up calculation from token throughput, stated GPU utilization rates, PUE values, and hardware TDP under realistic workloads. These inputs are presented as representative choices for production H100 deployments rather than fitted to or defined by the resulting 0.34 Wh median; the output is a direct multiplication of per-token energy by query length with no self-referential loop. Claims of consistency with external production-scale measurements are offered as corroboration outside the derivation itself. No self-definitional, fitted-input-as-prediction, or self-citation load-bearing patterns appear in the described chain, making the methodology self-contained against its stated assumptions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central estimates rest on domain assumptions about realistic production utilization and PUE rather than new physical constants or invented entities; no free parameters are explicitly fitted in the abstract, but the methodology implicitly selects representative values for GPU utilization and overhead.

free parameters (2)
  • GPU utilization rate
    Chosen to reflect realistic large-scale serving workloads on H100 nodes; value not numerically stated in abstract but used to derive 0.34 Wh median.
  • PUE factor
    Power usage effectiveness assumed for production data centers; enters the bottom-up energy calculation.
axioms (1)
  • domain assumption: Energy consumption scales directly with token throughput under fixed hardware and utilization constraints
    Core premise of the bottom-up methodology stated in the abstract.

pith-pipeline@v0.9.0 · 5903 in / 1467 out tokens · 45529 ms · 2026-05-21T21:13:43.557387+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

    cs.DC · 2025-11 · unverdicted · novelty 6.0

    Local LLMs answer 88.7% of 1M real-world queries, with intelligence per watt (IPW) improving 5.3x from 2023 to 2025, indicating local inference can handle most queries efficiently on power-constrained devices.
