arxiv: 2607.02391 · v1 · pith:HPXGMTM4new · submitted 2026-07-02 · 💻 cs.DC · cs.LG

WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

Pith reviewed 2026-07-03 05:40 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords LLM inference · power prediction · latency prediction · GPU energy efficiency · machine learning for systems · cross-validation · unseen hardware · generalization

The pith

WattGPU models predict power draw and latency for LLM inference on unseen GPUs using only public metadata and specs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that mean GPU power consumption and inter-token latency during large language model inference can be forecasted accurately without any hardware access or profiling runs. This would matter to operators because matching specific models to efficient graphics cards currently requires exhaustive testing of every pairing, which becomes impractical as workloads grow. The method builds two models that take only publicly released details about the language models and the graphics processing units as input. Rigorous leave-one-GPU-out and leave-one-LLM-out tests on 42 models and 8 cards show the power predictions stay within 3.4 percent median error for offline cases and 13.5 percent for server cases on completely new hardware, while latency stays within 8.5 percent in server mode. These results also preserve useful ranking order among GPUs and cut error rates by two to four times relative to simple physical baselines.

Core claim

WattGPU introduces two predictive models for mean GPU power draw and inter-token latency that rely exclusively on publicly available LLM metadata and GPU specifications. Leave-one-GPU-out and leave-one-LLM-out cross-validation on 42 open-source models (0.1B to 27B parameters) and 8 GPUs demonstrates generalization to unseen NVIDIA server-grade hardware in both offline and server inference scenarios. The power model reaches median absolute percentage errors of at most 3.4 percent offline and 13.5 percent server-side on unseen GPUs; the latency model reaches 8.5 percent in server mode, with Kendall tau correlations of at least 0.76 for GPU rankings. These figures represent error reductions of approximately 4× on unseen LLM-GPU combinations in server scenarios, or approximately 2× on completely unseen GPUs, relative to the Load-Scaled TDP and roofline baselines.

What carries the argument

Two machine learning models that map public LLM metadata and GPU specifications directly to mean power draw and inter-token latency.

If this is right

  • Operators gain the ability to evaluate many LLM-GPU pairings for energy efficiency without running profiling workloads on each combination.
  • GPU rankings for server scenarios remain reliable enough for selection decisions even when absolute errors are present.
  • Energy estimates improve by factors of two to four over load-scaled TDP and roofline methods on unseen combinations.
  • The same public-data approach applies across both offline batch and online server inference workloads.
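The roofline baseline named in the list above is described in the paper as the time to stream the LLM's FP16 weights from GPU memory once per token. A minimal sketch of that approximation follows; the function name, the default of 2 bytes per parameter, and the example numbers are illustrative assumptions, not the authors' code.

```python
def roofline_itl_ms(num_params: float, mem_bandwidth_gb_s: float,
                    bytes_per_param: float = 2.0) -> float:
    """Approximate inter-token latency (ms) under a memory-bound assumption.

    Every generated token is assumed to require reading all weights once
    from GPU memory (FP16 weights take 2 bytes per parameter).
    """
    if mem_bandwidth_gb_s <= 0:
        raise ValueError("memory bandwidth must be positive")
    weight_bytes = num_params * bytes_per_param
    seconds = weight_bytes / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical example: 8e9 parameters on a GPU with 1000 GB/s of bandwidth.
example_itl = roofline_itl_ms(8e9, 1000.0)
```

Because this estimate ignores batching, KV-cache traffic, and compute limits, it is best read as a rough physical reference point rather than a prediction of server-mode latency.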

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar models could be trained for non-NVIDIA accelerators if equivalent public specification data become available.
  • Embedding these predictors into cluster schedulers could automate energy-aware placement of inference jobs at scale.
  • Testing the predictors on LLMs larger than 27B parameters would reveal whether prediction accuracy holds as model size increases.
  • The technique opens a path to system-level simulations that combine many such predictions to forecast data-center power demand.
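As a rough illustration of the energy-aware placement idea, the sketch below picks the GPU with the lowest predicted power among those whose predicted ITL meets a latency bound. The function `pick_gpu`, the GPU names, and all numbers are hypothetical placeholders, not values from the paper.

```python
from typing import Dict, Optional, Tuple


def pick_gpu(predictions: Dict[str, Tuple[float, float]],
             max_itl_ms: float) -> Optional[str]:
    """Return the lowest-power GPU whose predicted ITL is below the bound.

    `predictions` maps a GPU name to (predicted_power_w, predicted_itl_ms).
    Returns None when no GPU satisfies the latency bound.
    """
    feasible = {gpu: power
                for gpu, (power, itl) in predictions.items()
                if itl < max_itl_ms}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)


# Placeholder predictions for illustration only (not measurements).
candidates = {
    "gpu_a": (300.0, 12.0),
    "gpu_b": (170.0, 22.0),
    "gpu_c": (90.0, 40.0),
}
choice = pick_gpu(candidates, max_itl_ms=25.0)
```

Here `gpu_c` draws the least power but misses the 25 ms bound, so the selection falls to `gpu_b`.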

Load-bearing premise

The cross-validation performance observed on the 42 LLMs and 8 GPUs will generalize to arbitrary new GPUs and LLMs in actual server and offline deployments.
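The leave-one-GPU-out protocol behind this premise can be sketched as a group-wise split: every measurement from one GPU is held out while the model trains on the rest. The helper below is an assumption-level illustration using only the standard library; the row labels are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) per distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical rows: each measurement is tagged with the GPU it ran on.
row_gpu = ["gpu_a", "gpu_a", "gpu_b", "gpu_c", "gpu_b"]
folds = list(leave_one_group_out(row_gpu))
```

Grouping rows by LLM identifier instead of GPU name gives the leave-one-LLM-out variant with the same helper.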

What would settle it

Collecting real power and latency measurements on an NVIDIA GPU and LLM pair absent from the original dataset and observing median absolute percentage errors substantially above 13.5 percent for power or 8.5 percent for latency.
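Such a check reduces to computing the median absolute percentage error on the new measurements and comparing it with the reported bound. A minimal sketch follows; the measured and predicted values are hypothetical, and only the 13.5 percent threshold comes from the paper.

```python
from statistics import median
from typing import Sequence


def median_ape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Median absolute percentage error, in percent.

    Assumes every actual value is non-zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and equally long")
    return median(abs(p - a) / abs(a) * 100.0
                  for a, p in zip(actual, predicted))


# Hypothetical power readings (W) on a GPU absent from the dataset.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
error_pct = median_ape(measured, predicted)

# 13.5 is the server-scenario power bound reported in the paper.
contradicts_power_claim = error_pct > 13.5
```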

Figures

Figures reproduced from arXiv: 2607.02391 by Jonathan Fürst, Marta Patiño-Martínez, Mauricio Fadel Argerich.

Figure 1: GPU selection is a key lever for energy savings in LLM inference. For deploying Llama 3.1 8B with a Service Level Agreement (SLA) of a mean ITL < 25ms, it is possible to reduce power draw by 43% by choosing an A30 instead of an H100. Data from the Watt Counts dataset showing only GPUs where the LLM fits in memory.

Figure 2: Power draw distribution across the 42 LLMs, per GPU and scenario. Each box represents the distribution of the per-LLM mean power draw samples averaged over iterations with the same (GPU, LLM, scenario) configuration. Error bars show ±1 standard deviation across samples.

Figure 3: Feature importances of the trained XGBoost regressors for mean power draw and ITL. Both importances are averaged over all folds for LOGO validation. Error bars show ±1 standard deviation across folds.

Original abstract

Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce WattGPU, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only publicly available LLM metadata and GPU specifications, eliminating the need for hardware access or profiling while enabling generalization to unseen NVIDIA server-grade GPUs and LLMs. We evaluate our models using rigorous leave-one-GPU-out and leave-one-LLM-out cross-validation on a dataset of 42 open-source LLMs (0.1B–27B parameters) and 8 GPUs under both offline and server scenarios. The mean power draw model achieves a median absolute percentage error of ≤3.4% for offline and ≤13.5% for server scenarios on unseen GPUs, while the latency model achieves ≤8.5% in server mode, both maintaining strong GPU ranking correlations for server scenarios (Kendall τ ≥ 0.76). Compared to standard physically grounded baselines – Load-Scaled Thermal Design Power (TDP) for power draw and roofline for latency – our models reduce median absolute percentage error by approximately 4× on unseen LLM-GPU combinations for server scenarios or approximately 2× for completely unseen GPUs. WattGPU's data and code are publicly available at https://github.com/maufadel/wattgpu.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces WattGPU, consisting of two ML models that predict mean GPU power draw and inter-token latency (ITL) for LLM inference workloads. The models use only publicly available LLM metadata and GPU specifications (no profiling required) and are evaluated via leave-one-GPU-out and leave-one-LLM-out cross-validation on 42 LLMs (0.1B–27B parameters) and 8 GPUs under offline and server scenarios. Reported results include median absolute percentage error ≤3.4% (offline power), ≤13.5% (server power), and ≤8.5% (server latency) on held-out items, with Kendall τ ≥0.76 for rankings and 2–4× error reduction versus TDP and roofline baselines.

Significance. If the generalization claims hold, the work offers a practical, profiling-free method for matching LLMs to efficient GPUs, which could meaningfully reduce data-center energy consumption for inference. Public release of data and code at the cited GitHub repository is a clear strength for reproducibility and further testing.

major comments (2)
  1. [Abstract / evaluation] Abstract and evaluation description: leave-one-GPU-out CV is conducted on only 8 GPUs total (training on 7 per fold). This sample size is load-bearing for the central claim of generalization to arbitrary unseen NVIDIA server-grade GPUs, because any shared traits (e.g., similar SM counts, memory hierarchies, or TDP scaling among the chosen cards) could be implicitly captured by the feature-based regressor rather than learned from public specs alone; the reported 2–4× error reductions versus baselines may therefore be specific to this narrow distribution.
  2. [Abstract] Abstract: the claim that the models 'enable generalization to unseen NVIDIA server-grade GPUs and LLMs' rests on the public-spec features being sufficient; however, with N=8 GPUs the paper must demonstrate (via feature ablation or diversity analysis) that the regressor is not simply memorizing the sampled hardware cluster.
minor comments (1)
  1. [Abstract] The abstract states Kendall τ ≥0.76 for server scenarios but does not clarify whether this holds for both power and latency models or only one; a table or explicit breakdown would improve clarity.
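For readers unfamiliar with the ranking metric at issue, the sketch below computes Kendall's tau between measured and predicted values for a set of GPUs. It implements the simple tau-a variant (no tie correction), which is an assumption: the abstract does not say which variant the paper uses. The input values are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-GPU power values (W); the last two GPUs swap rank.
measured_w = [90.0, 170.0, 300.0, 400.0]
predicted_w = [100.0, 160.0, 350.0, 320.0]
tau = kendall_tau_a(measured_w, predicted_w)
```

Reporting this value separately for the power and latency models, per scenario, would give the explicit breakdown the comment asks for.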

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the evaluation setup and generalization claims. We address each major comment below, acknowledging the inherent limitations of our dataset while outlining concrete revisions to strengthen the manuscript.

  1. Referee: [Abstract / evaluation] Abstract and evaluation description: leave-one-GPU-out CV is conducted on only 8 GPUs total (training on 7 per fold). This sample size is load-bearing for the central claim of generalization to arbitrary unseen NVIDIA server-grade GPUs, because any shared traits (e.g., similar SM counts, memory hierarchies, or TDP scaling among the chosen cards) could be implicitly captured by the feature-based regressor rather than learned from public specs alone; the reported 2–4× error reductions versus baselines may therefore be specific to this narrow distribution.

    Authors: We agree that N=8 GPUs represents a genuine limitation for claiming generalization to arbitrary unseen NVIDIA server-grade GPUs, as shared architectural traits within this set could influence results. The GPUs span multiple generations (Ampere, Ada, Hopper) with substantial variation in SM counts, memory capacity/bandwidth, and TDP, as listed in the manuscript's GPU table. The feature set is restricted to public specifications precisely to capture these differences. Nevertheless, the small sample size weakens the strength of the central claim, and we will add an explicit discussion of this limitation plus a diversity analysis of the GPU feature space (e.g., pairwise distances and clustering on public specs) in the revised manuscript to show the set is not narrowly clustered. revision: partial

  2. Referee: [Abstract] Abstract: the claim that the models 'enable generalization to unseen NVIDIA server-grade GPUs and LLMs' rests on the public-spec features being sufficient; however, with N=8 GPUs the paper must demonstrate (via feature ablation or diversity analysis) that the regressor is not simply memorizing the sampled hardware cluster.

    Authors: We will add both requested analyses in the revision. First, a feature ablation study will quantify the contribution of GPU-specific public features (e.g., SM count, memory bandwidth, TDP) by retraining with subsets removed and reporting the resulting error increases. Second, we will include a diversity analysis section that examines the distribution and variance of the 8 GPUs across public specifications to demonstrate they do not form a tight cluster. These additions will provide direct evidence that performance derives from the public features rather than memorization of the sampled set. revision: yes
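One way the promised diversity analysis could look is a table of pairwise distances between GPUs in a standardized public-spec space. The sketch below is an assumption about the form of that analysis, not the authors' method; the GPU names and spec values are placeholders, and the three columns stand in for features such as SM count, memory bandwidth, and TDP.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def standardize(specs: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Z-score each spec column across GPUs (constant columns map to 0)."""
    names = list(specs)
    columns = list(zip(*(specs[name] for name in names)))
    scaled_columns = []
    for column in columns:
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append(
            [(v - mu) / sigma if sigma > 0 else 0.0 for v in column])
    return {name: [col[i] for col in scaled_columns]
            for i, name in enumerate(names)}


def pairwise_distances(
    specs: Dict[str, List[float]],
) -> Dict[Tuple[str, str], float]:
    """Euclidean distance between every GPU pair in standardized space."""
    z = standardize(specs)
    return {(a, b): sqrt(sum((u - v) ** 2 for u, v in zip(z[a], z[b])))
            for a, b in combinations(z, 2)}


# Placeholder specs (not real hardware values): gpu_a and gpu_b are identical.
example_specs = {
    "gpu_a": [60.0, 900.0, 150.0],
    "gpu_b": [60.0, 900.0, 150.0],
    "gpu_c": [130.0, 3000.0, 700.0],
}
distances = pairwise_distances(example_specs)
```

Many near-zero distances would indicate a tightly clustered hardware set, which is the situation the referee is concerned about.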

Circularity Check

0 steps flagged

No circularity; standard ML training + CV on public features

The paper trains supervised regressors on profiling data collected from 42 LLMs and 8 GPUs, using only public metadata and spec features as inputs, then reports leave-one-GPU-out and leave-one-LLM-out CV errors. This is ordinary empirical modeling with held-out evaluation; no equation reduces to its own fitted parameters by construction, no self-citation chain carries the central claim, and no ansatz or uniqueness result is smuggled in. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the performance of trained ML models; specific free parameters beyond the general training are not detailed in the abstract.

free parameters (1)
  • ML model parameters
    The predictive models are trained, implying parameters fitted to the dataset of LLM and GPU combinations.

pith-pipeline@v0.9.1-grok · 5854 in / 1038 out tokens · 32972 ms · 2026-07-03T05:40:17.487870+00:00 · methodology

