pith. machine review for the scientific record.

arxiv: 2604.04745 · v1 · submitted 2026-04-06 · 💻 cs.DC · cs.PF

Recognition: no theorem link

The Energy Cost of Execution-Idle in GPU Clusters

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords GPU energy efficiency · execution-idle · data center power · AI cluster telemetry · power management · GPU idle states · energy waste

The pith

GPUs spend nearly 20% of execution time in a high-power low-activity state that wastes 10.7% of cluster energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies execution-idle as a distinct operating mode in which GPUs draw high power while showing near-zero visible activity. Analysis of per-second telemetry from a large academic AI cluster shows this state occurs consistently across workloads and GPU generations, accounting for 19.7% of in-execution time and 10.7% of energy. Two mitigation prototypes demonstrate that downscaling power during these periods, or reducing exposure to them through deliberate load imbalance, can cut energy use, albeit with performance costs. The central argument is that energy-efficient GPU systems must treat execution-idle as an explicit first-class state rather than assuming low activity implies low power.

Core claim

Execution-idle is a recurring low-activity yet high-power state in real GPU deployments that accounts for 19.7% of in-execution time and 10.7% of energy across diverse workloads and multiple GPU generations, indicating that future systems should reduce both its cost and the time spent in it.

What carries the argument

Execution-idle: the operating state in which GPUs remain at high power levels even when visible activity is near zero, measured via per-second telemetry.
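
To make the definition concrete, the sketch below classifies per-second samples and computes the in-execution fractions that the 19.7% and 10.7% figures refer to. The field names and both thresholds are assumptions made for illustration; the paper's actual classification rules may differ.

```python
# Minimal sketch, assuming hypothetical field names and thresholds: classify
# per-second telemetry samples into deep-idle / execution-idle / active, then
# compute the "in-execution" fractions. Thresholds are illustrative only.
from dataclasses import dataclass

UTIL_ACTIVE_PCT = 5.0   # below this SM/DRAM utilization, activity is "near zero"
POWER_IDLE_W = 60.0     # below this draw, the GPU is treated as deep idle

@dataclass
class Sample:
    sm_util: float      # % SM utilization over the 1 s window
    dram_util: float    # % memory-controller utilization
    power_w: float      # average board power over the window
    job_resident: bool  # a job currently occupies the GPU

def classify(s: Sample) -> str:
    low_activity = s.sm_util < UTIL_ACTIVE_PCT and s.dram_util < UTIL_ACTIVE_PCT
    if not s.job_resident or (low_activity and s.power_w < POWER_IDLE_W):
        return "deep-idle"
    if low_activity:                      # resident, near-zero activity, high power
        return "execution-idle"
    return "active"

def in_execution_fractions(samples: list[Sample]) -> tuple[float, float]:
    """Fractions of in-execution time and energy spent in execution-idle.
    The denominator excludes deep idle, following the paper's framing."""
    idle_t = idle_e = exec_t = exec_e = 0.0
    for s in samples:
        state = classify(s)
        if state == "deep-idle":
            continue
        exec_t += 1.0                     # one second per sample
        exec_e += s.power_w               # watts over 1 s ~ joules
        if state == "execution-idle":
            idle_t += 1.0
            idle_e += s.power_w
    return idle_t / max(exec_t, 1.0), idle_e / max(exec_e, 1.0)
```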

If this is right

  • Automatic downscaling of GPUs during detected execution-idle periods can reduce energy consumption at the cost of some performance overhead (a minimal controller sketch follows this list).
  • Scheduling policies that deliberately introduce or exploit load imbalance can shorten the total time spent in execution-idle.
  • Energy accounting models for GPU clusters must separately track execution-idle rather than assuming power scales directly with visible utilization.
  • Hardware and software co-design should prioritize faster or lower-cost transitions out of execution-idle to improve overall efficiency.
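
A minimal controller along the lines of the first bullet could look like the sketch below, which uses NVML locked-clock controls through the pynvml bindings. This is not the paper's prototype: the clock range, the three-sample trigger, and the one-second polling cadence are assumptions, and setting locked clocks generally requires administrative privileges.

```python
# Hedged sketch of execution-idle-aware downclocking via NVML (pynvml).
# Not the paper's prototype; clock values and the trigger are illustrative.
import time
import pynvml

LOW_CLOCK_MHZ = 600        # assumed ceiling while the GPU looks execution-idle
MIN_CLOCK_MHZ = 210        # assumed floor for the locked-clock range
UTIL_ACTIVE_PCT = 5        # same near-zero-activity notion as above

def control_loop(gpu_index: int = 0, trigger: int = 3) -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    quiet, locked = 0, False
    try:
        while True:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            quiet = quiet + 1 if (util.gpu < UTIL_ACTIVE_PCT
                                  and util.memory < UTIL_ACTIVE_PCT) else 0
            if quiet >= trigger and not locked:
                # Cap clocks while the GPU appears execution-idle.
                pynvml.nvmlDeviceSetGpuLockedClocks(handle, MIN_CLOCK_MHZ, LOW_CLOCK_MHZ)
                locked = True
            elif quiet == 0 and locked:
                # Activity resumed: release the cap (pays a ramp-up latency).
                pynvml.nvmlDeviceResetGpuLockedClocks(handle)
                locked = False
            time.sleep(1.0)   # match the per-second telemetry cadence
    finally:
        if locked:
            pynvml.nvmlDeviceResetGpuLockedClocks(handle)
        pynvml.nvmlShutdown()
```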

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Workload schedulers in production clusters could incorporate execution-idle detection to dynamically rebalance tasks and shrink exposure (a routing sketch follows this list).
  • Similar high-power low-activity states may appear in other accelerators, suggesting the need for cross-device power studies.
  • Cloud billing could be adjusted to reflect actual energy drawn during execution-idle rather than assuming idle periods are cheap.
  • Longer-term, GPU firmware might expose finer-grained power states that allow quicker exits from execution-idle without full downscaling.
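
As a sketch of the scheduler idea in the first bullet above, the hypothetical router below packs requests onto as few GPUs as possible instead of spreading them evenly, so the remaining GPUs stay idle long enough to be released or power-capped. The capacity model is invented for illustration and is not the paper's load-imbalance prototype.

```python
# Hedged sketch: deliberately imbalanced routing to shrink execution-idle exposure.
class ImbalancedRouter:
    def __init__(self, num_gpus: int, per_gpu_capacity: int):
        self.capacity = per_gpu_capacity
        self.inflight = [0] * num_gpus          # requests currently on each GPU

    def route(self) -> int:
        # Prefer the lowest-indexed GPU with spare capacity (pack, don't spread).
        for gpu, load in enumerate(self.inflight):
            if load < self.capacity:
                self.inflight[gpu] += 1
                return gpu
        # Every packed GPU is full: spill to the least-loaded one.
        gpu = min(range(len(self.inflight)), key=self.inflight.__getitem__)
        self.inflight[gpu] += 1
        return gpu

    def complete(self, gpu: int) -> None:
        self.inflight[gpu] -= 1

    def idle_gpus(self) -> list[int]:
        # Candidates for deep idle or power capping under bursty demand.
        return [g for g, load in enumerate(self.inflight) if load == 0]
```

Under bursty serving traces, packing of this kind trades higher tail latency on the loaded GPUs for longer, deeper idle periods on the rest, which is the trade-off Figure 10 reports.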

Load-bearing premise

Per-second telemetry from the academic cluster accurately captures execution state boundaries and power draw without significant measurement error or workload-specific bias.

What would settle it

Controlled experiments that directly measure instantaneous power and activity counters on the same GPUs during periods labeled execution-idle versus true idle or full compute, checking whether power truly stays high while activity stays low.
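
A minimal version of that experiment, assuming a CUDA-capable machine with PyTorch and the pynvml bindings, is sketched below: a tensor stays resident while the driver alternates compute bursts with sleep, and NVML power and utilization are logged throughout. Phase lengths, matrix size, and sampling cadence are arbitrary illustrative choices.

```python
# Hedged sketch of the controlled check: loaded-but-idle vs. busy power on one GPU.
import time
import pynvml
import torch

def sample(handle):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
    return time.time(), util.gpu, util.memory, power_w

def run(phase_s: float = 30.0, cycles: int = 5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    a = torch.randn(8192, 8192, device="cuda")    # state stays resident throughout
    log = []
    for _ in range(cycles):
        end = time.time() + phase_s               # busy phase: back-to-back matmuls
        while time.time() < end:
            a = torch.matmul(a, a).clamp_(-1.0, 1.0)
            torch.cuda.synchronize()
            log.append(("busy", *sample(handle)))
        end = time.time() + phase_s               # loaded-but-idle phase
        while time.time() < end:
            time.sleep(0.1)
            log.append(("loaded-idle", *sample(handle)))
    pynvml.nvmlShutdown()
    return log   # compare "loaded-idle" power against a separate deep-idle baseline
```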

Figures

Figures reproduced from arXiv: 2604.04745 by Daniel Vosler, Dimitrios Skarlatos, Emma Strubell, Jared Fernandez, Justine Sherry, Vasilis Kypriotis, Yiran Lei.

Figure 1: CPU power falls with idle time, but GPU power remains elevated even when a loaded program is fully idle.
Figure 2: Time-aligned power, SM and DRAM utilization, and normalized frequency for a job on an L40S GPU, illustrating the execution-idle state.
Figure 4: Power in the execution-idle state remains substantially above deep idle across all GPU models in our study.
Figure 5: Execution-idle time and energy fractions across academic workload categories and replayed industry serving traces.
Figure 6: CDF of per-GPU inter-request intervals for replayed industry serving traces.
Figure 7: CDF of per-job execution-idle time and energy fractions.
Figure 8: CDF of execution-idle interval durations.
Figure 10: Energy, p95 latency, and average GPU utilization at different levels of load imbalance, normalized to the 8-active-GPU baseline.
Figure 11: Power over time under SM-only and SM+memory execution-idle-aware frequency control.
Figure 12: Power–latency trade-off of execution-idle-aware frequency downscaling.
Original abstract

GPUs are becoming a major contributor to data center power, yet unlike CPUs, they can remain at high power even when visible activity is near zero. We call this state execution-idle. Using per-second telemetry from a large academic AI cluster, we characterize execution-idle as a recurring low-activity yet high-power state in real deployments. Across diverse workloads and multiple GPU generations, it accounts for 19.7% of in-execution time and 10.7% of energy. This suggests a need to both reduce the cost of execution-idle and reduce exposure to it. We therefore build two prototypes: one uses automatic downscaling during execution-idle, and the other uses load imbalance to reduce exposure, both with performance trade-offs. These findings suggest that future energy-efficient GPU systems should treat execution-idle as a first-class operating state.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the 'execution-idle' state in GPUs, where devices exhibit near-zero visible activity yet sustain high power draw. Using per-second telemetry from a large academic AI cluster, it reports that this state accounts for 19.7% of in-execution time and 10.7% of energy across diverse workloads and multiple GPU generations. The authors present two prototypes—one for automatic downscaling during execution-idle and one leveraging load imbalance to reduce exposure—both with performance trade-offs, and argue that future GPU systems should treat execution-idle as a first-class operating state.

Significance. If the measurements prove robust, the work identifies a substantial, recurring energy overhead in real GPU deployments that has received limited prior attention, with direct implications for data-center power optimization. The empirical scope across workloads and GPU generations, combined with concrete prototype implementations, strengthens the practical relevance. The study benefits from access to production cluster telemetry, enabling claims grounded in operational data rather than synthetic benchmarks.

major comments (3)
  1. [Methods] Methods section: The classification of execution-idle periods and attribution of power draw depends on per-second telemetry thresholds for utilization, kernel activity, and power without reported validation against higher-resolution tools (e.g., Nsight) or direct wattmeter traces. This measurement pipeline is load-bearing for the central 19.7% time and 10.7% energy claims; sampling granularity, lag, or workload-specific bias could systematically alter the reported fractions.
  2. [Results] Results section: The headline percentages are given as point estimates without error bars, confidence intervals, sensitivity analysis on classification thresholds, or statistical tests for consistency across workloads and GPU generations. This omission undermines evaluation of whether the figures are robust or sensitive to the exact definition of execution periods.
  3. [Prototype Evaluation] Prototype evaluation: The energy savings and performance trade-offs of the two prototypes are described qualitatively but lack quantitative metrics (e.g., measured energy reduction percentages, latency overheads, or comparison to baselines) that would allow assessment of their effectiveness relative to the identified overhead.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the exact telemetry fields and thresholds used to delimit execution-idle states for improved reproducibility.
  2. [Figures/Tables] Figure captions and table legends should include the precise definitions of 'in-execution time' and how idle periods were aggregated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional analyses and clarifications.

Point-by-point responses
  1. Referee: [Methods] Methods section: The classification of execution-idle periods and attribution of power draw depends on per-second telemetry thresholds for utilization, kernel activity, and power without reported validation against higher-resolution tools (e.g., Nsight) or direct wattmeter traces. This measurement pipeline is load-bearing for the central 19.7% time and 10.7% energy claims; sampling granularity, lag, or workload-specific bias could systematically alter the reported fractions.

    Authors: We acknowledge that validation against higher-resolution tools like Nsight or direct wattmeter measurements would provide additional confidence in the classification. However, our study relies on production cluster telemetry at per-second granularity, which is the standard monitoring available across the entire cluster. Instrumenting individual workloads with Nsight is not feasible at this scale. To strengthen the analysis, we have conducted a sensitivity study varying the utilization and power thresholds and will include the results in the revised manuscript, demonstrating the stability of the reported fractions. We will also add a discussion of potential limitations due to sampling granularity. revision: partial

  2. Referee: [Results] Results section: The headline percentages are given as point estimates without error bars, confidence intervals, sensitivity analysis on classification thresholds, or statistical tests for consistency across workloads and GPU generations. This omission undermines evaluation of whether the figures are robust or sensitive to the exact definition of execution periods.

    Authors: We agree that including measures of uncertainty and robustness would improve the presentation of the results. In the revised manuscript, we will report bootstrapped 95% confidence intervals for the 19.7% and 10.7% figures, computed across the diverse workloads and GPU generations, and we will include a sensitivity analysis on the classification thresholds together with statistical tests confirming consistency across subsets of the data (a sketch of the bootstrap computation appears after the final response). revision: yes

  3. Referee: [Prototype Evaluation] Prototype evaluation: The energy savings and performance trade-offs of the two prototypes are described qualitatively but lack quantitative metrics (e.g., measured energy reduction percentages, latency overheads, or comparison to baselines) that would allow assessment of their effectiveness relative to the identified overhead.

    Authors: The prototype evaluations were performed in a testbed environment. We recognize the value of quantitative metrics for assessing effectiveness. We will expand the evaluation section to include quantitative metrics from our experiments, such as measured energy reduction percentages, latency overheads, and comparisons to baseline configurations without the interventions. revision: yes
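
The bootstrapped intervals promised in response 2 are straightforward to compute from per-job totals; the sketch below resamples jobs with replacement and recomputes the pooled fractions. The tuple layout and the choice of jobs as the resampling unit are assumptions for illustration, not the authors' stated procedure.

```python
# Hedged sketch: percentile-bootstrap CIs for the pooled execution-idle fractions.
# Each job tuple: (idle_seconds, in_execution_seconds, idle_joules, in_execution_joules).
import random

def bootstrap_ci(jobs, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)

    def pooled(sample):
        idle_t = sum(j[0] for j in sample)
        exec_t = sum(j[1] for j in sample)
        idle_e = sum(j[2] for j in sample)
        exec_e = sum(j[3] for j in sample)
        return idle_t / exec_t, idle_e / exec_e   # time fraction, energy fraction

    stats = [pooled([rng.choice(jobs) for _ in jobs]) for _ in range(n_boot)]
    time_fracs = sorted(s[0] for s in stats)
    energy_fracs = sorted(s[1] for s in stats)
    lo = int(n_boot * alpha / 2)
    hi = int(n_boot * (1 - alpha / 2)) - 1
    return (time_fracs[lo], time_fracs[hi]), (energy_fracs[lo], energy_fracs[hi])
```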

Circularity Check

0 steps flagged

Empirical measurement study with no circular derivation chain

Full rationale

The paper reports observed fractions of time and energy spent in an execution-idle state directly from per-second cluster telemetry across workloads and GPU generations. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs; the two prototypes are built as engineering responses to the measurements rather than as derivations. Self-citations, if present, are not load-bearing for the central percentages or claims. The work is self-contained as an observational study against external benchmarks (real cluster logs), satisfying the criteria for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the collected telemetry faithfully reflects GPU activity and power states; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Per-second telemetry accurately distinguishes execution periods from true idle and correctly measures instantaneous power.
    Invoked when the authors compute the 19.7% and 10.7% figures from cluster logs.
invented entities (1)
  • execution-idle state · no independent evidence
    purpose: Label for the observed high-power low-activity GPU regime
    Newly named category derived from telemetry observations; no independent falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5461 in / 1298 out tokens · 45631 ms · 2026-05-10T18:59:50.476624+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 2 internal anchors

  1. [1]

    Luiz André Barroso and Urs Hölzle. 2007. The Case for Energy- Proportional Computing.Computer40, 12 (2007), 33–37. doi:10.1109/ MC.2007.443

  2. [2]

    Sangjin Choi, Inhoe Koo, Jeongseob Ahn, Myeongjae Jeon, and Youngjin Kwon. 2023. EnvPipe: Performance-preserving DNN Train- ing Framework for Saving Energy. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 851– 864.https://www.usenix.org/conference/atc23/presentation/choi

  3. [3]

    Jae-Won Chung, Jeff J. Ma, Ruofan Wu, Jiachen Liu, Oh Jun Kweon, Yux- uan Xia, Zhiyu Wu, and Mosharaf Chowdhury. 2025. The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization. arXiv:2505.06371 [cs.LG]https://arxiv.org/abs/2505. 06371

  4. [4]

    Jae-Won Chung, Ruofan Wu, Jeff J. Ma, and Mosharaf Chowdhury. 2026. Where Do the Joules Go? Diagnosing Inference Energy Consumption. arXiv:2601.22076 [cs.LG]https://arxiv.org/abs/2601.22076

  5. [5]

    Mariana Toledo Costa, Antigoni Georgiadou, III White, James B., Bruno Villasenor Alvarez, Jordà Polo, Woong Shin, Philippe Olivier Alexandre Navaux, Bronson Messer, and Arthur Francisco Lorenzon. 2025. Characterizing the Impact of GPU Power Man- agement on an Exascale System. InProceedings of the SC ’25 Work- shops of the International Conference for High...

  6. [6]

    Electric Power Research Institute. 2026.Powering Intelligence: An- alyzing Artificial Intelligence and Data Center Energy Consumption. Technical Report 3002034696. Electric Power Research Institute (EPRI). https://www.epri.com/research/products/000000003002034696

  7. [7]

    Luke Emberson and Ben Cottier. 2025. GPUs Account for About 40% of Power Usage in AI Data Centers.https://epoch.ai/data-insights/gpus- power-usage-in-ai-data-centersEpoch AI analysis. 12

  8. [8]

    Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. 2007. Power provisioning for a warehouse-sized computer.SIGARCH Comput. Archit. News35, 2 (June 2007), 13–23. doi:10.1145/1273440.1250665

  9. [9]

    Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luc- cioni, and Emma Strubell. 2025. Energy Considerations of Large Language Model Inference and Efficiency Optimizations. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and M...

  10. [10]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low- Latency Serverless Inference for Large Language Models. In18th USENIX Symposium on Operating Systems Design and Implementa- tion (OSDI 24). USENIX Association, Santa Clara, CA, 135–153.https: //www.usenix.org/conference/osdi24/pr...

  11. [11]

    Sébastien Godard. 2025. sysstat: Performance Monitoring Tools for Linux.https://github.com/sysstat/sysstatIncludes the pidstat utility; accessed 2026-03-30

  12. [12]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  13. [13]

    Alan Gray. 2024. Maximizing Energy and Power Efficiency in Applications with NVIDIA GPUs.https://developer.nvidia.com/blog/ maximizing-energy-and-power-efficiency-in-applications-with- nvidia-gpus/NVIDIA Technical Blog

  14. [14]

    Alastair Green, Humayun Tai, Jesse Noffsinger, Pankaj Sachdeva, Arjita Bhan, and Raman Sharma. 2024. How Data Centers and the Energy Sector Can Sate AI’s Hunger for Power. https://www.mckinsey.com/industries/private-capital/our- insights/how-data-centers-and-the-energy-sector-can-sate- ais-hunger-for-powerMcKinsey & Company article

  15. [15]

    Hewlett Packard Enterprise. 2026. Workload profiles | HPE iLO 5 User Guide. HPC profile disables power management to optimize sustained bandwidth and compute capacity

  16. [16]

    Ali Jahanshahi, Hadi Zamani Sabzi, Chester Lau, and Daniel Wong

  17. [17]

    GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers.IEEE Computer Architecture Letters19, 2 (2020), 139–142. doi:10.1109/LCA.2020.3023723

  18. [18]

    M Jette, C Dunlap, J Garlick, and M Grondona. 2002. SLURM: Simple Linux Utility for Resource Management. (07 2002).https://www.osti. gov/biblio/15002962

  19. [19]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  20. [20]

    Imran Latif, Alex C. Newkirk, Matthew R. Carbone, Arslan Munir, Yuewei Lin, Jonathan Koomey, Xi Yu, and Zhihua Dong. 2025. Single- Node Power Demand During AI Training: Measurements on an 8-GPU NVIDIA H100 System.IEEE Access13 (2025), 61740–61747. doi:10. 1109/ACCESS.2025.3554728

  21. [21]

    Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat

  22. [22]

    Estimating the carbon footprint of BLOOM, a 176B parameter language model.J. Mach. Learn. Res.24, 1, Article 253 (Jan. 2023), 15 pages

  23. [23]

    Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2024. Power Hun- gry Processing: Watts Driving the Cost of AI Deployment?. InProceed- ings of the 2024 ACM Conference on Fairness, Accountability, and Trans- parency(Rio de Janeiro, Brazil)(FAccT ’24). Association for Computing Machinery, New York, NY, USA, 85–99. doi:10.1145/3630106.3658542

  24. [24]

    Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierar- chical density based clustering.Journal of Open Source Software2, 11 (2017), 205. doi:10.21105/joss.00205

  25. [25]

    Xinxin Mei, Qiang Wang, and Xiaowen Chu. 2016. A Survey and Measurement Study of GPU DVFS on Energy Conservation. arXiv:1610.01784 [cs.DC]https://arxiv.org/abs/1610.01784

  26. [26]

    Sreedhar Narayanaswamy, Pratikkumar Dilipkumar Patel, Ian Karlin, Apoorv Gupta, Sudhir Saripalli, and Janey Guo. 2025. Datacenter Energy Optimized Power Profiles. arXiv:2510.03872 [cs.DC]https: //arxiv.org/abs/2510.03872

  27. [27]

    Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, and Yong Chen. 2025. TokenPowerBench: Benchmarking the Power Consumption of LLM Inference. arXiv:2512.03024 [cs.LG]https: //arxiv.org/abs/2512.03024

  28. [28]

    Chenxu Niu, Wei Zhang, Yongjian Zhao, and Yong Chen. 2025. Energy Efficient or Exhaustive? Benchmarking Power Consumption of LLM Inference Engines.SIGENERGY Energy Inform. Rev.5, 2 (Aug. 2025), 56–62. doi:10.1145/3757892.3757900

  29. [29]

    NVIDIA. 2025. Driver Persistence.https://docs.nvidia.com/deploy/ driver-persistence/index.html

  30. [30]

    NVIDIA. 2025. NVIDIA Blackwell B200 GPU.https://images.nvidia. com/aem-dam/Solutions/documents/HGX-B200-PCF-Summary.pdf

  31. [31]

    NVIDIA Corporation. 2025. NVIDIA Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

  32. [32]

    NVIDIA Corporation. 2026. NVIDIA A100 Tensor Core GPU.https: //www.nvidia.com/en-us/data-center/a100/

  33. [33]

    NVIDIA Corporation. 2026.NVIDIA Data Center GPU Manager (DCGM) Documentation.https://docs.nvidia.com/datacenter/dcgm/ latest/index.html

  34. [34]

    NVIDIA Corporation. 2026. NVIDIA H100 GPU.https://www.nvidia. com/en-us/data-center/h100/

  35. [35]

    NVIDIA Corporation. 2026. NVIDIA L40S GPU for AI and Graphics Performance.https://www.nvidia.com/en-us/data-center/l40s/

  36. [36]

    NVIDIA Corporation. 2026.NVIDIA Management Library (NVML) API Reference Guide.https://docs.nvidia.com/deploy/nvml-api/index.html

  37. [37]

    NVIDIA Corporation. 2026. NVIDIA RTX 6000 Ada Generation Graph- ics Card.https://www.nvidia.com/en-us/products/workstations/rtx- 6000/

  38. [38]

    NVIDIA Corporation. 2026. NVIDIA RTX A6000.https://www.nvidia. com/en-us/products/workstations/rtx-a6000/

  39. [39]

    NVIDIA Corporation. 2026. NVIDIA System Management Interface (nvidia-smi).https://docs.nvidia.com/deploy/nvidia-smi/

  40. [40]

    Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, and Zhihao Jia. 2025. FlexLLM: Token- Level Co-Serving of LLM Inference and Finetuning with SLO Guaran- tees. arXiv:2402.18789 [cs.DC]https://arxiv.org/abs/2402.18789

  41. [41]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh War- rier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing Power Management Opportunities for LLMs in the Cloud. InProceed- ings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3(La Jolla, CA, USA)(ASP...

  42. [42]

    Pratikkumar Patel and Sreedhar Narayanaswamy. 2025. Optimize Data Center Efficiency for AI and HPC Workloads with Power Profiles.https://developer.nvidia.com/blog/optimize-data-center- efficiency-for-ai-and-hpc-workloads-with-power-profiles/

  43. [43]

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. 2022. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink.Computer55, 7 (2022), 18–28. doi:10.1109/MC.2022.3148714 13

  44. [44]

    Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, and Harry Xu. 2025. ConServe: Fine-Grained GPU Harvest- ing for LLM Online and Offline Co-Serving. arXiv:2410.01228 [cs.DC] https://arxiv.org/abs/2410.01228

  45. [45]

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer. 2024. Power-aware Deep Learning Model Serving with 𝜇-Serve. In2024 USENIX Annual Technical Conference (USENIX ATC 24). USENIX Association, Santa Clara, CA, 75–93.https: //www.usenix.org/conference/atc24/p...

  46. [46]

    Giampaolo Rodolà. 2026. psutil: Cross-platform lib for process and system monitoring in Python.https://github.com/giampaolo/psutil

  47. [47]

    Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serv- ing. In2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 397–411.https://www.usenix.org/conference/ atc21/presentation/romero

  48. [48]

    Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, De- vesh Tiwari, and Vijay Gadepally. 2023. From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. arXiv:2310.03003 [cs.CL]https://arxiv.org/abs/2310.03003

  49. [49]

    Arman Shehabi, Sarah Josephine Smith, Alex Hubbard, Alexander Newkirk, Nuoa Lei, Md AbuBakar Siddik, Billie Holecek, Jonathan G. Koomey, Eric R. Masanet, and Dale A. Sartor. 2024.2024 United States Data Center Energy Usage Report. Technical Report. Lawrence Berkeley National Laboratory. doi:10.71468/P1WC7Q

  50. [50]

    Varsha Singhania, Shaizeen Aga, and Mohamed Assem Ibrahim. 2025. FinGraV: Methodology for Fine-Grain GPU Power Visibility and In- sights. arXiv:2412.12426 [cs.AR]https://arxiv.org/abs/2412.12426

  51. [51]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In2025 IEEE International Sympo- sium on High Performance Computer Architecture (HPCA). 1348–1362. doi:10.1109/HPCA61900.2025.00102

  52. [52]

    Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xiaowen Chu. 2019. The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study. InProceedings of the Tenth ACM In- ternational Conference on Future Energy Systems(Phoenix, AZ, USA) (e-Energy ’19). Association for Computing Machinery, New York, NY, USA, 315–325. doi:10.1145/3307772.3328315

  53. [53]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie- Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]https://arxiv. org/abs/2302.13971

  54. [54]

    Daniel Velicka, Ondrej Vysocky, and Lubomir Riha. 2025. Method- ology for GPU Frequency Switching Latency Measurement. arXiv:2502.20075 [cs.DC]https://arxiv.org/abs/2502.20075

  55. [55]

    Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider. In2025 USENIX Annual Technical Conference (USENIX ATC 25). USENIX Association.https://www. usenix.org/conference/atc25/presentation/wang-jiahao

  56. [56]

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Dis- covery and Data Mining V.2 (KDD ’25). ACM, T...

  57. [57]

    Grant Wilkins, Srinivasan Keshav, and Richard Mortier. 2024. Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems. arXiv:2407.04014 [cs.DC] https://arxiv.org/abs/2407.04014

  58. [58]

    Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, New- sha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anas- tasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Ben- jamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, M...

  59. [59]

    xAI. 2026. Colossus: The World’s Largest AI Supercomputer.https: //x.ai/colossus

  60. [60]

    Jie You, Jae-Won Chung, and Mosharaf Chowdhury. 2023. Zeus: Under- standing and Optimizing GPU Energy Consumption of DNN Training. In20th USENIX Symposium on Networked Systems Design and Im- plementation (NSDI 23). USENIX Association, Boston, MA, 119–139. https://www.usenix.org/conference/nsdi23/presentation/you

  61. [61]

    Junyeol Yu, Jongseok Kim, and Euiseong Seo. 2023. Know Your Enemy To Save Cloud Energy: Energy-Performance Characteriza- tion of Machine Learning Serving. In2023 IEEE International Sympo- sium on High-Performance Computer Architecture (HPCA). 842–854. doi:10.1109/HPCA56546.2023.10070943

  62. [62]

    Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, and Ying Sheng. 2025. Prism: Unleashing GPU Sharing for Cost- Efficient Multi-LLM Serving. arXiv:2505.04021 [cs.DC]https://arxiv. org/abs/2505.04021

  63. [63]

    Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, and Haibo Chen. 2025. BLITZSCALE: fast and live large model autoscaling with O(1) host caching. InProceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (Boston, MA, USA)(OSDI ’25). USENIX Association, USA, Article 16, 19 pages

  64. [64]

    Yijia Zhang, Qiang Wang, Zhe Lin, Pengxiang Xu, and Bingqiang Wang. 2024. Improving GPU Energy Efficiency through an Application- transparent Frequency Scaling Policy with Performance Assurance. In Proceedings of the Nineteenth European Conference on Computer Systems (Athens, Greece)(EuroSys ’24). Association for Computing Machinery, New York, NY, USA, 76...