pith. machine review for the scientific record.

arxiv: 2604.04745 · v1 · submitted 2026-04-06 · 💻 cs.DC · cs.PF

Recognition: no theorem link

The Energy Cost of Execution-Idle in GPU Clusters

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords GPU energy efficiency · execution-idle · data center power · AI cluster telemetry · power management · GPU idle states · energy waste

The pith

GPUs spend nearly 20% of execution time in a high-power low-activity state that wastes 10.7% of cluster energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies execution-idle as a distinct operating mode in which GPUs draw high power while showing near-zero visible activity. Analysis of per-second telemetry from a large academic AI cluster shows this state occurs consistently across workloads and GPU generations, accounting for 19.7% of in-execution time and 10.7% of energy. Two mitigation prototypes demonstrate that downscaling power during these periods, or reducing exposure to them through deliberate load imbalance, can cut energy use, albeit with performance costs. The central argument is that energy-efficient GPU systems must treat execution-idle as an explicit first-class state rather than assuming low activity implies low power.

Core claim

Execution-idle is a recurring low-activity yet high-power state in real GPU deployments that accounts for 19.7% of in-execution time and 10.7% of energy across diverse workloads and multiple GPU generations, indicating that future systems should reduce both its cost and the time spent in it.

What carries the argument

Execution-idle: the operating state in which GPUs remain at high power levels even when visible activity is near zero, measured via per-second telemetry.
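
To make the definition concrete, the sketch below classifies per-second samples and computes the in-execution fractions that the 19.7% and 10.7% figures refer to. The field names and both thresholds are assumptions made for illustration; the paper's actual classification rules may differ.

```python
# Minimal sketch, assuming hypothetical field names and thresholds: classify
# per-second telemetry samples into deep-idle / execution-idle / active, then
# compute the "in-execution" fractions. Thresholds are illustrative only.
from dataclasses import dataclass

UTIL_ACTIVE_PCT = 5.0   # below this SM/DRAM utilization, activity is "near zero"
POWER_IDLE_W = 60.0     # below this draw, the GPU is treated as deep idle

@dataclass
class Sample:
    sm_util: float      # % SM utilization over the 1 s window
    dram_util: float    # % memory-controller utilization
    power_w: float      # average board power over the window
    job_resident: bool  # a job currently occupies the GPU

def classify(s: Sample) -> str:
    low_activity = s.sm_util < UTIL_ACTIVE_PCT and s.dram_util < UTIL_ACTIVE_PCT
    if not s.job_resident or (low_activity and s.power_w < POWER_IDLE_W):
        return "deep-idle"
    if low_activity:                      # resident, near-zero activity, high power
        return "execution-idle"
    return "active"

def in_execution_fractions(samples: list[Sample]) -> tuple[float, float]:
    """Fractions of in-execution time and energy spent in execution-idle.
    The denominator excludes deep idle, following the paper's framing."""
    idle_t = idle_e = exec_t = exec_e = 0.0
    for s in samples:
        state = classify(s)
        if state == "deep-idle":
            continue
        exec_t += 1.0                     # one second per sample
        exec_e += s.power_w               # watts over 1 s ~ joules
        if state == "execution-idle":
            idle_t += 1.0
            idle_e += s.power_w
    return idle_t / max(exec_t, 1.0), idle_e / max(exec_e, 1.0)
```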

If this is right

  • Automatic downscaling of GPUs during detected execution-idle periods can reduce energy consumption at the cost of some performance overhead (a minimal controller sketch follows this list).
  • Scheduling policies that deliberately introduce or exploit load imbalance can shorten the total time spent in execution-idle.
  • Energy accounting models for GPU clusters must separately track execution-idle rather than assuming power scales directly with visible utilization.
  • Hardware and software co-design should prioritize faster or lower-cost transitions out of execution-idle to improve overall efficiency.
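
A minimal controller along the lines of the first bullet could look like the sketch below, which uses NVML locked-clock controls through the pynvml bindings. This is not the paper's prototype: the clock range, the three-sample trigger, and the one-second polling cadence are assumptions, and setting locked clocks generally requires administrative privileges.

```python
# Hedged sketch of execution-idle-aware downclocking via NVML (pynvml).
# Not the paper's prototype; clock values and the trigger are illustrative.
import time
import pynvml

LOW_CLOCK_MHZ = 600        # assumed ceiling while the GPU looks execution-idle
MIN_CLOCK_MHZ = 210        # assumed floor for the locked-clock range
UTIL_ACTIVE_PCT = 5        # same near-zero-activity notion as above

def control_loop(gpu_index: int = 0, trigger: int = 3) -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    quiet, locked = 0, False
    try:
        while True:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            quiet = quiet + 1 if (util.gpu < UTIL_ACTIVE_PCT
                                  and util.memory < UTIL_ACTIVE_PCT) else 0
            if quiet >= trigger and not locked:
                # Cap clocks while the GPU appears execution-idle.
                pynvml.nvmlDeviceSetGpuLockedClocks(handle, MIN_CLOCK_MHZ, LOW_CLOCK_MHZ)
                locked = True
            elif quiet == 0 and locked:
                # Activity resumed: release the cap (pays a ramp-up latency).
                pynvml.nvmlDeviceResetGpuLockedClocks(handle)
                locked = False
            time.sleep(1.0)   # match the per-second telemetry cadence
    finally:
        if locked:
            pynvml.nvmlDeviceResetGpuLockedClocks(handle)
        pynvml.nvmlShutdown()
```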

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Workload schedulers in production clusters could incorporate execution-idle detection to dynamically rebalance tasks and shrink exposure (a routing sketch follows this list).
  • Similar high-power low-activity states may appear in other accelerators, suggesting the need for cross-device power studies.
  • Cloud billing could be adjusted to reflect actual energy drawn during execution-idle rather than assuming idle periods are cheap.
  • Longer-term, GPU firmware might expose finer-grained power states that allow quicker exits from execution-idle without full downscaling.
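
As a sketch of the scheduler idea in the first bullet above, the hypothetical router below packs requests onto as few GPUs as possible instead of spreading them evenly, so the remaining GPUs stay idle long enough to be released or power-capped. The capacity model is invented for illustration and is not the paper's load-imbalance prototype.

```python
# Hedged sketch: deliberately imbalanced routing to shrink execution-idle exposure.
class ImbalancedRouter:
    def __init__(self, num_gpus: int, per_gpu_capacity: int):
        self.capacity = per_gpu_capacity
        self.inflight = [0] * num_gpus          # requests currently on each GPU

    def route(self) -> int:
        # Prefer the lowest-indexed GPU with spare capacity (pack, don't spread).
        for gpu, load in enumerate(self.inflight):
            if load < self.capacity:
                self.inflight[gpu] += 1
                return gpu
        # Every packed GPU is full: spill to the least-loaded one.
        gpu = min(range(len(self.inflight)), key=self.inflight.__getitem__)
        self.inflight[gpu] += 1
        return gpu

    def complete(self, gpu: int) -> None:
        self.inflight[gpu] -= 1

    def idle_gpus(self) -> list[int]:
        # Candidates for deep idle or power capping under bursty demand.
        return [g for g, load in enumerate(self.inflight) if load == 0]
```

Under bursty serving traces, packing of this kind trades higher tail latency on the loaded GPUs for longer, deeper idle periods on the rest, which is the trade-off Figure 10 reports.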

Load-bearing premise

Per-second telemetry from the academic cluster accurately captures execution state boundaries and power draw without significant measurement error or workload-specific bias.

What would settle it

Controlled experiments that directly measure instantaneous power and activity counters on the same GPUs during periods labeled execution-idle versus true idle or full compute, checking whether power truly stays high while activity stays low.
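
A minimal version of that experiment, assuming a CUDA-capable machine with PyTorch and the pynvml bindings, is sketched below: a tensor stays resident while the driver alternates compute bursts with sleep, and NVML power and utilization are logged throughout. Phase lengths, matrix size, and sampling cadence are arbitrary illustrative choices.

```python
# Hedged sketch of the controlled check: loaded-but-idle vs. busy power on one GPU.
import time
import pynvml
import torch

def sample(handle):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
    return time.time(), util.gpu, util.memory, power_w

def run(phase_s: float = 30.0, cycles: int = 5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    a = torch.randn(8192, 8192, device="cuda")    # state stays resident throughout
    log = []
    for _ in range(cycles):
        end = time.time() + phase_s               # busy phase: back-to-back matmuls
        while time.time() < end:
            a = torch.matmul(a, a).clamp_(-1.0, 1.0)
            torch.cuda.synchronize()
            log.append(("busy", *sample(handle)))
        end = time.time() + phase_s               # loaded-but-idle phase
        while time.time() < end:
            time.sleep(0.1)
            log.append(("loaded-idle", *sample(handle)))
    pynvml.nvmlShutdown()
    return log   # compare "loaded-idle" power against a separate deep-idle baseline
```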

Figures

Figures reproduced from arXiv: 2604.04745 by Daniel Vosler, Dimitrios Skarlatos, Emma Strubell, Jared Fernandez, Justine Sherry, Vasilis Kypriotis, Yiran Lei.

Figure 1: CPU power falls with idle time, but GPU power remains elevated even when a loaded program is fully idle.
Figure 2: Time-aligned power, SM and DRAM utilization, and normalized frequency for a job on an L40S GPU, illustrating the execution-idle state.
Figure 4: Power in the execution-idle state remains substantially above deep idle across all GPU models in our study.
Figure 5: Execution-idle time and energy fractions across academic workload categories and replayed industry serving traces.
Figure 6: CDF of per-GPU inter-request intervals for replayed industry serving traces.
Figure 7: CDF of per-job execution-idle time and energy fractions.
Figure 8: CDF of execution-idle interval durations.
Figure 10: Energy, p95 latency, and average GPU utilization at different levels of load imbalance, normalized to the 8-active-GPU baseline.
Figure 11: Power over time under SM-only and SM+memory execution-idle-aware frequency control.
Figure 12: Power–latency trade-off of execution-idle-aware frequency downscaling.
Original abstract

GPUs are becoming a major contributor to data center power, yet unlike CPUs, they can remain at high power even when visible activity is near zero. We call this state execution-idle. Using per-second telemetry from a large academic AI cluster, we characterize execution-idle as a recurring low-activity yet high-power state in real deployments. Across diverse workloads and multiple GPU generations, it accounts for 19.7% of in-execution time and 10.7% of energy. This suggests a need to both reduce the cost of execution-idle and reduce exposure to it. We therefore build two prototypes: one uses automatic downscaling during execution-idle, and the other uses load imbalance to reduce exposure, both with performance trade-offs. These findings suggest that future energy-efficient GPU systems should treat execution-idle as a first-class operating state.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the 'execution-idle' state in GPUs, where devices exhibit near-zero visible activity yet sustain high power draw. Using per-second telemetry from a large academic AI cluster, it reports that this state accounts for 19.7% of in-execution time and 10.7% of energy across diverse workloads and multiple GPU generations. The authors present two prototypes—one for automatic downscaling during execution-idle and one leveraging load imbalance to reduce exposure—both with performance trade-offs, and argue that future GPU systems should treat execution-idle as a first-class operating state.

Significance. If the measurements prove robust, the work identifies a substantial, recurring energy overhead in real GPU deployments that has received limited prior attention, with direct implications for data-center power optimization. The empirical scope across workloads and GPU generations, combined with concrete prototype implementations, strengthens the practical relevance. The study benefits from access to production cluster telemetry, enabling claims grounded in operational data rather than synthetic benchmarks.

major comments (3)
  1. [Methods] Methods section: The classification of execution-idle periods and attribution of power draw depends on per-second telemetry thresholds for utilization, kernel activity, and power without reported validation against higher-resolution tools (e.g., Nsight) or direct wattmeter traces. This measurement pipeline is load-bearing for the central 19.7% time and 10.7% energy claims; sampling granularity, lag, or workload-specific bias could systematically alter the reported fractions.
  2. [Results] Results section: The headline percentages are given as point estimates without error bars, confidence intervals, sensitivity analysis on classification thresholds, or statistical tests for consistency across workloads and GPU generations. This omission undermines evaluation of whether the figures are robust or sensitive to the exact definition of execution periods.
  3. [Prototype Evaluation] Prototype evaluation: The energy savings and performance trade-offs of the two prototypes are described qualitatively but lack quantitative metrics (e.g., measured energy reduction percentages, latency overheads, or comparison to baselines) that would allow assessment of their effectiveness relative to the identified overhead.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the exact telemetry fields and thresholds used to delimit execution-idle states for improved reproducibility.
  2. [Figures/Tables] Figure captions and table legends should include the precise definitions of 'in-execution time' and how idle periods were aggregated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional analyses and clarifications.

Point-by-point responses
  1. Referee: [Methods] Methods section: The classification of execution-idle periods and attribution of power draw depends on per-second telemetry thresholds for utilization, kernel activity, and power without reported validation against higher-resolution tools (e.g., Nsight) or direct wattmeter traces. This measurement pipeline is load-bearing for the central 19.7% time and 10.7% energy claims; sampling granularity, lag, or workload-specific bias could systematically alter the reported fractions.

    Authors: We acknowledge that validation against higher-resolution tools like Nsight or direct wattmeter measurements would provide additional confidence in the classification. However, our study relies on production cluster telemetry at per-second granularity, which is the standard monitoring available across the entire cluster. Instrumenting individual workloads with Nsight is not feasible at this scale. To strengthen the analysis, we have conducted a sensitivity study varying the utilization and power thresholds and will include the results in the revised manuscript, demonstrating the stability of the reported fractions. We will also add a discussion of potential limitations due to sampling granularity. revision: partial

  2. Referee: [Results] Results section: The headline percentages are given as point estimates without error bars, confidence intervals, sensitivity analysis on classification thresholds, or statistical tests for consistency across workloads and GPU generations. This omission undermines evaluation of whether the figures are robust or sensitive to the exact definition of execution periods.

    Authors: We agree that including measures of uncertainty and robustness would improve the presentation of the results. In the revised manuscript, we will report bootstrapped 95% confidence intervals for the 19.7% and 10.7% figures, computed across the diverse workloads and GPU generations, and we will include a sensitivity analysis on the classification thresholds together with statistical tests confirming consistency across subsets of the data (a sketch of the bootstrap computation appears after the final response). revision: yes

  3. Referee: [Prototype Evaluation] Prototype evaluation: The energy savings and performance trade-offs of the two prototypes are described qualitatively but lack quantitative metrics (e.g., measured energy reduction percentages, latency overheads, or comparison to baselines) that would allow assessment of their effectiveness relative to the identified overhead.

    Authors: The prototype evaluations were performed in a testbed environment. We recognize the value of quantitative metrics for assessing effectiveness. We will expand the evaluation section to include quantitative metrics from our experiments, such as measured energy reduction percentages, latency overheads, and comparisons to baseline configurations without the interventions. revision: yes
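
The bootstrapped intervals promised in response 2 are straightforward to compute from per-job totals; the sketch below resamples jobs with replacement and recomputes the pooled fractions. The tuple layout and the choice of jobs as the resampling unit are assumptions for illustration, not the authors' stated procedure.

```python
# Hedged sketch: percentile-bootstrap CIs for the pooled execution-idle fractions.
# Each job tuple: (idle_seconds, in_execution_seconds, idle_joules, in_execution_joules).
import random

def bootstrap_ci(jobs, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)

    def pooled(sample):
        idle_t = sum(j[0] for j in sample)
        exec_t = sum(j[1] for j in sample)
        idle_e = sum(j[2] for j in sample)
        exec_e = sum(j[3] for j in sample)
        return idle_t / exec_t, idle_e / exec_e   # time fraction, energy fraction

    stats = [pooled([rng.choice(jobs) for _ in jobs]) for _ in range(n_boot)]
    time_fracs = sorted(s[0] for s in stats)
    energy_fracs = sorted(s[1] for s in stats)
    lo = int(n_boot * alpha / 2)
    hi = int(n_boot * (1 - alpha / 2)) - 1
    return (time_fracs[lo], time_fracs[hi]), (energy_fracs[lo], energy_fracs[hi])
```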

Circularity Check

0 steps flagged

Empirical measurement study with no circular derivation chain

Full rationale

The paper reports observed fractions of time and energy spent in an execution-idle state directly from per-second cluster telemetry across workloads and GPU generations. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs; the two prototypes are built as engineering responses to the measurements rather than as derivations. Self-citations, if present, are not load-bearing for the central percentages or claims. The work is self-contained as an observational study against external benchmarks (real cluster logs), satisfying the criteria for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the collected telemetry faithfully reflects GPU activity and power states; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Per-second telemetry accurately distinguishes execution periods from true idle and correctly measures instantaneous power.
    Invoked when the authors compute the 19.7% and 10.7% figures from cluster logs.
invented entities (1)
  • execution-idle state · no independent evidence
    purpose: Label for the observed high-power low-activity GPU regime
    Newly named category derived from telemetry observations; no independent falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5461 in / 1298 out tokens · 45631 ms · 2026-05-10T18:59:50.476624+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 2 internal anchors

  1. [1]

    Luiz André Barroso and Urs Hölzle. 2007. The Case for Energy- Proportional Computing.Computer40, 12 (2007), 33–37. doi:10.1109/ MC.2007.443

  2. [2]

    Sangjin Choi, Inhoe Koo, Jeongseob Ahn, Myeongjae Jeon, and Youngjin Kwon. 2023. EnvPipe: Performance-preserving DNN Train- ing Framework for Saving Energy. In2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 851– 864.https://www.usenix.org/conference/atc23/presentation/choi

  3. [3]

    Jae-Won Chung, Jeff J. Ma, Ruofan Wu, Jiachen Liu, Oh Jun Kweon, Yux- uan Xia, Zhiyu Wu, and Mosharaf Chowdhury. 2025. The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization. arXiv:2505.06371 [cs.LG]https://arxiv.org/abs/2505. 06371

  4. [4]

    Jae-Won Chung, Ruofan Wu, Jeff J. Ma, and Mosharaf Chowdhury. 2026. Where Do the Joules Go? Diagnosing Inference Energy Consumption. arXiv:2601.22076 [cs.LG]https://arxiv.org/abs/2601.22076

  5. [5]

    Mariana Toledo Costa, Antigoni Georgiadou, III White, James B., Bruno Villasenor Alvarez, Jordà Polo, Woong Shin, Philippe Olivier Alexandre Navaux, Bronson Messer, and Arthur Francisco Lorenzon. 2025. Characterizing the Impact of GPU Power Man- agement on an Exascale System. InProceedings of the SC ’25 Work- shops of the International Conference for High...

  6. [6]

    Electric Power Research Institute. 2026.Powering Intelligence: An- alyzing Artificial Intelligence and Data Center Energy Consumption. Technical Report 3002034696. Electric Power Research Institute (EPRI). https://www.epri.com/research/products/000000003002034696

  7. [7]

    Luke Emberson and Ben Cottier. 2025. GPUs Account for About 40% of Power Usage in AI Data Centers.https://epoch.ai/data-insights/gpus- power-usage-in-ai-data-centersEpoch AI analysis. 12

  8. [8]

    Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso. 2007. Power provisioning for a warehouse-sized computer.SIGARCH Comput. Archit. News35, 2 (June 2007), 13–23. doi:10.1145/1273440.1250665

  9. [9]

    Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luc- cioni, and Emma Strubell. 2025. Energy Considerations of Large Language Model Inference and Efficiency Optimizations. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and M...

  10. [10]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low- Latency Serverless Inference for Large Language Models. In18th USENIX Symposium on Operating Systems Design and Implementa- tion (OSDI 24). USENIX Association, Santa Clara, CA, 135–153.https: //www.usenix.org/conference/osdi24/pr...

  11. [11]

    Sébastien Godard. 2025. sysstat: Performance Monitoring Tools for Linux.https://github.com/sysstat/sysstatIncludes the pidstat utility; accessed 2026-03-30

  12. [12]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  13. [13]

    Alan Gray. 2024. Maximizing Energy and Power Efficiency in Applications with NVIDIA GPUs.https://developer.nvidia.com/blog/ maximizing-energy-and-power-efficiency-in-applications-with- nvidia-gpus/NVIDIA Technical Blog

  14. [14]

    Alastair Green, Humayun Tai, Jesse Noffsinger, Pankaj Sachdeva, Arjita Bhan, and Raman Sharma. 2024. How Data Centers and the Energy Sector Can Sate AI’s Hunger for Power. https://www.mckinsey.com/industries/private-capital/our- insights/how-data-centers-and-the-energy-sector-can-sate- ais-hunger-for-powerMcKinsey & Company article

  15. [15]

    Hewlett Packard Enterprise. 2026. Workload profiles | HPE iLO 5 User Guide. HPC profile disables power management to optimize sustained bandwidth and compute capacity

  16. [16]

    Ali Jahanshahi, Hadi Zamani Sabzi, Chester Lau, and Daniel Wong

  17. [17]

    GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers.IEEE Computer Architecture Letters19, 2 (2020), 139–142. doi:10.1109/LCA.2020.3023723

  18. [18]

    M Jette, C Dunlap, J Garlick, and M Grondona. 2002. SLURM: Simple Linux Utility for Resource Management. (07 2002).https://www.osti. gov/biblio/15002962

  19. [19]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  20. [20]

    Imran Latif, Alex C. Newkirk, Matthew R. Carbone, Arslan Munir, Yuewei Lin, Jonathan Koomey, Xi Yu, and Zhihua Dong. 2025. Single- Node Power Demand During AI Training: Measurements on an 8-GPU NVIDIA H100 System.IEEE Access13 (2025), 61740–61747. doi:10. 1109/ACCESS.2025.3554728

  21. [21]

    Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat

  22. [22]

    Estimating the carbon footprint of BLOOM, a 176B parameter language model.J. Mach. Learn. Res.24, 1, Article 253 (Jan. 2023), 15 pages

  23. [23]

    Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2024. Power Hun- gry Processing: Watts Driving the Cost of AI Deployment?. InProceed- ings of the 2024 ACM Conference on Fairness, Accountability, and Trans- parency(Rio de Janeiro, Brazil)(FAccT ’24). Association for Computing Machinery, New York, NY, USA, 85–99. doi:10.1145/3630106.3658542

  24. [24]

    Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierar- chical density based clustering.Journal of Open Source Software2, 11 (2017), 205. doi:10.21105/joss.00205

  25. [25]

    Xinxin Mei, Qiang Wang, and Xiaowen Chu. 2016. A Survey and Measurement Study of GPU DVFS on Energy Conservation. arXiv:1610.01784 [cs.DC]https://arxiv.org/abs/1610.01784

  26. [26]

    Sreedhar Narayanaswamy, Pratikkumar Dilipkumar Patel, Ian Karlin, Apoorv Gupta, Sudhir Saripalli, and Janey Guo. 2025. Datacenter Energy Optimized Power Profiles. arXiv:2510.03872 [cs.DC]https: //arxiv.org/abs/2510.03872

  27. [27]

    Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, and Yong Chen. 2025. TokenPowerBench: Benchmarking the Power Consumption of LLM Inference. arXiv:2512.03024 [cs.LG]https: //arxiv.org/abs/2512.03024

  28. [28]

    Chenxu Niu, Wei Zhang, Yongjian Zhao, and Yong Chen. 2025. Energy Efficient or Exhaustive? Benchmarking Power Consumption of LLM Inference Engines.SIGENERGY Energy Inform. Rev.5, 2 (Aug. 2025), 56–62. doi:10.1145/3757892.3757900

  29. [29]

    NVIDIA. 2025. Driver Persistence.https://docs.nvidia.com/deploy/ driver-persistence/index.html

  30. [30]

    NVIDIA. 2025. NVIDIA Blackwell B200 GPU.https://images.nvidia. com/aem-dam/Solutions/documents/HGX-B200-PCF-Summary.pdf

  31. [31]

    NVIDIA Corporation. 2025. NVIDIA Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

  32. [32]

    NVIDIA Corporation. 2026. NVIDIA A100 Tensor Core GPU.https: //www.nvidia.com/en-us/data-center/a100/

  33. [33]

    NVIDIA Corporation. 2026.NVIDIA Data Center GPU Manager (DCGM) Documentation.https://docs.nvidia.com/datacenter/dcgm/ latest/index.html

  34. [34]

    NVIDIA Corporation. 2026. NVIDIA H100 GPU.https://www.nvidia. com/en-us/data-center/h100/

  35. [35]

    NVIDIA Corporation. 2026. NVIDIA L40S GPU for AI and Graphics Performance.https://www.nvidia.com/en-us/data-center/l40s/

  36. [36]

    NVIDIA Corporation. 2026.NVIDIA Management Library (NVML) API Reference Guide.https://docs.nvidia.com/deploy/nvml-api/index.html

  37. [37]

    NVIDIA Corporation. 2026. NVIDIA RTX 6000 Ada Generation Graph- ics Card.https://www.nvidia.com/en-us/products/workstations/rtx- 6000/

  38. [38]

    NVIDIA Corporation. 2026. NVIDIA RTX A6000.https://www.nvidia. com/en-us/products/workstations/rtx-a6000/

  39. [39]

    NVIDIA Corporation. 2026. NVIDIA System Management Interface (nvidia-smi).https://docs.nvidia.com/deploy/nvidia-smi/

  40. [40]

    Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, and Zhihao Jia. 2025. FlexLLM: Token- Level Co-Serving of LLM Inference and Finetuning with SLO Guaran- tees. arXiv:2402.18789 [cs.DC]https://arxiv.org/abs/2402.18789

  41. [41]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh War- rier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing Power Management Opportunities for LLMs in the Cloud. InProceed- ings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3(La Jolla, CA, USA)(ASP...

  42. [42]

    Pratikkumar Patel and Sreedhar Narayanaswamy. 2025. Optimize Data Center Efficiency for AI and HPC Workloads with Power Profiles.https://developer.nvidia.com/blog/optimize-data-center- efficiency-for-ai-and-hpc-workloads-with-power-profiles/

  43. [43]

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. 2022. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink.Computer55, 7 (2022), 18–28. doi:10.1109/MC.2022.3148714 13

  44. [44]

    Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Shuo Yang, Yang Wang, Miryung Kim, Yongji Wu, Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, and Harry Xu. 2025. ConServe: Fine-Grained GPU Harvest- ing for LLM Online and Offline Co-Serving. arXiv:2410.01228 [cs.DC] https://arxiv.org/abs/2410.01228

  45. [45]

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer. 2024. Power-aware Deep Learning Model Serving with 𝜇-Serve. In2024 USENIX Annual Technical Conference (USENIX ATC 24). USENIX Association, Santa Clara, CA, 75–93.https: //www.usenix.org/conference/atc24/p...

  46. [46]

    Giampaolo Rodolà. 2026. psutil: Cross-platform lib for process and system monitoring in Python.https://github.com/giampaolo/psutil

  47. [47]

    Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serv- ing. In2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 397–411.https://www.usenix.org/conference/ atc21/presentation/romero

  48. [48]

    Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, De- vesh Tiwari, and Vijay Gadepally. 2023. From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. arXiv:2310.03003 [cs.CL]https://arxiv.org/abs/2310.03003

  49. [49]

    Arman Shehabi, Sarah Josephine Smith, Alex Hubbard, Alexander Newkirk, Nuoa Lei, Md AbuBakar Siddik, Billie Holecek, Jonathan G. Koomey, Eric R. Masanet, and Dale A. Sartor. 2024.2024 United States Data Center Energy Usage Report. Technical Report. Lawrence Berkeley National Laboratory. doi:10.71468/P1WC7Q

  50. [50]

    Varsha Singhania, Shaizeen Aga, and Mohamed Assem Ibrahim. 2025. FinGraV: Methodology for Fine-Grain GPU Power Visibility and In- sights. arXiv:2412.12426 [cs.AR]https://arxiv.org/abs/2412.12426

  51. [51]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In2025 IEEE International Sympo- sium on High Performance Computer Architecture (HPCA). 1348–1362. doi:10.1109/HPCA61900.2025.00102

  52. [52]

    Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xiaowen Chu. 2019. The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study. InProceedings of the Tenth ACM In- ternational Conference on Future Energy Systems(Phoenix, AZ, USA) (e-Energy ’19). Association for Computing Machinery, New York, NY, USA, 315–325. doi:10.1145/3307772.3328315

  53. [53]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie- Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]https://arxiv. org/abs/2302.13971

  54. [54]

    Daniel Velicka, Ondrej Vysocky, and Lubomir Riha. 2025. Method- ology for GPU Frequency Switching Latency Measurement. arXiv:2502.20075 [cs.DC]https://arxiv.org/abs/2502.20075

  55. [55]

    Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider. In2025 USENIX Annual Technical Conference (USENIX ATC 25). USENIX Association.https://www. usenix.org/conference/atc25/presentation/wang-jiahao

  56. [56]

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Dis- covery and Data Mining V.2 (KDD ’25). ACM, T...

  57. [57]

    Grant Wilkins, Srinivasan Keshav, and Richard Mortier. 2024. Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems. arXiv:2407.04014 [cs.DC] https://arxiv.org/abs/2407.04014

  58. [58]

    Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, New- sha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anas- tasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Ben- jamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, M...

  59. [59]

    xAI. 2026. Colossus: The World’s Largest AI Supercomputer.https: //x.ai/colossus

  60. [60]

    Jie You, Jae-Won Chung, and Mosharaf Chowdhury. 2023. Zeus: Under- standing and Optimizing GPU Energy Consumption of DNN Training. In20th USENIX Symposium on Networked Systems Design and Im- plementation (NSDI 23). USENIX Association, Boston, MA, 119–139. https://www.usenix.org/conference/nsdi23/presentation/you

  61. [61]

    Junyeol Yu, Jongseok Kim, and Euiseong Seo. 2023. Know Your Enemy To Save Cloud Energy: Energy-Performance Characteriza- tion of Machine Learning Serving. In2023 IEEE International Sympo- sium on High-Performance Computer Architecture (HPCA). 842–854. doi:10.1109/HPCA56546.2023.10070943

  62. [62]

    Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, and Ying Sheng. 2025. Prism: Unleashing GPU Sharing for Cost- Efficient Multi-LLM Serving. arXiv:2505.04021 [cs.DC]https://arxiv. org/abs/2505.04021

  63. [63]

    Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, and Haibo Chen. 2025. BLITZSCALE: fast and live large model autoscaling with O(1) host caching. InProceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (Boston, MA, USA)(OSDI ’25). USENIX Association, USA, Article 16, 19 pages

  64. [64]

    Yijia Zhang, Qiang Wang, Zhe Lin, Pengxiang Xu, and Bingqiang Wang. 2024. Improving GPU Energy Efficiency through an Application- transparent Frequency Scaling Policy with Performance Assurance. In Proceedings of the Nineteenth European Conference on Computer Systems (Athens, Greece)(EuroSys ’24). Association for Computing Machinery, New York, NY, USA, 76...