arxiv: 2604.17635 · v1 · submitted 2026-04-19 · 💻 cs.DC

EcoShift: Performance-Aware Power Management for Power-Constrained Heterogeneous Systems

Pith reviewed 2026-05-10 04:59 UTC · model grok-4.3

classification 💻 cs.DC
keywords power management · heterogeneous computing · CPU-GPU systems · performance prediction · dynamic programming · HPC · power constraints · cluster management

The pith

EcoShift predicts each application's sensitivity to CPU and GPU power caps and uses dynamic programming to allocate reclaimed power for up to 6% average performance gains while respecting cluster power limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EcoShift addresses power management in heterogeneous CPU-GPU high-performance computing systems operating under strict cluster-wide power constraints. Current approaches rely on fair-share or utilization-based heuristics, which ignore how differently applications respond when power is capped on the CPU versus the GPU. The framework uses online performance prediction to model these sensitivities and a dynamic-programming allocator to distribute the reclaimed power. In emulation-based evaluations on two platforms with Intel CPUs and NVIDIA A100 and H100 GPUs, it consistently outperforms existing policies, with up to 6% higher average performance. The approach matters because it raises overall system throughput under a fixed power budget without additional hardware.

Core claim

EcoShift is a performance-aware power management framework for power-constrained heterogeneous systems. It integrates online performance prediction of application sensitivity to CPU and GPU power caps with a dynamic-programming-based allocator that distributes reclaimed power to maximize the average performance improvement across workloads, all while maintaining the cluster-wide power limit.

What carries the argument

Online performance prediction combined with a dynamic-programming allocator for power distribution across heterogeneous applications.
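
As an illustration of the allocation step, the sketch below treats it as a multiple-choice knapsack solved by dynamic programming. It is not the paper's formulation: the `Option` tuples, the discretization of reclaimed power into integer units, and the function name `allocate` are all assumptions made for this example. Each application offers a menu of (extra power units, predicted gain) options, where one option may stand for a particular CPU/GPU split, and the allocator picks one option per application so that the total predicted gain is maximal and the budget is never exceeded.

```python
from typing import List, Tuple

# Hypothetical representation: (extra power units, predicted performance gain).
Option = Tuple[int, float]


def allocate(options: List[List[Option]], budget: int) -> Tuple[float, List[int]]:
    """Pick one option per application to maximize total predicted gain.

    Every application must include a zero-cost option such as (0, 0.0),
    so that "give this application nothing" is always feasible.
    Returns (best total gain, index of the chosen option per application).
    """
    for opts in options:
        if not any(cost == 0 for cost, _ in opts):
            raise ValueError("each application needs a zero-cost option")

    n = len(options)
    neg_inf = float("-inf")
    # best[i][b]: best total gain for the first i applications using at most b units.
    best = [[0.0] * (budget + 1)] + [[neg_inf] * (budget + 1) for _ in range(n)]
    choice = [[-1] * (budget + 1) for _ in range(n)]

    for i, opts in enumerate(options):
        for b in range(budget + 1):
            for k, (cost, gain) in enumerate(opts):
                if cost <= b:
                    candidate = best[i][b - cost] + gain
                    if candidate > best[i + 1][b]:
                        best[i + 1][b] = candidate
                        choice[i][b] = k

    # Walk back through the table to recover the chosen option per application.
    picks = [0] * n
    remaining = budget
    for i in range(n - 1, -1, -1):
        k = choice[i][remaining]
        picks[i] = k
        remaining -= options[i][k][0]
    return best[n][budget], picks


# Made-up predictions for three applications and a budget of 3 power units.
example_options = [
    [(0, 0.0), (1, 0.04), (2, 0.05)],  # gains saturate quickly
    [(0, 0.0), (1, 0.01), (2, 0.06)],  # needs 2 units to benefit
    [(0, 0.0), (1, 0.03), (2, 0.04)],
]
example_total, example_picks = allocate(example_options, 3)
# example_picks -> [1, 2, 0]: one unit to the first app, two to the second.
```

For a fixed set of applications, maximizing the sum of predicted gains is the same as maximizing their average. This sketch runs in O(N·P·K) time for N applications, P budget units, and K options per application; the numbers in the example are invented for illustration only.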

If this is right

  • Achieves up to 6% average performance improvement compared to state-of-the-art policies
  • Maintains the cluster power constraint across diverse CPU-GPU workloads
  • Outperforms state-of-the-art policies on two hardware platforms pairing Intel CPUs with NVIDIA A100 and H100 GPUs
  • Improves efficiency by better utilizing power reclaimed from application power caps

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be adapted for other types of heterogeneous computing environments beyond HPC, such as data centers with mixed accelerators.
  • If the prediction model is extended to more hardware configurations, it might enable more granular power management in large-scale deployments.
  • Testing with real-time job arrivals rather than emulated workloads would validate its practicality in production clusters.

Load-bearing premise

The method assumes that online performance predictions can accurately capture each application's sensitivity to CPU and GPU power caps quickly enough, and that the dynamic programming allocator incurs low enough overhead to be used repeatedly.
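
To make "sensitivity to CPU and GPU power caps" concrete, here is a minimal sketch of one way such a sensitivity could be estimated online from three short probes. The abstract does not describe the paper's predictor, so the finite-difference approach, the `Sample` type, and the function name `estimate_sensitivity` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    """One short measurement window (hypothetical structure)."""
    cpu_cap_w: float  # CPU power cap in watts
    gpu_cap_w: float  # GPU power cap in watts
    perf: float       # normalized performance, higher is better


def estimate_sensitivity(base: Sample, cpu_probe: Sample, gpu_probe: Sample) -> Tuple[float, float]:
    """Finite-difference estimate of performance change per watt.

    cpu_probe changes only the CPU cap relative to base;
    gpu_probe changes only the GPU cap relative to base.
    Returns (cpu_sensitivity, gpu_sensitivity).
    """
    d_cpu = cpu_probe.cpu_cap_w - base.cpu_cap_w
    d_gpu = gpu_probe.gpu_cap_w - base.gpu_cap_w
    if d_cpu == 0 or d_gpu == 0:
        raise ValueError("each probe must change its component's cap")
    cpu_sens = (cpu_probe.perf - base.perf) / d_cpu
    gpu_sens = (gpu_probe.perf - base.perf) / d_gpu
    return cpu_sens, gpu_sens


# Made-up measurements for a GPU-bound application.
base = Sample(cpu_cap_w=200, gpu_cap_w=300, perf=1.00)
cpu_probe = Sample(cpu_cap_w=150, gpu_cap_w=300, perf=0.98)
gpu_probe = Sample(cpu_cap_w=200, gpu_cap_w=250, perf=0.85)
cpu_sens, gpu_sens = estimate_sensitivity(base, cpu_probe, gpu_probe)
```

In this invented example the GPU sensitivity is several times the CPU sensitivity, so reclaimed watts would be better spent on the GPU. A two-point estimate like this is noisy and assumes local linearity, which is exactly the kind of prediction error the premise above depends on being small.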

What would settle it

An experiment showing that the performance predictor has high error rates for certain workloads, resulting in either power limit violations or no net performance gain over simpler allocation methods.

Figures

Figures reproduced from arXiv: 2604.17635 by Michael E. Papka, Zhiling Lan, Zhong Zheng.

Figure 1: Heatmaps of normalized application performance on a node with an Intel Xeon Platinum 8468 CPU and NVIDIA …
Figure 3: Workflow of EcoShift. EcoShift begins with a brief …
Figure 4: Dynamic Programming Search Space. Given a fixed …
Figure 5: Performance improvement of different power dis…
Figure 7: Performance improvement of different power dis…
Figure 9: Violin plots showing the distributions of performance improvement across different type of workloads under different …
Figure 10: Cumulative distribution function of the perfor…
Figure 11: Jain’s fairness index of the mixed workloads.
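
Figure 11 reports Jain's fairness index. For reference, the standard definition for n non-negative values x_i is (Σ x_i)² / (n · Σ x_i²), which ranges from 1/n (one value dominates) to 1 (all values equal). A minimal sketch follows; the material above does not say which per-application quantity the paper feeds into the index, so the inputs here are placeholders.

```python
from typing import Iterable


def jains_index(values: Iterable[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    xs = list(values)
    if not xs:
        raise ValueError("need at least one value")
    sum_sq = sum(x * x for x in xs)
    if sum_sq == 0:
        raise ValueError("index is undefined when all values are zero")
    total = sum(xs)
    return (total * total) / (len(xs) * sum_sq)


# Placeholder inputs: equal shares versus one application taking everything.
equal_shares = jains_index([1.0, 1.0, 1.0, 1.0])   # 1.0
single_winner = jains_index([1.0, 0.0, 0.0, 0.0])  # 0.25
```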
Original abstract

Power-constrained HPC systems increasingly run heterogeneous CPU-GPU applications under strict cluster-wide power limits. Existing cluster-wide power management policies rely on fair-share or utilization heuristics and do not capture application-specific sensitivity to CPU and GPU power caps, leading to inefficient use of reclaimed power. We present EcoShift, a performance-aware cluster-wide power management framework. EcoShift combines online performance prediction with a dynamic-programming-based allocator to distribute reclaimed power across CPU-GPU applications for maximum average performance improvement. Through emulation-based evaluation on two heterogeneous Intel CPU and NVIDIA A100/H100 GPU platforms with diverse CPU-GPU workloads, EcoShift consistently outperforms state-of-the-art policies, achieving up to 6% average performance improvement while preserving the cluster-wide power constraint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents EcoShift, a performance-aware cluster-wide power management framework for heterogeneous CPU-GPU HPC systems under strict power limits. It combines online performance prediction of application sensitivity to CPU/GPU power caps with a dynamic-programming allocator that redistributes reclaimed power to maximize average performance improvement while enforcing the global power constraint. Emulation-based evaluation on two platforms (Intel CPUs paired with NVIDIA A100/H100 GPUs) using diverse workloads reports consistent outperformance over state-of-the-art fair-share and utilization policies, with up to 6% average performance gain.

Significance. If the online predictor and allocator prove accurate and low-overhead in live settings, EcoShift could meaningfully improve throughput in power-constrained heterogeneous clusters by moving beyond heuristic power allocation. The dynamic-programming formulation is a clear technical contribution if its runtime cost remains negligible at the required invocation frequency.

major comments (2)
  1. [Evaluation section] The reported 6% average improvement is obtained via emulation that replays traces but supplies no measured accuracy of the online performance prediction model (e.g., error in sensitivity estimates) nor any timing data for the dynamic-programming allocator. Without these quantities it is impossible to determine whether the claimed gains survive the prediction error and scheduling jitter that would appear in a real cluster.
  2. [§3 (Approach) and Evaluation] The central claim that the DP allocator can be invoked frequently enough to track power-cap changes rests on an untested assumption about its computational cost. Emulation does not expose cache/TLB/thermal feedback or OS scheduler effects that could inflate allocator latency and invalidate the outperformance result.
minor comments (1)
  1. [Abstract] The phrase 'diverse CPU-GPU workloads' is used without enumerating the applications, their mix of CPU/GPU intensity, or the total number of runs; this detail belongs in the evaluation description.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and detailed comments on our manuscript. We address each major comment below, clarifying the emulation-based nature of our evaluation while committing to strengthen the presentation with additional quantitative details on model accuracy and allocator overhead.

Point-by-point responses
  1. Referee: [Evaluation section] The reported 6% average improvement is obtained via emulation that replays traces but supplies no measured accuracy of the online performance prediction model (e.g., error in sensitivity estimates) nor any timing data for the dynamic-programming allocator. Without these quantities it is impossible to determine whether the claimed gains survive the prediction error and scheduling jitter that would appear in a real cluster.

    Authors: We agree that explicit quantification of prediction accuracy and allocator runtime would improve the evaluation. Our traces were collected from real executions on the target Intel+NVIDIA platforms under varying power caps; we will add a new subsection reporting the online predictor's mean absolute percentage error on sensitivity estimates via leave-one-out cross-validation on these traces. We will also include wall-clock timing measurements for the DP allocator (O(N·P) complexity) collected on the same hardware, showing sub-millisecond execution for typical cluster sizes (N≤32). These additions will allow assessment of robustness to prediction error. A production live-cluster deployment with full OS/hardware feedback lies outside the current scope and is noted as future work.
    revision: partial

  2. Referee: [§3 (Approach) and Evaluation] The central claim that the DP allocator can be invoked frequently enough to track power-cap changes rests on an untested assumption about its computational cost. Emulation does not expose cache/TLB/thermal feedback or OS scheduler effects that could inflate allocator latency and invalidate the outperformance result.

    Authors: The DP allocator's computational cost is analytically bounded and we will add direct timing benchmarks in the revised evaluation to confirm it remains negligible relative to typical power-cap change intervals (seconds to minutes). While emulation necessarily abstracts certain micro-architectural and OS effects, the performance traces already embed measured application responses to power caps, and the allocator decisions depend only on the predicted sensitivities derived from those traces. We will add a concise discussion of potential real-system latency inflation and why the low invocation frequency required for cluster power management makes such effects unlikely to overturn the reported gains.
    revision: partial
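
The accuracy metric named in the first response, mean absolute percentage error under leave-one-out cross-validation, can be sketched as follows. The one-dimensional linear model, the `Point` layout, and the function names `fit_line` and `loo_mape` are stand-ins chosen for this example; the predictor the authors would evaluate is not described in the material above.

```python
from typing import Callable, List, Tuple

# Hypothetical trace point: (power cap in watts, measured normalized performance).
Point = Tuple[float, float]
Model = Callable[[float], float]


def fit_line(points: List[Point]) -> Model:
    """Ordinary least-squares line through the points (stand-in predictor)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct power caps")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def loo_mape(points: List[Point], fit: Callable[[List[Point]], Model] = fit_line) -> float:
    """Leave-one-out MAPE in percent; measured values must be nonzero."""
    errors = []
    for i, (x, y) in enumerate(points):
        held_out_model = fit(points[:i] + points[i + 1:])  # train without point i
        errors.append(abs((held_out_model(x) - y) / y))
    return 100.0 * sum(errors) / len(errors)


# Made-up traces: one perfectly linear, one that saturates at high caps.
linear_trace = [(100.0, 0.4), (150.0, 0.6), (200.0, 0.8), (250.0, 1.0)]
saturating_trace = [(100.0, 0.5), (150.0, 0.8), (200.0, 0.95), (250.0, 1.0)]
linear_error = loo_mape(linear_trace)
saturating_error = loo_mape(saturating_trace)
```

On the invented linear trace the held-out error is essentially zero, while the saturating trace yields a nonzero error because a straight line cannot follow the flattening curve. Reporting this kind of number per workload is what would let a reader judge whether the allocator's inputs are trustworthy.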

standing simulated objections not resolved
  • Direct measurement of predictor accuracy and allocator latency inside a live, production heterogeneous cluster under real cache, thermal, and OS-scheduler feedback.

Circularity Check

0 steps flagged

No circularity: derivation relies on external models and optimization, not self-referential fits

full rationale

The abstract and description present EcoShift as combining online performance prediction (sensitivity to CPU/GPU caps) with a DP allocator for power distribution. Evaluation is emulation-based on Intel CPU + NVIDIA A100/H100 platforms, claiming up to 6% average improvement over state-of-the-art policies while meeting power constraints. No equations, self-citations, or steps are shown that reduce a 'prediction' or result to a fitted parameter or prior self-work by construction. The approach treats performance models and DP as independent components whose accuracy is an external assumption, not a definitional tautology. This matches the default expectation of non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text. The central claim implicitly rests on the domain assumption that application performance under power caps is predictable online.

axioms (1)
  • Domain assumption: Application performance sensitivity to CPU and GPU power caps can be predicted online with sufficient accuracy to guide allocation decisions.
    This assumption underpins both the prediction component and the claim that the allocator can maximize average performance improvement.

pith-pipeline@v0.9.0 · 5424 in / 1227 out tokens · 33910 ms · 2026-05-10T04:59:13.878103+00:00 · methodology
