EcoShift: Performance-Aware Power Management for Power-Constrained Heterogeneous Systems
Pith reviewed 2026-05-10 04:59 UTC · model grok-4.3
The pith
EcoShift predicts each application's sensitivity to CPU and GPU power caps and uses dynamic programming to allocate reclaimed power for up to 6% average performance gains while respecting cluster power limits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EcoShift is a performance-aware power management framework for power-constrained heterogeneous systems. It integrates online performance prediction of application sensitivity to CPU and GPU power caps with a dynamic-programming-based allocator that distributes reclaimed power to maximize the average performance improvement across workloads, all while maintaining the cluster-wide power limit.
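One way to write the allocation problem described above, offered as a sketch under assumed notation. None of these symbols come from the paper: N is the number of running applications, c_i and g_i are the extra CPU and GPU power granted to application i, \hat{I}_i is the predicted performance improvement for that grant, and R is the reclaimed power available for redistribution.

```latex
\begin{aligned}
\max_{\{c_i,\, g_i\}} \quad & \frac{1}{N} \sum_{i=1}^{N} \hat{I}_i(c_i, g_i) \\
\text{s.t.} \quad & \sum_{i=1}^{N} (c_i + g_i) \le R, \\
& c_i \ge 0,\; g_i \ge 0 \quad \text{for } i = 1, \dots, N.
\end{aligned}
```

Under this reading, handing out no more than R leaves total draw within the cluster-wide limit, assuming the baseline caps already fit inside that limit once R has been reclaimed.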
What carries the argument
Online performance prediction combined with a dynamic-programming allocator for power distribution across heterogeneous applications.
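A minimal sketch of a dynamic-programming allocator of this kind, assuming power is handed out in discrete units and the predictor supplies one gain table per application. The function name, the table layout, and the multiple-choice-knapsack recurrence are illustrative assumptions; this summary does not show the paper's actual formulation, and this sketch runs in O(N·P·K) time for N applications, P budget units, and K options per application.

```python
from typing import List, Sequence


def allocate_reclaimed_power(
    gains: Sequence[Sequence[float]], budget_units: int
) -> List[int]:
    """Split `budget_units` of reclaimed power across applications.

    gains[i][k] is the (hypothetical) predicted improvement of application i
    when it receives k extra power units; gains[i][0] should be 0.
    Returns the number of units granted to each application.
    """
    n = len(gains)
    # best[i][b]: maximum total gain over the first i apps using at most b units.
    best = [[0.0] * (budget_units + 1) for _ in range(n + 1)]
    # choice[i][b]: units given to app i in that optimum.
    choice = [[0] * (budget_units + 1) for _ in range(n)]

    for i in range(n):
        max_k = len(gains[i]) - 1
        for b in range(budget_units + 1):
            best_val, best_k = float("-inf"), 0
            for k in range(min(b, max_k) + 1):
                val = best[i][b - k] + gains[i][k]
                if val > best_val:  # strict '>' keeps the smallest k on ties
                    best_val, best_k = val, k
            best[i + 1][b] = best_val
            choice[i][b] = best_k

    # Walk back through the choice table to recover the allocation.
    alloc = [0] * n
    remaining = budget_units
    for i in range(n - 1, -1, -1):
        alloc[i] = choice[i][remaining]
        remaining -= alloc[i]
    return alloc


# Hypothetical gain tables (percent improvement per extra unit count).
example_gains = [
    [0, 5, 6, 6.5],   # saturates quickly
    [0, 1, 4, 9],     # needs several units before it pays off
    [0, 2, 3, 3.5],
]
print(allocate_reclaimed_power(example_gains, 4))  # -> [1, 3, 0]
```

Maximizing the total gain is equivalent to maximizing the average, since the number of applications is fixed. The allocation never exceeds the budget, which is how a scheme like this would preserve the power limit by construction.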
If this is right
- Achieves up to 6% average performance improvement compared to state-of-the-art policies
- Maintains the cluster power constraint across diverse CPU-GPU workloads
- Outperforms state-of-the-art policies on two different hardware platforms with Intel CPUs and NVIDIA A100/H100 GPUs
- Improves efficiency by better utilizing power reclaimed from application power caps
Where Pith is reading between the lines
- This method could be adapted for other types of heterogeneous computing environments beyond HPC, such as data centers with mixed accelerators.
- If the prediction model is extended to more hardware configurations, it might enable more granular power management in large-scale deployments.
- Testing with real-time job arrivals rather than emulated workloads would show whether it is practical in production clusters.
Load-bearing premise
The method assumes that online performance predictions can accurately capture each application's sensitivity to CPU and GPU power caps quickly enough, and that the dynamic programming allocator incurs low enough overhead to be used repeatedly.
What would settle it
An experiment showing that the performance predictor has high error rates for certain workloads, resulting in either power limit violations or no net performance gain over simpler allocation methods.
Original abstract
Power-constrained HPC systems increasingly run heterogeneous CPU-GPU applications under strict cluster-wide power limits. Existing cluster-wide power management policies rely on fair-share or utilization heuristics and do not capture application-specific sensitivity to CPU and GPU power caps, leading to inefficient use of reclaimed power. We present EcoShift, a performance-aware cluster-wide power management framework. EcoShift combines online performance prediction with a dynamic-programming-based allocator to distribute reclaimed power across CPU-GPU applications for maximum average performance improvement. Through emulation-based evaluation on two heterogeneous Intel CPU and NVIDIA A100/H100 GPU platforms with diverse CPU-GPU workloads, EcoShift consistently outperforms state-of-the-art policies, achieving up to 6% average performance improvement while preserving the cluster-wide power constraint.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EcoShift, a performance-aware cluster-wide power management framework for heterogeneous CPU-GPU HPC systems under strict power limits. It combines online performance prediction of application sensitivity to CPU/GPU power caps with a dynamic-programming allocator that redistributes reclaimed power to maximize average performance improvement while enforcing the global power constraint. Emulation-based evaluation on two platforms (Intel CPUs paired with NVIDIA A100/H100 GPUs) using diverse workloads reports consistent outperformance over state-of-the-art fair-share and utilization policies, with up to 6% average performance gain.
Significance. If the online predictor and allocator prove accurate and low-overhead in live settings, EcoShift could meaningfully improve throughput in power-constrained heterogeneous clusters by moving beyond heuristic power allocation. The dynamic-programming formulation is a clear technical contribution if its runtime cost remains negligible at the required invocation frequency.
major comments (2)
- [Evaluation section] Evaluation section: the reported 6% average improvement is obtained via emulation that replays traces but supplies no measured accuracy of the online performance prediction model (e.g., error in sensitivity estimates) nor any timing data for the dynamic-programming allocator. Without these quantities it is impossible to determine whether the claimed gains survive the prediction error and scheduling jitter that would appear in a real cluster.
- [§3 and Evaluation] §3 (Approach) and Evaluation: the central claim that the DP allocator can be invoked frequently enough to track power-cap changes rests on an untested assumption about its computational cost. Emulation does not expose cache/TLB/thermal feedback or OS scheduler effects that could inflate allocator latency and invalidate the outperformance result.
minor comments (1)
- [Abstract] Abstract: the phrase 'diverse CPU-GPU workloads' is used without enumerating the applications, their mix of CPU/GPU intensity, or the total number of runs; this detail belongs in the evaluation description.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments on our manuscript. We address each major comment below, clarifying the emulation-based nature of our evaluation while committing to strengthen the presentation with additional quantitative details on model accuracy and allocator overhead.
- Referee: [Evaluation section] Evaluation section: the reported 6% average improvement is obtained via emulation that replays traces but supplies no measured accuracy of the online performance prediction model (e.g., error in sensitivity estimates) nor any timing data for the dynamic-programming allocator. Without these quantities it is impossible to determine whether the claimed gains survive the prediction error and scheduling jitter that would appear in a real cluster.
  Authors: We agree that explicit quantification of prediction accuracy and allocator runtime would improve the evaluation. Our traces were collected from real executions on the target Intel+NVIDIA platforms under varying power caps; we will add a new subsection reporting the online predictor's mean absolute percentage error on sensitivity estimates via leave-one-out cross-validation on these traces. We will also include wall-clock timing measurements for the DP allocator (O(N·P) complexity) collected on the same hardware, showing sub-millisecond execution for typical cluster sizes (N≤32). These additions will allow assessment of robustness to prediction error. A production live-cluster deployment with full OS/hardware feedback lies outside the current scope and is noted as future work.
  Revision: partial
- Referee: [§3 and Evaluation] §3 (Approach) and Evaluation: the central claim that the DP allocator can be invoked frequently enough to track power-cap changes rests on an untested assumption about its computational cost. Emulation does not expose cache/TLB/thermal feedback or OS scheduler effects that could inflate allocator latency and invalidate the outperformance result.
  Authors: The DP allocator's computational cost is analytically bounded and we will add direct timing benchmarks in the revised evaluation to confirm it remains negligible relative to typical power-cap change intervals (seconds to minutes). While emulation necessarily abstracts certain micro-architectural and OS effects, the performance traces already embed measured application responses to power caps, and the allocator decisions depend only on the predicted sensitivities derived from those traces. We will add a concise discussion of potential real-system latency inflation and why the low invocation frequency required for cluster power management makes such effects unlikely to overturn the reported gains.
  Revision: partial
- Direct measurement of predictor accuracy and allocator latency inside a live, production heterogeneous cluster under real cache, thermal, and OS-scheduler feedback.
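The accuracy metric promised in the rebuttal, mean absolute percentage error under leave-one-out cross-validation, can be sketched as follows. The linear model of performance versus power cap is a stand-in assumption, since the paper's predictor is not described in this summary, and the function names and sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (power cap in watts, measured performance)


def fit_line(points: List[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def loo_mape(points: List[Sample]) -> float:
    """Leave-one-out MAPE in percent.

    Each sample is held out in turn, the stand-in model is fitted on the
    rest, and the held-out point is predicted. Measured values must be
    non-zero, and at least three samples are needed for a meaningful fit.
    """
    errors = []
    for i, (cap, measured) in enumerate(points):
        train = points[:i] + points[i + 1:]
        slope, intercept = fit_line(train)
        predicted = slope * cap + intercept
        errors.append(abs(predicted - measured) / abs(measured))
    return 100.0 * sum(errors) / len(errors)


# Hypothetical trace: performance flattens as the cap rises.
trace = [(100.0, 1.0), (150.0, 1.4), (200.0, 1.6), (250.0, 1.65)]
print(f"LOO MAPE: {loo_mape(trace):.2f}%")
```

A high value on some workloads would be the kind of evidence the referee asks for, because the allocator's decisions depend only on these predictions.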
Circularity Check
No circularity: derivation relies on external models and optimization, not self-referential fits
The abstract and description present EcoShift as combining online performance prediction (sensitivity to CPU/GPU caps) with a DP allocator for power distribution. Evaluation is emulation-based on Intel CPU + NVIDIA A100/H100 platforms, claiming up to 6% average improvement over state-of-the-art policies while meeting power constraints. No equations, self-citations, or steps are shown that reduce a 'prediction' or result to a fitted parameter or prior self-work by construction. The approach treats performance models and DP as independent components whose accuracy is an external assumption, not a definitional tautology. This matches the default expectation of non-circular papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: application performance sensitivity to CPU and GPU power caps can be predicted online with sufficient accuracy to guide allocation decisions.