Towards Energy Efficient Co-Scheduling in HPC
Pith reviewed 2026-05-10 04:49 UTC · model grok-4.3
The pith
EcoSched jointly selects per-application GPU counts and coschedules jobs on multi-GPU HPC systems to cut energy waste from nonlinear scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EcoSched is an online scheduler that jointly optimizes GPU count selection and application coscheduling to improve workload-level efficiency on multi-GPU systems. It uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. Implementation and evaluation on heterogeneous CPU-GPU platforms with diverse workloads on H100, A100, and V100 systems show up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers with modest performance overhead.
What carries the argument
The score-based policy that consumes lightweight runtime profiling estimates of relative performance across GPU counts to decide allocations while balancing energy efficiency against idle resources.
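To make the idea concrete, here is a minimal Python sketch of how a score-based choice over profiled GPU counts might look. It is not the paper's policy: the scoring formula, the `idle_weight` parameter, and the names `Profile`, `score`, and `pick_gpu_count` are all hypothetical, since this summary does not give the actual score.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Profile:
    """One short profiling run at a given GPU count (hypothetical schema)."""
    gpus: int
    throughput: float  # work units per second observed in the short run
    power_w: float     # average power draw in watts during the short run


def score(p: Profile, free_gpus: int, idle_weight: float = 0.1) -> float:
    """Assumed score: work per joule, discounted by the share of GPUs left idle."""
    work_per_joule = p.throughput / p.power_w
    idle_fraction = (free_gpus - p.gpus) / free_gpus
    return work_per_joule * (1.0 - idle_weight * idle_fraction)


def pick_gpu_count(profiles: Iterable[Profile], free_gpus: int,
                   idle_weight: float = 0.1) -> int:
    """Return the profiled GPU count with the highest score that fits."""
    feasible = [p for p in profiles if p.gpus <= free_gpus]
    if not feasible:
        raise ValueError("no profiled GPU count fits in the free GPUs")
    best = max(feasible, key=lambda p: score(p, free_gpus, idle_weight))
    return best.gpus


# Made-up profile of a job whose throughput scales sublinearly.
profiles = [
    Profile(gpus=1, throughput=100.0, power_w=300.0),
    Profile(gpus=2, throughput=190.0, power_w=600.0),
    Profile(gpus=4, throughput=300.0, power_w=1200.0),
    Profile(gpus=8, throughput=360.0, power_w=2400.0),
]
print(pick_gpu_count(profiles, free_gpus=8, idle_weight=0.5))
```

With these made-up numbers, a small `idle_weight` favors the most energy-efficient count (1 GPU), a large one favors filling the node (8 GPUs), and intermediate values land in between. That is the trade-off between energy efficiency and idle resources the summary describes, in toy form.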
If this is right
- Applications that do not scale linearly can be assigned fewer GPUs than the maximum available without sacrificing overall system throughput.
- NUMA-aware placement reduces interference enough to support coscheduling of multiple jobs on the same nodes.
- Energy, makespan, and EDP improvements hold across GPU generations from V100 to H100 with only modest profiling overhead.
- The method shows that joint optimization of count selection and coscheduling is required; neither alone suffices for the reported gains.
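The first implication rests on simple arithmetic: when speedup is sublinear, adding GPUs shortens runtime more slowly than it adds power draw. The Python sketch below illustrates this under stated assumptions that are not from the paper: Amdahl's law stands in for the scaling behavior, each allocated GPU draws a constant power, and idle power of unallocated GPUs is ignored. The function names and numbers are hypothetical.

```python
def amdahl_speedup(n_gpus: int, parallel_fraction: float) -> float:
    """Speedup over one GPU under Amdahl's law (assumed scaling model)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_gpus)


def job_energy_joules(n_gpus: int, t1_seconds: float, gpu_power_w: float,
                      parallel_fraction: float) -> float:
    """Energy = GPUs x per-GPU power x runtime, with constant per-GPU power."""
    runtime = t1_seconds / amdahl_speedup(n_gpus, parallel_fraction)
    return n_gpus * gpu_power_w * runtime


# Hypothetical job: 1000 s on one GPU, 400 W per GPU, 90% parallel work.
for n in (1, 2, 4, 8):
    energy_kj = job_energy_joules(n, 1000.0, 400.0, 0.9) / 1000.0
    print(n, round(amdahl_speedup(n, 0.9), 2), round(energy_kj, 1))
```

In this toy case, 8 GPUs finish about 4.7 times faster than 1 GPU but consume 680 kJ instead of 400 kJ, which is the kind of waste a GPU-count selector tries to avoid.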
Where Pith is reading between the lines
- The same lightweight profiling approach could be extended to other accelerators whose scaling behavior is also nonlinear.
- Integrating EcoSched with existing batch schedulers would let users request energy-aware rather than purely performance-oriented allocations.
- Cumulative savings could grow larger on long-running production workloads where the per-job energy reductions compound over many hours.
Load-bearing premise
Lightweight runtime profiling can reliably estimate relative performance across GPU counts for heterogeneous applications, and the score-based policy generalizes without needing per-workload tuning or introducing unacceptable overhead.
What would settle it
A workload in which the short profiling estimates deviate substantially from full-run scaling behavior, causing the scheduler to pick GPU counts that increase rather than decrease energy or makespan compared with a static all-GPUs baseline.
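A check of this kind could be scripted roughly as follows. This Python sketch is an assumption-laden illustration, not the paper's validation procedure: the dictionary layouts, the metric names `energy_j` and `makespan_s`, and the functions `max_relative_error` and `regressed_metrics` are hypothetical.

```python
from typing import Dict, List


def max_relative_error(estimated: Dict[int, float],
                       measured: Dict[int, float]) -> float:
    """Largest relative gap between short-run and full-run speedups.

    Both dicts map a GPU count to a speedup over the 1-GPU run.
    """
    return max(abs(estimated[n] - measured[n]) / measured[n] for n in measured)


def regressed_metrics(scheduled: Dict[str, float],
                      baseline: Dict[str, float]) -> List[str]:
    """Names of workload-level metrics that are worse than the baseline."""
    return [name for name in ("energy_j", "makespan_s")
            if scheduled[name] > baseline[name]]


# Made-up case: the short run overestimates scaling at 4 and 8 GPUs.
estimated = {1: 1.0, 2: 1.9, 4: 3.6, 8: 6.8}
measured = {1: 1.0, 2: 1.8, 4: 2.4, 8: 2.5}
print(max_relative_error(estimated, measured))

# Made-up workload totals for the scheduler and a static all-GPUs baseline.
print(regressed_metrics({"energy_j": 5.2e6, "makespan_s": 900.0},
                        {"energy_j": 5.0e6, "makespan_s": 1000.0}))
```

A workload that shows both a large estimation error and a non-empty list of regressed metrics would be the counterexample described above.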
Original abstract
Modern multi-GPU HPC systems expose substantial computational capacity, yet inefficient GPU allocation often leads to wasted energy and underutilization. In practice, GPU applications exhibit heterogeneous and nonlinear scaling, making it inefficient to always use all available GPUs. We present EcoSched, an online scheduler that jointly optimizes GPU count selection and application coscheduling to improve workload-level efficiency on multi-GPU systems. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. We implement EcoSched on heterogeneous CPU-GPU platforms and evaluate it with diverse workloads on H100, A100, and V100 systems. EcoSched achieves up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers, with modest performance overhead. These results show that jointly selecting GPU counts and coscheduling actions is essential for efficient multi-GPU workload execution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EcoSched, an online scheduler for multi-GPU HPC systems that uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency against idle resources, and incorporates NUMA-aware placement for coscheduling. It claims up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers on H100/A100/V100 platforms with diverse workloads and modest overhead.
Significance. If the empirical claims hold under rigorous validation, the work could offer practical value for energy-efficient GPU allocation in modern HPC by addressing heterogeneous nonlinear scaling. The purely empirical approach with direct measurements provides concrete numbers but requires stronger support for baselines and profiling reliability to elevate its impact.
major comments (2)
- [Evaluation (abstract and §4)] The central empirical claims of 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction lack supporting details on baseline schedulers, statistical significance testing, workload selection criteria, and controls for post-hoc tuning. This directly weakens the soundness of the reported gains over baselines.
- [Section 3] Section 3 describes the lightweight runtime profiling (short runs measuring throughput/energy for GPU counts 1/2/4/8) feeding the score-based policy, but provides no validation, error bounds, or sensitivity analysis for estimation accuracy under nonlinear effects such as memory saturation or NUMA interference during coscheduling. This assumption is load-bearing for the coscheduling decisions and the claimed improvements.
minor comments (1)
- [Abstract and §4] The abstract and evaluation could more explicitly define the EDP metric and the exact composition of the 'diverse workloads' to aid reproducibility.
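For context on the metric, energy-delay product is conventionally energy multiplied by delay. The Python sketch below works through the arithmetic under two assumptions that this summary does not confirm: that the paper takes makespan as the delay term, and that the peak energy and makespan figures come from the same configuration. The function names are hypothetical.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: energy multiplied by delay."""
    return energy_joules * delay_seconds


def edp_reduction(energy_saving: float, delay_saving: float) -> float:
    """Fractional EDP reduction implied by fractional energy and delay savings."""
    return 1.0 - (1.0 - energy_saving) * (1.0 - delay_saving)


# Reported peak figures: 14.8% energy savings and 30.1% makespan improvement.
print(round(edp_reduction(0.148, 0.301), 3))
```

Under those assumptions the implied reduction is 1 - 0.852 x 0.699, or about 40.4%, which agrees with the reported EDP figure. An explicit definition in the paper would confirm whether this is how the number was obtained.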
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the evaluation and validation sections can be strengthened without altering the core contributions or results.
- Referee: [Evaluation (abstract and §4)] The central empirical claims of 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction lack supporting details on baseline schedulers, statistical significance testing, workload selection criteria, and controls for post-hoc tuning. This directly weakens the soundness of the reported gains over baselines.
  Authors: We agree that greater transparency in these areas would strengthen the presentation. The manuscript describes the baseline schedulers and workload selection criteria in Section 4, but we acknowledge that statistical significance testing and explicit controls for post-hoc tuning were not detailed sufficiently. In the revised version, we expand Section 4 to include paired statistical tests with p-values for the reported improvements, explicit workload selection rationale based on scaling diversity, and a clarification that all policy parameters were fixed from separate profiling runs with no post-hoc adjustment on the evaluation data. These additions directly support the soundness of the claims.
  Revision: yes
- Referee: [Section 3] Section 3 describes the lightweight runtime profiling (short runs measuring throughput/energy for GPU counts 1/2/4/8) feeding the score-based policy, but provides no validation, error bounds, or sensitivity analysis for estimation accuracy under nonlinear effects such as memory saturation or NUMA interference during coscheduling. This assumption is load-bearing for the coscheduling decisions and the claimed improvements.
  Authors: We concur that validation of the profiling step is essential given its role in the decisions. The original Section 3 focuses on the method description, but we have revised it to incorporate a dedicated validation subsection. This includes direct comparisons of short-run estimates against full-execution measurements, reported error bounds across the workload set, and sensitivity analysis addressing nonlinear scaling, memory saturation, and NUMA effects. The added analysis confirms that estimation errors remain bounded and do not undermine the coscheduling policy or overall gains.
  Revision: yes
Circularity Check
No circularity: purely empirical design and measurement
The paper describes an online scheduler (EcoSched) whose core mechanisms—lightweight runtime profiling of throughput/energy, a score-based policy for GPU count selection, and NUMA-aware placement—are presented as practical heuristics implemented and measured on real hardware (H100/A100/V100). No equations, derivations, fitted parameters, or predictions appear; all reported gains (energy, makespan, EDP) are direct experimental outcomes rather than quantities defined in terms of themselves. Self-citations, if present, are not load-bearing for any claimed result. The work is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.