arxiv: 2604.17640 · v1 · submitted 2026-04-19 · 💻 cs.DC

Towards Energy Efficient Co-Scheduling in HPC

Pith reviewed 2026-05-10 04:49 UTC · model grok-4.3

classification 💻 cs.DC
keywords energy efficient scheduling · multi-GPU HPC · coscheduling · GPU count selection · runtime profiling · NUMA placement · energy-delay product

The pith

EcoSched jointly selects per-application GPU counts and coschedules jobs on multi-GPU HPC systems to cut energy waste from nonlinear scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern multi-GPU HPC systems waste energy and leave resources idle because many applications scale heterogeneously and nonlinearly with added GPUs. EcoSched counters this by running short runtime profiling to estimate how performance changes with different GPU counts, then feeding those estimates into a score-based policy that trades off energy use against idle resources while adding NUMA-aware placement to limit interference. The result is an online scheduler that decides both how many GPUs each job receives and which jobs run together. If the approach holds, it demonstrates that dynamic GPU allocation plus smart coscheduling can produce measurable gains in energy, completion time, and energy-delay product on existing hardware. Readers should care because the method requires only modest overhead and works across V100, A100, and H100 platforms without per-workload tuning.

Core claim

EcoSched is an online scheduler that jointly optimizes GPU count selection and application coscheduling to improve workload-level efficiency on multi-GPU systems. It uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. Implementation and evaluation on heterogeneous CPU-GPU platforms with diverse workloads on H100, A100, and V100 systems show up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% energy-delay product (EDP) reduction over baseline schedulers with modest performance overhead.
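
To make the idea of a score-based choice concrete, here is a minimal sketch of picking a GPU count from profiled estimates. It is not the paper's policy: the score formula, the `idle_weight` and `idle_power_w` parameters, the `Profile` type, and the sample numbers are all assumptions made for illustration, since this page does not give the actual score.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    gpus: int           # GPU count tried during the short profiling run
    rel_runtime: float  # estimated runtime relative to the 1-GPU run
    avg_power_w: float  # average power summed over the allocated GPUs

def score(p, free_gpus, idle_power_w, idle_weight):
    """Hypothetical score (lower is better): job energy plus weighted idle energy."""
    job_energy = p.avg_power_w * p.rel_runtime
    idle_energy = (free_gpus - p.gpus) * idle_power_w * p.rel_runtime
    return job_energy + idle_weight * idle_energy

def pick_gpu_count(profiles, free_gpus, idle_power_w=50.0, idle_weight=1.0):
    """Return the feasible GPU count with the lowest score, or None if none fits."""
    feasible = [p for p in profiles if p.gpus <= free_gpus]
    if not feasible:
        return None
    best = min(feasible, key=lambda p: score(p, free_gpus, idle_power_w, idle_weight))
    return best.gpus

# Made-up profile of an application that scales poorly beyond 2 GPUs.
profiles = [
    Profile(1, 1.00, 300.0),
    Profile(2, 0.60, 600.0),
    Profile(4, 0.50, 1200.0),
    Profile(8, 0.45, 2400.0),
]
print(pick_gpu_count(profiles, free_gpus=8))  # picks 2 under these made-up numbers
```

Raising `idle_weight` in this sketch pushes the choice toward larger allocations, which is one simple way to express the trade-off between energy efficiency and idle resources.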

What carries the argument

The score-based policy that consumes lightweight runtime profiling estimates of relative performance across GPU counts to decide allocations while balancing energy efficiency against idle resources.

If this is right

  • Applications that do not scale linearly can be assigned fewer GPUs than the maximum available without sacrificing overall system throughput.
  • NUMA-aware placement reduces interference enough to support coscheduling of multiple jobs on the same nodes.
  • Energy, makespan, and EDP improvements hold across GPU generations from V100 to H100 with only modest profiling overhead.
  • The method shows that joint optimization of count selection and coscheduling is required; neither alone suffices for the reported gains.
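
The NUMA-aware placement mentioned in the second bullet could be sketched as a best-fit heuristic over a GPU-to-NUMA-node map. This is an assumption about what such placement might look like, not the paper's algorithm; the `place` function, the topology dictionary, and the tie-breaking rules are hypothetical.

```python
from collections import defaultdict

def place(free_gpus, gpu_to_numa, k):
    """Return k free GPU ids, preferring a single NUMA node; None if fewer than k are free."""
    if k <= 0 or len(free_gpus) < k:
        return None
    # Group the free GPUs by the NUMA node they are attached to.
    by_node = defaultdict(list)
    for g in sorted(free_gpus):
        by_node[gpu_to_numa[g]].append(g)
    # Best fit: use the node with the fewest free GPUs that still holds the whole job,
    # so that nodes with more free GPUs stay available for larger jobs.
    fitting = [gs for gs in by_node.values() if len(gs) >= k]
    if fitting:
        return min(fitting, key=len)[:k]
    # Otherwise take GPUs from the nodes with the most free GPUs first,
    # which keeps the number of nodes the job spans small.
    chosen = []
    for gs in sorted(by_node.values(), key=len, reverse=True):
        chosen.extend(gs[: k - len(chosen)])
        if len(chosen) == k:
            break
    return chosen

# Hypothetical 8-GPU node with two NUMA domains of four GPUs each.
topology = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
print(place({1, 2, 3, 4, 5}, topology, 2))  # [4, 5]
```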

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight profiling approach could be extended to other accelerators whose scaling behavior is also nonlinear.
  • Integrating EcoSched with existing batch schedulers would let users request energy-aware rather than purely performance-oriented allocations.
  • Cumulative savings could grow larger on long-running production workloads where the per-job energy reductions compound over many hours.

Load-bearing premise

Lightweight runtime profiling can reliably estimate relative performance across GPU counts for heterogeneous applications, and the score-based policy generalizes without needing per-workload tuning or introducing unacceptable overhead.

What would settle it

A workload in which the short profiling estimates deviate substantially from full-run scaling behavior, causing the scheduler to pick GPU counts that increase rather than decrease energy or makespan compared with a static all-GPUs baseline.
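
One way to run such a check is sketched below, under stated assumptions: runtimes are expressed relative to the 1-GPU run, the energy proxy is runtime times average power, and the function names and sample numbers are hypothetical rather than taken from the paper.

```python
def relative_errors(estimated, measured):
    """Map GPU count -> |estimate - measurement| / measurement for shared GPU counts."""
    return {
        n: abs(estimated[n] - measured[n]) / measured[n]
        for n in estimated
        if n in measured
    }

def decision_flips(estimated, measured, power_w):
    """True if the energy-minimal GPU count differs between estimates and full runs."""
    def best(runtimes):
        # Energy proxy (assumption): relative runtime times average power at that count.
        return min(runtimes, key=lambda n: runtimes[n] * power_w[n])
    return best(estimated) != best(measured)

# Made-up numbers: the short profile is too optimistic about multi-GPU scaling.
estimated = {1: 1.0, 2: 0.45, 4: 0.30}
measured = {1: 1.0, 2: 0.60, 4: 0.50}
power_w = {1: 300.0, 2: 600.0, 4: 1200.0}

print(relative_errors(estimated, measured))
print(decision_flips(estimated, measured, power_w))  # True: estimates favor 2 GPUs, full runs favor 1
```

A workload for which `decision_flips` returns True and the flipped choice costs more energy or makespan than the all-GPUs baseline would be the kind of counterexample described above.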

Figures

Figures reproduced from arXiv: 2604.17640 by Michael E. Papka, Zhiling Lan, Zhong Zheng.

Figure 1: Application performance with different GPU counts.
Figure 3: Comparison of scheduling schemes. Experiments are …
Figure 4: EcoSched consists of two phases: (i) an online per …
Figure 5: Relationship between GPU DRAM utilization and performance (runtime) across H100, A100, and V100 platforms.
Figure 6: Scheduling strategy comparison in energy saving, …
Figure 7: Scheduling six applications on system 1. Marble …
Figure 8: Per-application energy breakdown for the case-study …
Figure 9: Per-application performance loss (measured by run …
Original abstract

Modern multi-GPU HPC systems expose substantial computational capacity, yet inefficient GPU allocation often leads to wasted energy and underutilization. In practice, GPU applications exhibit heterogeneous and nonlinear scaling, making it inefficient to always use all available GPUs. We present EcoSched, an online scheduler that jointly optimizes GPU count selection and application coscheduling to improve workload-level efficiency on multi-GPU systems. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. We implement EcoSched on heterogeneous CPU-GPU platforms and evaluate it with diverse workloads on H100, A100, and V100 systems. EcoSched achieves up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers, with modest performance overhead. These results show that jointly selecting GPU counts and coscheduling actions is essential for efficient multi-GPU workload execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents EcoSched, an online scheduler for multi-GPU HPC systems that uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency against idle resources, and incorporates NUMA-aware placement for coscheduling. It claims up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers on H100/A100/V100 platforms with diverse workloads and modest overhead.

Significance. If the empirical claims hold under rigorous validation, the work could offer practical value for energy-efficient GPU allocation in modern HPC by addressing heterogeneous nonlinear scaling. The purely empirical approach with direct measurements provides concrete numbers but requires stronger support for baselines and profiling reliability to elevate its impact.

major comments (2)
  1. [Evaluation (abstract and §4)] The central empirical claims of 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction lack supporting details on baseline schedulers, statistical significance testing, workload selection criteria, and controls for post-hoc tuning. This directly weakens the soundness of the reported gains over baselines.
  2. [Section 3] Section 3 describes the lightweight runtime profiling (short runs measuring throughput/energy for GPU counts 1/2/4/8) feeding the score-based policy, but provides no validation, error bounds, or sensitivity analysis for estimation accuracy under nonlinear effects such as memory saturation or NUMA interference during coscheduling. This assumption is load-bearing for the coscheduling decisions and the claimed improvements.
minor comments (1)
  1. [Abstract and §4] The abstract and evaluation could more explicitly define the EDP metric and the exact composition of the 'diverse workloads' to aid reproducibility.
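
For reference, EDP is conventionally the product of energy consumed and execution time. The sketch below computes it and the relative reduction against a baseline. Reading "makespan improvement" as a fractional reduction in makespan is an assumption, as are the function names and the sample numbers.

```python
def edp(energy_j, time_s):
    """Energy-delay product: energy (joules) times execution time (seconds)."""
    return energy_j * time_s

def reduction(baseline, candidate):
    """Fractional reduction of candidate relative to baseline (0.25 means 25% lower)."""
    return 1.0 - candidate / baseline

# Made-up baseline: 1000 J over 100 s. Candidate: 14.8% less energy, 30.1% less time.
base = edp(1000.0, 100.0)
cand = edp(1000.0 * (1 - 0.148), 100.0 * (1 - 0.301))
print(f"{reduction(base, cand):.3f}")  # 0.404
```

Under that reading, a configuration that cut energy by 14.8% and makespan by 30.1% at the same time would cut EDP by about 40.4% (1 - 0.852 × 0.699), which is consistent with the headline figure. The text here does not say whether the three "up to" values come from the same configuration.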

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the evaluation and validation sections can be strengthened without altering the core contributions or results.

  1. Referee: [Evaluation (abstract and §4)] The central empirical claims of 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction lack supporting details on baseline schedulers, statistical significance testing, workload selection criteria, and controls for post-hoc tuning. This directly weakens the soundness of the reported gains over baselines.

    Authors: We agree that greater transparency in these areas would strengthen the presentation. The manuscript describes the baseline schedulers and workload selection criteria in Section 4, but we acknowledge that statistical significance testing and explicit controls for post-hoc tuning were not detailed sufficiently. In the revised version, we expand Section 4 to include paired statistical tests with p-values for the reported improvements, explicit workload selection rationale based on scaling diversity, and a clarification that all policy parameters were fixed from separate profiling runs with no post-hoc adjustment on the evaluation data. These additions directly support the soundness of the claims. revision: yes

  2. Referee: [Section 3] Section 3 describes the lightweight runtime profiling (short runs measuring throughput/energy for GPU counts 1/2/4/8) feeding the score-based policy, but provides no validation, error bounds, or sensitivity analysis for estimation accuracy under nonlinear effects such as memory saturation or NUMA interference during coscheduling. This assumption is load-bearing for the coscheduling decisions and the claimed improvements.

    Authors: We concur that validation of the profiling step is essential given its role in the decisions. The original Section 3 focuses on the method description, but we have revised it to incorporate a dedicated validation subsection. This includes direct comparisons of short-run estimates against full-execution measurements, reported error bounds across the workload set, and sensitivity analysis addressing nonlinear scaling, memory saturation, and NUMA effects. The added analysis confirms that estimation errors remain bounded and do not undermine the coscheduling policy or overall gains. revision: yes
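
The first response promises paired statistical tests without naming one. As an illustration only, the sketch below uses an exact two-sided sign-flip permutation test on per-workload differences; this is a swapped-in technique chosen because it needs no distributional assumptions, not the test the authors state they use, and the sample energies are made up.

```python
from itertools import product

def paired_permutation_pvalue(baseline, treated):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [b - t for b, t in zip(baseline, treated)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so that float rounding does not drop exact ties.
        if permuted >= observed - 1e-12:
            hits += 1
    return hits / total

# Made-up per-workload energies (J) under a baseline scheduler and a candidate scheduler.
baseline = [100, 120, 90, 110, 105, 95]
treated = [90, 100, 85, 100, 98, 90]
print(paired_permutation_pvalue(baseline, treated))  # 0.03125
```

With six workloads that all improve, the smallest p-value this test can produce is 2/64, which is a reminder that small workload sets limit how strong any significance claim can be.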

Circularity Check

0 steps flagged

No circularity: purely empirical design and measurement

The paper describes an online scheduler (EcoSched) whose core mechanisms—lightweight runtime profiling of throughput/energy, a score-based policy for GPU count selection, and NUMA-aware placement—are presented as practical heuristics implemented and measured on real hardware (H100/A100/V100). No equations, derivations, fitted parameters, or predictions appear; all reported gains (energy, makespan, EDP) are direct experimental outcomes rather than quantities defined in terms of themselves. Self-citations, if present, are not load-bearing for any claimed result. The work is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; the abstract introduces no mathematical free parameters, axioms, or new postulated entities.


