Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution
Pith reviewed 2026-05-25 00:06 UTC · model grok-4.3
The pith
An exact global optimization algorithm finds all Pareto-optimal workload distributions for performance and energy on heterogeneous platforms from discrete system-level profiles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an efficient and exact global optimisation algorithm, given only discrete performance and dynamic energy profiles of heterogeneous processors, solves the bi-objective optimisation problem for performance and energy (and separately for performance and total energy) by returning every Pareto-optimal workload distribution. The algorithm is fed by a novel system-level methodology that produces accurate discrete dynamic energy profiles of individual devices in a hybrid data-parallel application.
What carries the argument
The exact global optimisation algorithm that enumerates all Pareto-optimal workload distributions directly from arbitrary discrete performance and dynamic energy profiles of the heterogeneous processors.
If this is right
- Even linear performance and energy profiles on heterogeneous processors yield a large number of Pareto-optimal workload distributions.
- The algorithm applies unchanged to both dynamic energy and total energy objectives.
- The system-level profile construction method works for real applications such as matrix multiplication and 2D-FFT on platforms mixing CPUs and accelerators.
Where Pith is reading between the lines
- The algorithm could be embedded inside runtime schedulers to select workload splits at job submission time.
- Non-smooth profile shapes imply that local search methods would miss many high-quality distributions that the global algorithm finds.
- The same profile-driven approach might extend to three-objective problems that also include monetary cost.
Load-bearing premise
System-level measurements alone can produce accurate discrete dynamic energy profiles for each computing device inside a hybrid data-parallel application without any component-level modeling or specialized instrumentation.
What would settle it
Execute matrix multiplication or 2D-FFT with the workload distributions returned by the algorithm on the target heterogeneous platform and check whether the measured performance and dynamic energy values match the input profile predictions within the reported measurement uncertainty.
Figures
read the original abstract
Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for performance and energy on heterogeneous processors results in a large number of Pareto-optimal optimal solutions (workload distributions) even in the simple case of linear performance and energy profiles. We then study performance and energy profiles of real-life data-parallel applications and find that their shapes are non-linear, complex and non-smooth. We, therefore, propose an efficient and exact global optimisation algorithm, which takes as an input most general discrete performance and dynamic energy profiles of the heterogeneous processors and solves the bi-objective optimisation problem. The algorithm is also used as a building block to solve the bi-objective optimisation problem for performance and total energy. We also propose a novel methodology to build discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses the fundamental challenge of accurate component-level energy modelling of a hybrid data-parallel application running on a heterogeneous platform integrating CPUs and accelerators. We experimentally validate the proposed method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that bi-objective optimization of performance and energy on heterogeneous platforms yields many Pareto-optimal workload distributions even for linear profiles, that real applications exhibit non-linear/non-smooth profiles, and that an efficient exact global optimization algorithm exists which takes arbitrary discrete performance and dynamic-energy profiles as input and solves the problem (also extended to total energy). It further claims a novel system-level measurement methodology that constructs accurate per-device discrete dynamic energy profiles for hybrid data-parallel applications without component-level models or specialized instrumentation, validated experimentally on matrix multiplication and 2D-FFT.
Significance. If the energy-profile methodology is accurate and the algorithm is indeed exact and efficient for general discrete profiles, the work would supply a practical, profile-driven approach to bi-objective optimization on heterogeneous CPU+accelerator platforms, a topic of clear importance. The experimental validation on two representative data-parallel kernels and the use of the algorithm as a building block for total-energy optimization are concrete strengths.
major comments (2)
- [Methodology for constructing discrete dynamic energy profiles (and its experimental validation)] The central claim that the proposed system-level methodology produces accurate discrete dynamic energy profiles for individual devices rests on the assumption that total-system power and execution-time data can be decomposed without modeling cross-device interactions or shared resources. The experimental validation section reports results only for the two applications and does not quantify isolation error, repeatability, or sensitivity to unmodeled overheads; this directly affects the correctness of the input profiles supplied to the optimization algorithm.
- [Description of the global optimisation algorithm] The abstract asserts that the algorithm is 'exact' for the most general discrete non-linear, non-smooth profiles, yet the provided description supplies neither a formal proof of exactness nor a complexity analysis. Because the algorithm is the core technical contribution and is used as a building block for the total-energy case, the absence of these supporting arguments is load-bearing.
minor comments (2)
- [Abstract] The abstract states that 'a large number of Pareto-optimal solutions' exist but supplies no quantitative counts or figures; adding these would clarify the scale of the search space the algorithm must handle.
- [Algorithm section] Notation for the discrete profiles (performance and energy) should be introduced with explicit symbols and domains before the algorithm is presented.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments identify areas where additional rigor would strengthen the manuscript. We address each below and commit to revisions that directly respond to the concerns without altering the core claims.
read point-by-point responses
-
Referee: [Methodology for constructing discrete dynamic energy profiles (and its experimental validation)] The central claim that the proposed system-level methodology produces accurate discrete dynamic energy profiles for individual devices rests on the assumption that total-system power and execution-time data can be decomposed without modeling cross-device interactions or shared resources. The experimental validation section reports results only for the two applications and does not quantify isolation error, repeatability, or sensitivity to unmodeled overheads; this directly affects the correctness of the input profiles supplied to the optimization algorithm.
Authors: We agree that quantitative assessment of isolation error, repeatability, and sensitivity to shared-resource overheads would increase confidence in the profiles. The methodology isolates devices by running controlled single-device experiments for each workload point while measuring total system power, but the current validation emphasizes end-to-end consistency on the two kernels rather than these metrics. In the revision we will add a dedicated subsection reporting (i) standard deviation of energy measurements over 10 repeated runs per point, (ii) an estimate of isolation error obtained by comparing summed single-device profiles against a simultaneous multi-device run at selected points, and (iii) a brief sensitivity discussion of memory-bandwidth contention. These additions will be placed immediately after the description of the profiling procedure. revision: yes
-
Referee: [Description of the global optimisation algorithm] The abstract asserts that the algorithm is 'exact' for the most general discrete non-linear, non-smooth profiles, yet the provided description supplies neither a formal proof of exactness nor a complexity analysis. Because the algorithm is the core technical contribution and is used as a building block for the total-energy case, the absence of these supporting arguments is load-bearing.
Authors: The algorithm is a dynamic-programming procedure that fills a table of Pareto-optimal partial solutions by iterating over devices and workload points; because every feasible combination is considered exactly once and dominance is applied only after exhaustive enumeration, the result is exact for arbitrary discrete profiles. We acknowledge that an explicit inductive proof and complexity statement are missing from the manuscript. In the revision we will insert a new subsection containing (a) a proof by induction on the number of devices that every non-dominated solution is retained, and (b) a complexity analysis showing O(D · W · P) time where D is the number of devices, W the maximum number of workload points per device, and P the size of the Pareto front maintained at each step. The same analysis will be referenced when the algorithm is reused for the total-energy formulation. revision: yes
Circularity Check
No circularity: algorithm takes external profiles as input; methodology produces profiles from measurements
full rationale
The paper's central claims are an exact global optimization algorithm that accepts discrete performance and dynamic energy profiles as external inputs, plus a separate system-level measurement methodology to construct those profiles for hybrid applications. No equations, derivations, or steps in the provided text reduce the algorithm output or Pareto solutions to fitted parameters defined by the same data, nor do they rely on self-citation chains or imported uniqueness theorems. The profiles are obtained independently via system measurements and validated experimentally on matrix multiplication and 2D-FFT; the optimization step is therefore not equivalent to its inputs by construction. This is the normal case of a self-contained proposal against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A multi-objective approach for workflow scheduling in hetero- geneous environments,
H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, “A multi-objective approach for workflow scheduling in hetero- geneous environments,” in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Cc- grid 2012) , ser. CCGRID ’12. IEEE Computer Society, 2012, pp. 300–309
work page 2012
-
[2]
Y. Kessaci, N. Melab, and E.-G. Talbi, “A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,” Cluster Computing , vol. 16, no. 3, pp. 451–468, Sep. 2013
work page 2013
-
[3]
Multi-objective energy- efficient workflow scheduling using list-based heuristics,
J. J. Durillo, V . Nae, and R. Prodan, “Multi-objective energy- efficient workflow scheduling using list-based heuristics,” Future Generation Computer Systems, vol. 36, pp. 221 – 236, 2014
work page 2014
-
[4]
E-eco: Performance-aware energy-efficient cloud data center orchestration,
F. D. Rossi, M. G. Xavier, C. A. De Rose, R. N. Calheiros, and R. Buyya, “E-eco: Performance-aware energy-efficient cloud data center orchestration,” Journal of Network and Computer Applications, vol. 78, pp. 83–96, 2017
work page 2017
-
[5]
Quantitative modeling of power performance tradeoffs on ex- treme scale systems,
“Quantitative modeling of power performance tradeoffs on ex- treme scale systems,” Journal of Parallel and Distributed Computing , vol. 84, pp. 1 – 14, 2015
work page 2015
-
[6]
Power tuning HPC jobs on power-constrained systems,
N. Gholkar, F. Mueller, and B. Rountree, “Power tuning HPC jobs on power-constrained systems,” in Proceedings of the 2016 International Conference on Parallel Architectures and Compilation . ACM, 2016, pp. 179–191
work page 2016
-
[7]
Bounding energy consumption in large- scale MPI programs,
B. Rountree, D. K. Lowenthal, S. Funk, V . W. Freeh, B. R. de Supinski, and M. Schulz, “Bounding energy consumption in large- scale MPI programs,” in SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Nov 2007, pp. 1–9
work page 2007
-
[8]
Energy ef- ficient genetic-based schedulers in computational grids,
J. Kołodziej, S. U. Khan, L. Wang, and A. Y. Zomaya, “Energy ef- ficient genetic-based schedulers in computational grids,” Concurr. Comput. : Pract. Exper., vol. 27, no. 4, pp. 809–829, Mar. 2015
work page 2015
-
[9]
B. Subramaniam and W. Feng, “Statistical power and performance modeling for optimizing the energy efficiency of scientific com- puting,” in 2010 IEEE/ACM Int’l Conference on Green Computing and Communications and Int’l Conference on Cyber, Physical and Social Computing, Dec 2010, pp. 139–146
work page 2010
-
[10]
Perfect strong scaling using no additional energy,
J. Demmel, A. Gearhart, B. Lipshitz, and O. Schwartz, “Perfect strong scaling using no additional energy,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing , 2013, pp. 649–660
work page 2013
-
[11]
J. Lang and G. Rnger, “An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,” Journal of Parallel and Distributed Computing, vol. 74, no. 9, pp. 2884 – 2897, 2014
work page 2014
-
[12]
A. Chakrabarti, S. Parthasarathy, and C. Stewart, “A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,” in Parallel Processing (ICPP), 2017 46th International Conference on. IEEE, 2017, pp. 533– 542
work page 2017
-
[13]
A. Lastovetsky and R. Reddy, “New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,” IEEE Transactions on Parallel and Distributed Systems , vol. 28, no. 4, pp. 1119–1133, April 2017
work page 2017
-
[14]
R. R. Manumachu and A. Lastovetsky, “Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,” IEEE Transactions on Computers , vol. 67, no. 2, pp. 160–177, 2018
work page 2018
-
[15]
R. Reddy Manumachu and A. L. Lastovetsky, “Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,” Concurrency and Computation: Practice and Experience, vol. 0, no. 0, p. e4958
-
[16]
Data partitioning on multicore and multi-GPU platforms using functional performance models,
Z. Zhong, V . Rychkov, and A. Lastovetsky, “Data partitioning on multicore and multi-GPU platforms using functional performance models,” Computers, IEEE Transactions on, vol. 64, no. 9, pp. 2506– 2518, 2015
work page 2015
-
[17]
Evaluating the effectiveness of model-based power characterization,
J. C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” in Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference , ser. USENIXATC’11. USENIX Association, 2011
work page 2011
-
[18]
A survey of power and energy predictive models in HPC systems and applications,
K. O’Brien, I. Pietri, R. Reddy, A. Lastovetsky, and R. Sakellariou, “A survey of power and energy predictive models in HPC systems and applications,” ACM Computing Surveys, vol. 50, no. 3, 2017
work page 2017
-
[19]
Additivity: A selection criterion for performance events for reliable energy predictive modeling,
A. Shahid, M. Fahad, R. Reddy, and A. Lastovetsky, “Additivity: A selection criterion for performance events for reliable energy predictive modeling,” Supercomputing Frontiers and Innovations , vol. 4, no. 4, 2017
work page 2017
-
[20]
Out- of-core implementation for accelerator kernels on heterogeneous clouds,
H. Khaleghzadeh, Z. Zhong, R. Reddy, and A. Lastovetsky, “Out- of-core implementation for accelerator kernels on heterogeneous clouds,” The Journal of Supercomputing , vol. 74, no. 2, pp. 551–568, 2018
work page 2018
-
[21]
S. F. Piraghaj, A. V . Dastjerdi, R. N. Calheiros, and R. Buyya, “A survey and taxonomy of energy efficient resource management techniques in platform as a service cloud,” in Handbook of Research on End-to-End Cloud Computing Architecture Design . IGI Global, 2017, pp. 410–454
work page 2017
-
[22]
K. Chen, J. Lenhardt, and W. Schiffmann, “Improving energy effi- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,” in Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on . IEEE, 2015, pp. 745–748
work page 2015
-
[23]
A. Benoit, L. Lef `evre, A.-C. Orgerie, and I. Ra ¨ıs, “Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,” The Inter- national Journal of High Performance Computing Applications, vol. 32, no. 1, pp. 176–188, 2018
work page 2018
-
[24]
Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model,
K.-H. Kim, K. Kim, and Q.-H. Park, “Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model,” Computer Physics Communications, vol. 182, no. 6, pp. 1201– 1207, 2011
work page 2011
-
[25]
Work- load partitioning for accelerating applications on heterogeneous platforms,
J. Shen, A. L. Varbanescu, Y. Lu, P . Zou, and H. Sips, “Work- load partitioning for accelerating applications on heterogeneous platforms,” IEEE Transactions on Parallel and Distributed Systems , vol. 27, no. 9, pp. 2766–2780, 2016
work page 2016
-
[26]
A. Kalinov and A. Lastovetsky, “Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,” Journal of Parallel and Distributed Com- puting, vol. 61, no. 4, pp. 520 – 535, 2001
work page 2001
-
[27]
Matrix multi- plication on heterogeneous platforms,
O. Beaumont, V . Boudet, F. Rastello, and Y. Robert, “Matrix multi- plication on heterogeneous platforms,” IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 10, Oct. 2001
work page 2001
-
[28]
Data partitioning with a realistic performance model of networks of heterogeneous computers,
A. L. Lastovetsky and R. Reddy, “Data partitioning with a realistic performance model of networks of heterogeneous computers,” in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 2004, p. 104
work page 2004
-
[29]
Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,
A. Lastovetsky and R. Reddy, “Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,” Scientific Programming, vol. 13, no. 2, pp. 93–112, 2005
work page 2005
-
[30]
Data partitioning with a func- tional performance model of heterogeneous processors,
L. A. Lastovetsky and R. Reddy, “Data partitioning with a func- tional performance model of heterogeneous processors,” Interna- tional Journal of High Performance Computing Applications , vol. 21, no. 1, pp. 76–90, 2007
work page 2007
-
[31]
Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,
A. Lastovetsky, L. Szustak, and R. Wyrzykowski, “Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 787–797, 2017
work page 2017
-
[32]
H. Khaleghzadeh, R. R. Manumachu, and A. Lastovetsky, “A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 10, pp. 2176–2190, 2018
work page 2018
-
[33]
A comparative study of methods for measurement of energy of computing,
M. Fahad, A. Shahid, R. R. Manumachu, and A. Lastovetsky, “A comparative study of methods for measurement of energy of computing,” Energies, vol. 12, no. 11, p. 2204, 2019
work page 2019
-
[34]
Energy conservation in heterogeneous server clusters,
T. Heath, B. Diniz, B. Horizonte, E. V . Carrera, and R. Bianchini, “Energy conservation in heterogeneous server clusters,” in 10th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP). ACM, 2005, pp. 186–195
work page 2005
-
[35]
Full- system power analysis and modeling for server environments,
D. Economou, S. Rivoire, C. Kozyrakis, and P . Ranganathan, “Full- system power analysis and modeling for server environments,” in In Proceedings of Workshop on Modeling, Benchmarking, and Simula- tion, 2006, pp. 70–77
work page 2006
-
[36]
Complete system power estimation using processor performance events,
W. L. Bircher and L. K. John, “Complete system power estimation using processor performance events,” IEEE Transactions on Com- puters, vol. 61, no. 4, pp. 563–577, Apr. 2012
work page 2012
-
[37]
A methodology to predict the power consumption of servers in data centres,
R. Basmadjian, N. Ali, F. Niedermeier, H. de Meer, and G. Giuliani, “A methodology to predict the power consumption of servers in data centres,” in 2nd International Conference on Energy-Efficient Computing and Networking. ACM, 2011
work page 2011
-
[38]
An integrated GPU power and per- formance model,
H. Hong, Sunpyand Kim, “An integrated GPU power and per- formance model,” SIGARCH Comput. Archit. News , vol. 38, no. 3, 2010
work page 2010
-
[39]
Runtime power monitoring in high- end processors: Methodology and empirical data,
C. Isci and M. Martonosi, “Runtime power monitoring in high- end processors: Methodology and empirical data,” in 36th annual IEEE/ACM International Symposium on Microarchitecture . IEEE Computer Society, 2003, p. 93
work page 2003
-
[40]
Statistical power modeling of GPU kernels using perfor- mance counters,
H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Mat- suoka, “Statistical power modeling of GPU kernels using perfor- mance counters,” in International Green Computing Conference and Workshops (IGCC). IEEE, 2010
work page 2010
-
[41]
A simplified and accurate model of power-performance efficiency on emergent GPU architectures,
S. Song, C. Su, B. Rountree, and K. W. Cameron, “A simplified and accurate model of power-performance efficiency on emergent GPU architectures,” in 27th IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS) . IEEE Computer Society, 2013, pp. 673–686
work page 2013
-
[42]
Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,
Y. S. Shao and D. Brooks, “Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ser. ISLPED ’13. IEEE Press, 2013
work page 2013
-
[43]
A. Beloglazov, J. Abawajy, and R. Buyya, “Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing,” Future Generation Computer Systems , vol. 28, no. 5, pp. 755 – 768, 2012, special Section: Energy efficiency in large-scale distributed systems
work page 2012
-
[44]
Algorithmic time, energy, and power on candidate HPC compute building blocks,
J. Choi, M. Dukhan, X. Liu, and R. Vuduc, “Algorithmic time, energy, and power on candidate HPC compute building blocks,” in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 447–457
work page 2014
-
[45]
Energy trade-offs analysis using equal-energy maps,
M. Drozdowski, J. M. Marszalkowski, and J. Marszalkowski, “Energy trade-offs analysis using equal-energy maps,” Future Generation Computer Systems, vol. 36, pp. 311–321, 2014
work page 2014
-
[46]
Time and energy performance of parallel systems with hierarchi- cal memory,
J. M. Marszałkowski, M. Drozdowski, and J. Marszałkowski, “Time and energy performance of parallel systems with hierarchi- cal memory,” Journal of Grid Computing, vol. 14, no. 1, pp. 153–170, 2016
work page 2016
-
[47]
HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,
Heterogeneous Computing Laboratory, University College Dublin, “HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,” 2016. [Online]. Available: https://csgitlab.ucd.ie/ucd-hcl/hclwattsup ¡¡¡¡¡¡¡ HEAD
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.