pith. sign in

arxiv: 1907.04080 · v1 · pith:F3DC3TGRnew · submitted 2019-07-09 · 💻 cs.DC · cs.PF· cs.SY· eess.SY

Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution

Pith reviewed 2026-05-25 00:06 UTC · model grok-4.3

classification 💻 cs.DC cs.PFcs.SYeess.SY
keywords bi-objective optimizationworkload distributionheterogeneous platformsperformance and energyPareto-optimal solutionsdata-parallel applicationsdynamic energy profilessystem-level measurement
0
0 comments X

The pith

An exact global optimization algorithm finds all Pareto-optimal workload distributions for performance and energy on heterogeneous platforms from discrete system-level profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that bi-objective optimization of workload distribution for performance and energy on heterogeneous processors produces many Pareto-optimal solutions even when profiles are linear. Real applications exhibit non-linear, complex, non-smooth profiles, so the authors supply an efficient exact algorithm that accepts the most general discrete performance and dynamic energy profiles as input and returns the full set of optimal distributions. The same algorithm extends to total energy. A supporting methodology constructs the required discrete dynamic energy profiles using only system-level measurements, solving the problem of accurate per-device energy accounting in hybrid CPU-accelerator runs without component modeling or special hardware.

Core claim

The central claim is that an efficient and exact global optimisation algorithm, given only discrete performance and dynamic energy profiles of heterogeneous processors, solves the bi-objective optimisation problem for performance and energy (and separately for performance and total energy) by returning every Pareto-optimal workload distribution. The algorithm is fed by a novel system-level methodology that produces accurate discrete dynamic energy profiles of individual devices in a hybrid data-parallel application.

What carries the argument

The exact global optimisation algorithm that enumerates all Pareto-optimal workload distributions directly from arbitrary discrete performance and dynamic energy profiles of the heterogeneous processors.

If this is right

  • Even linear performance and energy profiles on heterogeneous processors yield a large number of Pareto-optimal workload distributions.
  • The algorithm applies unchanged to both dynamic energy and total energy objectives.
  • The system-level profile construction method works for real applications such as matrix multiplication and 2D-FFT on platforms mixing CPUs and accelerators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The algorithm could be embedded inside runtime schedulers to select workload splits at job submission time.
  • Non-smooth profile shapes imply that local search methods would miss many high-quality distributions that the global algorithm finds.
  • The same profile-driven approach might extend to three-objective problems that also include monetary cost.

Load-bearing premise

System-level measurements alone can produce accurate discrete dynamic energy profiles for each computing device inside a hybrid data-parallel application without any component-level modeling or specialized instrumentation.

What would settle it

Execute matrix multiplication or 2D-FFT with the workload distributions returned by the algorithm on the target heterogeneous platform and check whether the measured performance and dynamic energy values match the input profile predictions within the reported measurement uncertainty.

Figures

Figures reproduced from arXiv: 1907.04080 by Alexey Lastovetsky, Arsalan Shahid, Hamidreza Khaleghzadeh, Muhammad Fahad, Ravi Reddy Manumachu.

Figure 1
Figure 1. Figure 1: Linear execution time functions of the processors [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Linear dynamic energy functions of the processors [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Globally Pareto-optimal front of solutions for a workload size [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Speed functions of heterogeneous DGEMM application executing [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pareto-front solutions of heterogeneous DGEMM application for [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Speed functions of heterogeneous 2D-FFT application executing [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Dynamic energy functions of heterogeneous 2D-FFT application [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Pareto-front solutions of 2D-FFT for a given workload size, [PITH_FULL_IMAGE:figures/full_fig_p005_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Sample dynamic energy and times functions sorted in non [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The solution tree explored by the naive algorithm to find all distributions and its Pareto-optimal set for a workload [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Removing some data points from the profiles by applying the [PITH_FULL_IMAGE:figures/full_fig_p009_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Parallel and Combined dynamic energy profiles for matrix multi [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Parallel and Combined dynamic energy profiles for 2D-FFT [PITH_FULL_IMAGE:figures/full_fig_p014_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Globally Pareto-front solutions of execution time and total en [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Globally Pareto-front solutions of execution time and total en [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: The Linear time functions for given processors [PITH_FULL_IMAGE:figures/full_fig_p016_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: The Linear dynamic energy functions for given processors [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: The HEPOPTA solution tree for executing a sample set of five profiles (p = 5), each contains 2 data points. The memorization technique is only considered to reduce the full search space of solutions [PITH_FULL_IMAGE:figures/full_fig_p024_19.png] view at source ↗
read the original abstract

Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for performance and energy on heterogeneous processors results in a large number of Pareto-optimal optimal solutions (workload distributions) even in the simple case of linear performance and energy profiles. We then study performance and energy profiles of real-life data-parallel applications and find that their shapes are non-linear, complex and non-smooth. We, therefore, propose an efficient and exact global optimisation algorithm, which takes as an input most general discrete performance and dynamic energy profiles of the heterogeneous processors and solves the bi-objective optimisation problem. The algorithm is also used as a building block to solve the bi-objective optimisation problem for performance and total energy. We also propose a novel methodology to build discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses the fundamental challenge of accurate component-level energy modelling of a hybrid data-parallel application running on a heterogeneous platform integrating CPUs and accelerators. We experimentally validate the proposed method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that bi-objective optimization of performance and energy on heterogeneous platforms yields many Pareto-optimal workload distributions even for linear profiles, that real applications exhibit non-linear/non-smooth profiles, and that an efficient exact global optimization algorithm exists which takes arbitrary discrete performance and dynamic-energy profiles as input and solves the problem (also extended to total energy). It further claims a novel system-level measurement methodology that constructs accurate per-device discrete dynamic energy profiles for hybrid data-parallel applications without component-level models or specialized instrumentation, validated experimentally on matrix multiplication and 2D-FFT.

Significance. If the energy-profile methodology is accurate and the algorithm is indeed exact and efficient for general discrete profiles, the work would supply a practical, profile-driven approach to bi-objective optimization on heterogeneous CPU+accelerator platforms, a topic of clear importance. The experimental validation on two representative data-parallel kernels and the use of the algorithm as a building block for total-energy optimization are concrete strengths.

major comments (2)
  1. [Methodology for constructing discrete dynamic energy profiles (and its experimental validation)] The central claim that the proposed system-level methodology produces accurate discrete dynamic energy profiles for individual devices rests on the assumption that total-system power and execution-time data can be decomposed without modeling cross-device interactions or shared resources. The experimental validation section reports results only for the two applications and does not quantify isolation error, repeatability, or sensitivity to unmodeled overheads; this directly affects the correctness of the input profiles supplied to the optimization algorithm.
  2. [Description of the global optimisation algorithm] The abstract asserts that the algorithm is 'exact' for the most general discrete non-linear, non-smooth profiles, yet the provided description supplies neither a formal proof of exactness nor a complexity analysis. Because the algorithm is the core technical contribution and is used as a building block for the total-energy case, the absence of these supporting arguments is load-bearing.
minor comments (2)
  1. [Abstract] The abstract states that 'a large number of Pareto-optimal solutions' exist but supplies no quantitative counts or figures; adding these would clarify the scale of the search space the algorithm must handle.
  2. [Algorithm section] Notation for the discrete profiles (performance and energy) should be introduced with explicit symbols and domains before the algorithm is presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify areas where additional rigor would strengthen the manuscript. We address each below and commit to revisions that directly respond to the concerns without altering the core claims.

read point-by-point responses
  1. Referee: [Methodology for constructing discrete dynamic energy profiles (and its experimental validation)] The central claim that the proposed system-level methodology produces accurate discrete dynamic energy profiles for individual devices rests on the assumption that total-system power and execution-time data can be decomposed without modeling cross-device interactions or shared resources. The experimental validation section reports results only for the two applications and does not quantify isolation error, repeatability, or sensitivity to unmodeled overheads; this directly affects the correctness of the input profiles supplied to the optimization algorithm.

    Authors: We agree that quantitative assessment of isolation error, repeatability, and sensitivity to shared-resource overheads would increase confidence in the profiles. The methodology isolates devices by running controlled single-device experiments for each workload point while measuring total system power, but the current validation emphasizes end-to-end consistency on the two kernels rather than these metrics. In the revision we will add a dedicated subsection reporting (i) standard deviation of energy measurements over 10 repeated runs per point, (ii) an estimate of isolation error obtained by comparing summed single-device profiles against a simultaneous multi-device run at selected points, and (iii) a brief sensitivity discussion of memory-bandwidth contention. These additions will be placed immediately after the description of the profiling procedure. revision: yes

  2. Referee: [Description of the global optimisation algorithm] The abstract asserts that the algorithm is 'exact' for the most general discrete non-linear, non-smooth profiles, yet the provided description supplies neither a formal proof of exactness nor a complexity analysis. Because the algorithm is the core technical contribution and is used as a building block for the total-energy case, the absence of these supporting arguments is load-bearing.

    Authors: The algorithm is a dynamic-programming procedure that fills a table of Pareto-optimal partial solutions by iterating over devices and workload points; because every feasible combination is considered exactly once and dominance is applied only after exhaustive enumeration, the result is exact for arbitrary discrete profiles. We acknowledge that an explicit inductive proof and complexity statement are missing from the manuscript. In the revision we will insert a new subsection containing (a) a proof by induction on the number of devices that every non-dominated solution is retained, and (b) a complexity analysis showing O(D · W · P) time where D is the number of devices, W the maximum number of workload points per device, and P the size of the Pareto front maintained at each step. The same analysis will be referenced when the algorithm is reused for the total-energy formulation. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithm takes external profiles as input; methodology produces profiles from measurements

full rationale

The paper's central claims are an exact global optimization algorithm that accepts discrete performance and dynamic energy profiles as external inputs, plus a separate system-level measurement methodology to construct those profiles for hybrid applications. No equations, derivations, or steps in the provided text reduce the algorithm output or Pareto solutions to fitted parameters defined by the same data, nor do they rely on self-citation chains or imported uniqueness theorems. The profiles are obtained independently via system measurements and validated experimentally on matrix multiplication and 2D-FFT; the optimization step is therefore not equivalent to its inputs by construction. This is the normal case of a self-contained proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the existence of accurate discrete performance and dynamic energy profiles that can be obtained via the described system-level method; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5798 in / 1084 out tokens · 18620 ms · 2026-05-25T00:06:58.234598+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  1. [1]

    A multi-objective approach for workflow scheduling in hetero- geneous environments,

    H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, “A multi-objective approach for workflow scheduling in hetero- geneous environments,” in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Cc- grid 2012) , ser. CCGRID ’12. IEEE Computer Society, 2012, pp. 300–309

  2. [2]

    A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,

    Y. Kessaci, N. Melab, and E.-G. Talbi, “A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,” Cluster Computing , vol. 16, no. 3, pp. 451–468, Sep. 2013

  3. [3]

    Multi-objective energy- efficient workflow scheduling using list-based heuristics,

    J. J. Durillo, V . Nae, and R. Prodan, “Multi-objective energy- efficient workflow scheduling using list-based heuristics,” Future Generation Computer Systems, vol. 36, pp. 221 – 236, 2014

  4. [4]

    E-eco: Performance-aware energy-efficient cloud data center orchestration,

    F. D. Rossi, M. G. Xavier, C. A. De Rose, R. N. Calheiros, and R. Buyya, “E-eco: Performance-aware energy-efficient cloud data center orchestration,” Journal of Network and Computer Applications, vol. 78, pp. 83–96, 2017

  5. [5]

    Quantitative modeling of power performance tradeoffs on ex- treme scale systems,

    “Quantitative modeling of power performance tradeoffs on ex- treme scale systems,” Journal of Parallel and Distributed Computing , vol. 84, pp. 1 – 14, 2015

  6. [6]

    Power tuning HPC jobs on power-constrained systems,

    N. Gholkar, F. Mueller, and B. Rountree, “Power tuning HPC jobs on power-constrained systems,” in Proceedings of the 2016 International Conference on Parallel Architectures and Compilation . ACM, 2016, pp. 179–191

  7. [7]

    Bounding energy consumption in large- scale MPI programs,

    B. Rountree, D. K. Lowenthal, S. Funk, V . W. Freeh, B. R. de Supinski, and M. Schulz, “Bounding energy consumption in large- scale MPI programs,” in SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Nov 2007, pp. 1–9

  8. [8]

    Energy ef- ficient genetic-based schedulers in computational grids,

    J. Kołodziej, S. U. Khan, L. Wang, and A. Y. Zomaya, “Energy ef- ficient genetic-based schedulers in computational grids,” Concurr. Comput. : Pract. Exper., vol. 27, no. 4, pp. 809–829, Mar. 2015

  9. [9]

    Statistical power and performance modeling for optimizing the energy efficiency of scientific com- puting,

    B. Subramaniam and W. Feng, “Statistical power and performance modeling for optimizing the energy efficiency of scientific com- puting,” in 2010 IEEE/ACM Int’l Conference on Green Computing and Communications and Int’l Conference on Cyber, Physical and Social Computing, Dec 2010, pp. 139–146

  10. [10]

    Perfect strong scaling using no additional energy,

    J. Demmel, A. Gearhart, B. Lipshitz, and O. Schwartz, “Perfect strong scaling using no additional energy,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing , 2013, pp. 649–660

  11. [11]

    An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,

    J. Lang and G. Rnger, “An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,” Journal of Parallel and Distributed Computing, vol. 74, no. 9, pp. 2884 – 2897, 2014

  12. [12]

    A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,

    A. Chakrabarti, S. Parthasarathy, and C. Stewart, “A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,” in Parallel Processing (ICPP), 2017 46th International Conference on. IEEE, 2017, pp. 533– 542

  13. [13]

    New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,

    A. Lastovetsky and R. Reddy, “New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,” IEEE Transactions on Parallel and Distributed Systems , vol. 28, no. 4, pp. 1119–1133, April 2017

  14. [14]

    Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,

    R. R. Manumachu and A. Lastovetsky, “Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,” IEEE Transactions on Computers , vol. 67, no. 2, pp. 160–177, 2018

  15. [15]

    Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,

    R. Reddy Manumachu and A. L. Lastovetsky, “Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,” Concurrency and Computation: Practice and Experience, vol. 0, no. 0, p. e4958

  16. [16]

    Data partitioning on multicore and multi-GPU platforms using functional performance models,

    Z. Zhong, V . Rychkov, and A. Lastovetsky, “Data partitioning on multicore and multi-GPU platforms using functional performance models,” Computers, IEEE Transactions on, vol. 64, no. 9, pp. 2506– 2518, 2015

  17. [17]

    Evaluating the effectiveness of model-based power characterization,

    J. C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” in Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference , ser. USENIXATC’11. USENIX Association, 2011

  18. [18]

    A survey of power and energy predictive models in HPC systems and applications,

    K. O’Brien, I. Pietri, R. Reddy, A. Lastovetsky, and R. Sakellariou, “A survey of power and energy predictive models in HPC systems and applications,” ACM Computing Surveys, vol. 50, no. 3, 2017

  19. [19]

    Additivity: A selection criterion for performance events for reliable energy predictive modeling,

    A. Shahid, M. Fahad, R. Reddy, and A. Lastovetsky, “Additivity: A selection criterion for performance events for reliable energy predictive modeling,” Supercomputing Frontiers and Innovations , vol. 4, no. 4, 2017

  20. [20]

    Out- of-core implementation for accelerator kernels on heterogeneous clouds,

    H. Khaleghzadeh, Z. Zhong, R. Reddy, and A. Lastovetsky, “Out- of-core implementation for accelerator kernels on heterogeneous clouds,” The Journal of Supercomputing , vol. 74, no. 2, pp. 551–568, 2018

  21. [21]

    A survey and taxonomy of energy efficient resource management techniques in platform as a service cloud,

    S. F. Piraghaj, A. V . Dastjerdi, R. N. Calheiros, and R. Buyya, “A survey and taxonomy of energy efficient resource management techniques in platform as a service cloud,” in Handbook of Research on End-to-End Cloud Computing Architecture Design . IGI Global, 2017, pp. 410–454

  22. [22]

    Improving energy effi- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,

    K. Chen, J. Lenhardt, and W. Schiffmann, “Improving energy effi- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,” in Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on . IEEE, 2015, pp. 745–748

  23. [23]

    Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,

    A. Benoit, L. Lef `evre, A.-C. Orgerie, and I. Ra ¨ıs, “Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,” The Inter- national Journal of High Performance Computing Applications, vol. 32, no. 1, pp. 176–188, 2018

  24. [24]

    Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model,

    K.-H. Kim, K. Kim, and Q.-H. Park, “Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model,” Computer Physics Communications, vol. 182, no. 6, pp. 1201– 1207, 2011

  25. [25]

    Work- load partitioning for accelerating applications on heterogeneous platforms,

    J. Shen, A. L. Varbanescu, Y. Lu, P . Zou, and H. Sips, “Work- load partitioning for accelerating applications on heterogeneous platforms,” IEEE Transactions on Parallel and Distributed Systems , vol. 27, no. 9, pp. 2766–2780, 2016

  26. [26]

    Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,

    A. Kalinov and A. Lastovetsky, “Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,” Journal of Parallel and Distributed Com- puting, vol. 61, no. 4, pp. 520 – 535, 2001

  27. [27]

    Matrix multi- plication on heterogeneous platforms,

    O. Beaumont, V . Boudet, F. Rastello, and Y. Robert, “Matrix multi- plication on heterogeneous platforms,” IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 10, Oct. 2001

  28. [28]

    Data partitioning with a realistic performance model of networks of heterogeneous computers,

    A. L. Lastovetsky and R. Reddy, “Data partitioning with a realistic performance model of networks of heterogeneous computers,” in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 2004, p. 104

  29. [29]

    Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,

    A. Lastovetsky and R. Reddy, “Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,” Scientific Programming, vol. 13, no. 2, pp. 93–112, 2005

  30. [30]

    Data partitioning with a func- tional performance model of heterogeneous processors,

    L. A. Lastovetsky and R. Reddy, “Data partitioning with a func- tional performance model of heterogeneous processors,” Interna- tional Journal of High Performance Computing Applications , vol. 21, no. 1, pp. 76–90, 2007

  31. [31]

    Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,

    A. Lastovetsky, L. Szustak, and R. Wyrzykowski, “Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 787–797, 2017

  32. [32]

    A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,

    H. Khaleghzadeh, R. R. Manumachu, and A. Lastovetsky, “A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 10, pp. 2176–2190, 2018

  33. [33]

    A comparative study of methods for measurement of energy of computing,

    M. Fahad, A. Shahid, R. R. Manumachu, and A. Lastovetsky, “A comparative study of methods for measurement of energy of computing,” Energies, vol. 12, no. 11, p. 2204, 2019

  34. [34]

    Energy conservation in heterogeneous server clusters,

    T. Heath, B. Diniz, B. Horizonte, E. V . Carrera, and R. Bianchini, “Energy conservation in heterogeneous server clusters,” in 10th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP). ACM, 2005, pp. 186–195

  35. [35]

    Full- system power analysis and modeling for server environments,

    D. Economou, S. Rivoire, C. Kozyrakis, and P . Ranganathan, “Full- system power analysis and modeling for server environments,” in In Proceedings of Workshop on Modeling, Benchmarking, and Simula- tion, 2006, pp. 70–77

  36. [36]

    Complete system power estimation using processor performance events,

    W. L. Bircher and L. K. John, “Complete system power estimation using processor performance events,” IEEE Transactions on Com- puters, vol. 61, no. 4, pp. 563–577, Apr. 2012

  37. [37]

    A methodology to predict the power consumption of servers in data centres,

    R. Basmadjian, N. Ali, F. Niedermeier, H. de Meer, and G. Giuliani, “A methodology to predict the power consumption of servers in data centres,” in 2nd International Conference on Energy-Efficient Computing and Networking. ACM, 2011

  38. [38]

    An integrated GPU power and per- formance model,

    H. Hong, Sunpyand Kim, “An integrated GPU power and per- formance model,” SIGARCH Comput. Archit. News , vol. 38, no. 3, 2010

  39. [39]

    Runtime power monitoring in high- end processors: Methodology and empirical data,

    C. Isci and M. Martonosi, “Runtime power monitoring in high- end processors: Methodology and empirical data,” in 36th annual IEEE/ACM International Symposium on Microarchitecture . IEEE Computer Society, 2003, p. 93

  40. [40]

    Statistical power modeling of GPU kernels using perfor- mance counters,

    H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Mat- suoka, “Statistical power modeling of GPU kernels using perfor- mance counters,” in International Green Computing Conference and Workshops (IGCC). IEEE, 2010

  41. [41]

    A simplified and accurate model of power-performance efficiency on emergent GPU architectures,

    S. Song, C. Su, B. Rountree, and K. W. Cameron, “A simplified and accurate model of power-performance efficiency on emergent GPU architectures,” in 27th IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS) . IEEE Computer Society, 2013, pp. 673–686

  42. [42]

    Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,

    Y. S. Shao and D. Brooks, “Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ser. ISLPED ’13. IEEE Press, 2013

  43. [43]

    Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing,

    A. Beloglazov, J. Abawajy, and R. Buyya, “Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing,” Future Generation Computer Systems , vol. 28, no. 5, pp. 755 – 768, 2012, special Section: Energy efficiency in large-scale distributed systems

  44. [44]

    Algorithmic time, energy, and power on candidate HPC compute building blocks,

    J. Choi, M. Dukhan, X. Liu, and R. Vuduc, “Algorithmic time, energy, and power on candidate HPC compute building blocks,” in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 447–457

  45. [45]

    Energy trade-offs analysis using equal-energy maps,

    M. Drozdowski, J. M. Marszalkowski, and J. Marszalkowski, “Energy trade-offs analysis using equal-energy maps,” Future Generation Computer Systems, vol. 36, pp. 311–321, 2014

  46. [46]

    Time and energy performance of parallel systems with hierarchi- cal memory,

    J. M. Marszałkowski, M. Drozdowski, and J. Marszałkowski, “Time and energy performance of parallel systems with hierarchi- cal memory,” Journal of Grid Computing, vol. 14, no. 1, pp. 153–170, 2016

  47. [47]

    HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,

    Heterogeneous Computing Laboratory, University College Dublin, “HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,” 2016. [Online]. Available: https://csgitlab.ucd.ie/ucd-hcl/hclwattsup ¡¡¡¡¡¡¡ HEAD