Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution

Alexey Lastovetsky; Arsalan Shahid; Hamidreza Khaleghzadeh; Muhammad Fahad; Ravi Reddy Manumachu

arxiv: 1907.04080 · v1 · pith:F3DC3TGRnew · submitted 2019-07-09 · 💻 cs.DC · cs.PF· cs.SY· eess.SY

Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution

Hamidreza Khaleghzadeh , Muhammad Fahad , Arsalan Shahid , Ravi Reddy Manumachu , Alexey Lastovetsky This is my paper

Pith reviewed 2026-05-25 00:06 UTC · model grok-4.3

classification 💻 cs.DC cs.PFcs.SYeess.SY

keywords bi-objective optimizationworkload distributionheterogeneous platformsperformance and energyPareto-optimal solutionsdata-parallel applicationsdynamic energy profilessystem-level measurement

0 comments

The pith

An exact global optimization algorithm finds all Pareto-optimal workload distributions for performance and energy on heterogeneous platforms from discrete system-level profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that bi-objective optimization of workload distribution for performance and energy on heterogeneous processors produces many Pareto-optimal solutions even when profiles are linear. Real applications exhibit non-linear, complex, non-smooth profiles, so the authors supply an efficient exact algorithm that accepts the most general discrete performance and dynamic energy profiles as input and returns the full set of optimal distributions. The same algorithm extends to total energy. A supporting methodology constructs the required discrete dynamic energy profiles using only system-level measurements, solving the problem of accurate per-device energy accounting in hybrid CPU-accelerator runs without component modeling or special hardware.

Core claim

The central claim is that an efficient and exact global optimisation algorithm, given only discrete performance and dynamic energy profiles of heterogeneous processors, solves the bi-objective optimisation problem for performance and energy (and separately for performance and total energy) by returning every Pareto-optimal workload distribution. The algorithm is fed by a novel system-level methodology that produces accurate discrete dynamic energy profiles of individual devices in a hybrid data-parallel application.

What carries the argument

The exact global optimisation algorithm that enumerates all Pareto-optimal workload distributions directly from arbitrary discrete performance and dynamic energy profiles of the heterogeneous processors.

If this is right

Even linear performance and energy profiles on heterogeneous processors yield a large number of Pareto-optimal workload distributions.
The algorithm applies unchanged to both dynamic energy and total energy objectives.
The system-level profile construction method works for real applications such as matrix multiplication and 2D-FFT on platforms mixing CPUs and accelerators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The algorithm could be embedded inside runtime schedulers to select workload splits at job submission time.
Non-smooth profile shapes imply that local search methods would miss many high-quality distributions that the global algorithm finds.
The same profile-driven approach might extend to three-objective problems that also include monetary cost.

Load-bearing premise

System-level measurements alone can produce accurate discrete dynamic energy profiles for each computing device inside a hybrid data-parallel application without any component-level modeling or specialized instrumentation.

What would settle it

Execute matrix multiplication or 2D-FFT with the workload distributions returned by the algorithm on the target heterogeneous platform and check whether the measured performance and dynamic energy values match the input profile predictions within the reported measurement uncertainty.

Figures

Figures reproduced from arXiv: 1907.04080 by Alexey Lastovetsky, Arsalan Shahid, Hamidreza Khaleghzadeh, Muhammad Fahad, Ravi Reddy Manumachu.

**Figure 2.** Figure 2: Linear dynamic energy functions of the processors [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Globally Pareto-optimal front of solutions for a workload size [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 4.** Figure 4: Speed functions of heterogeneous DGEMM application executing [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 6.** Figure 6: Pareto-front solutions of heterogeneous DGEMM application for [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

**Figure 7.** Figure 7: Speed functions of heterogeneous 2D-FFT application executing [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 8.** Figure 8: Dynamic energy functions of heterogeneous 2D-FFT application [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗

**Figure 9.** Figure 9: Pareto-front solutions of 2D-FFT for a given workload size, [PITH_FULL_IMAGE:figures/full_fig_p005_9.png] view at source ↗

**Figure 10.** Figure 10: Sample dynamic energy and times functions sorted in non [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗

**Figure 11.** Figure 11: The solution tree explored by the naive algorithm to find all distributions and its Pareto-optimal set for a workload [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗

**Figure 12.** Figure 12: Removing some data points from the profiles by applying the [PITH_FULL_IMAGE:figures/full_fig_p009_12.png] view at source ↗

**Figure 13.** Figure 13: Parallel and Combined dynamic energy profiles for matrix multi [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗

**Figure 14.** Figure 14: Parallel and Combined dynamic energy profiles for 2D-FFT [PITH_FULL_IMAGE:figures/full_fig_p014_14.png] view at source ↗

**Figure 15.** Figure 15: Globally Pareto-front solutions of execution time and total en [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗

**Figure 16.** Figure 16: Globally Pareto-front solutions of execution time and total en [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗

**Figure 17.** Figure 17: The Linear time functions for given processors [PITH_FULL_IMAGE:figures/full_fig_p016_17.png] view at source ↗

**Figure 18.** Figure 18: The Linear dynamic energy functions for given processors [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗

**Figure 19.** Figure 19: The HEPOPTA solution tree for executing a sample set of five profiles (p = 5), each contains 2 data points. The memorization technique is only considered to reduce the full search space of solutions [PITH_FULL_IMAGE:figures/full_fig_p024_19.png] view at source ↗

read the original abstract

Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for performance and energy on heterogeneous processors results in a large number of Pareto-optimal optimal solutions (workload distributions) even in the simple case of linear performance and energy profiles. We then study performance and energy profiles of real-life data-parallel applications and find that their shapes are non-linear, complex and non-smooth. We, therefore, propose an efficient and exact global optimisation algorithm, which takes as an input most general discrete performance and dynamic energy profiles of the heterogeneous processors and solves the bi-objective optimisation problem. The algorithm is also used as a building block to solve the bi-objective optimisation problem for performance and total energy. We also propose a novel methodology to build discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses the fundamental challenge of accurate component-level energy modelling of a hybrid data-parallel application running on a heterogeneous platform integrating CPUs and accelerators. We experimentally validate the proposed method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers an exact algorithm for bi-objective workload distribution on heterogeneous platforms with non-linear profiles and pairs it with a system-level energy profiling method; the profiling step is the part that needs the most checking.

read the letter

The main thing here is an exact global optimization algorithm that takes arbitrary discrete performance and dynamic energy profiles from heterogeneous processors and returns the Pareto front for performance versus energy via workload distribution. It also builds on that to handle total energy and adds a profiling technique that relies only on system-level power measurements instead of per-component models or extra hardware. They test it on matrix multiplication and 2D-FFT, which gives some concrete grounding. That extension from the earlier homogeneous linear case to the general heterogeneous non-linear case is the clearest incremental step. The algorithm itself looks like a practical solver for a problem that schedulers actually face. The soft spot is the profiling methodology. It assumes you can cleanly separate per-device dynamic energy from total-system measurements in a hybrid run without modeling interactions or using device-specific tools. If cross-device effects, shared memory, or unaccounted overheads blur that separation, the profiles fed to the optimizer will be noisy and the claimed Pareto solutions will not match reality. The abstract gives no derivation or error analysis for the decomposition, so it is hard to tell how well it holds. This is the kind of paper that belongs in the energy-aware scheduling corner of HPC. Readers working on heterogeneous resource management would get value from the algorithm and the experimental setup, provided the profiling accuracy is demonstrated more convincingly. It is worth sending to peer review; the core problem is relevant and the approach is direct, even if the measurement claim will draw questions.

Referee Report

2 major / 2 minor

Summary. The paper claims that bi-objective optimization of performance and energy on heterogeneous platforms yields many Pareto-optimal workload distributions even for linear profiles, that real applications exhibit non-linear/non-smooth profiles, and that an efficient exact global optimization algorithm exists which takes arbitrary discrete performance and dynamic-energy profiles as input and solves the problem (also extended to total energy). It further claims a novel system-level measurement methodology that constructs accurate per-device discrete dynamic energy profiles for hybrid data-parallel applications without component-level models or specialized instrumentation, validated experimentally on matrix multiplication and 2D-FFT.

Significance. If the energy-profile methodology is accurate and the algorithm is indeed exact and efficient for general discrete profiles, the work would supply a practical, profile-driven approach to bi-objective optimization on heterogeneous CPU+accelerator platforms, a topic of clear importance. The experimental validation on two representative data-parallel kernels and the use of the algorithm as a building block for total-energy optimization are concrete strengths.

major comments (2)

[Methodology for constructing discrete dynamic energy profiles (and its experimental validation)] The central claim that the proposed system-level methodology produces accurate discrete dynamic energy profiles for individual devices rests on the assumption that total-system power and execution-time data can be decomposed without modeling cross-device interactions or shared resources. The experimental validation section reports results only for the two applications and does not quantify isolation error, repeatability, or sensitivity to unmodeled overheads; this directly affects the correctness of the input profiles supplied to the optimization algorithm.
[Description of the global optimisation algorithm] The abstract asserts that the algorithm is 'exact' for the most general discrete non-linear, non-smooth profiles, yet the provided description supplies neither a formal proof of exactness nor a complexity analysis. Because the algorithm is the core technical contribution and is used as a building block for the total-energy case, the absence of these supporting arguments is load-bearing.

minor comments (2)

[Abstract] The abstract states that 'a large number of Pareto-optimal solutions' exist but supplies no quantitative counts or figures; adding these would clarify the scale of the search space the algorithm must handle.
[Algorithm section] Notation for the discrete profiles (performance and energy) should be introduced with explicit symbols and domains before the algorithm is presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify areas where additional rigor would strengthen the manuscript. We address each below and commit to revisions that directly respond to the concerns without altering the core claims.

read point-by-point responses

Referee: [Methodology for constructing discrete dynamic energy profiles (and its experimental validation)] The central claim that the proposed system-level methodology produces accurate discrete dynamic energy profiles for individual devices rests on the assumption that total-system power and execution-time data can be decomposed without modeling cross-device interactions or shared resources. The experimental validation section reports results only for the two applications and does not quantify isolation error, repeatability, or sensitivity to unmodeled overheads; this directly affects the correctness of the input profiles supplied to the optimization algorithm.

Authors: We agree that quantitative assessment of isolation error, repeatability, and sensitivity to shared-resource overheads would increase confidence in the profiles. The methodology isolates devices by running controlled single-device experiments for each workload point while measuring total system power, but the current validation emphasizes end-to-end consistency on the two kernels rather than these metrics. In the revision we will add a dedicated subsection reporting (i) standard deviation of energy measurements over 10 repeated runs per point, (ii) an estimate of isolation error obtained by comparing summed single-device profiles against a simultaneous multi-device run at selected points, and (iii) a brief sensitivity discussion of memory-bandwidth contention. These additions will be placed immediately after the description of the profiling procedure. revision: yes
Referee: [Description of the global optimisation algorithm] The abstract asserts that the algorithm is 'exact' for the most general discrete non-linear, non-smooth profiles, yet the provided description supplies neither a formal proof of exactness nor a complexity analysis. Because the algorithm is the core technical contribution and is used as a building block for the total-energy case, the absence of these supporting arguments is load-bearing.

Authors: The algorithm is a dynamic-programming procedure that fills a table of Pareto-optimal partial solutions by iterating over devices and workload points; because every feasible combination is considered exactly once and dominance is applied only after exhaustive enumeration, the result is exact for arbitrary discrete profiles. We acknowledge that an explicit inductive proof and complexity statement are missing from the manuscript. In the revision we will insert a new subsection containing (a) a proof by induction on the number of devices that every non-dominated solution is retained, and (b) a complexity analysis showing O(D · W · P) time where D is the number of devices, W the maximum number of workload points per device, and P the size of the Pareto front maintained at each step. The same analysis will be referenced when the algorithm is reused for the total-energy formulation. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithm takes external profiles as input; methodology produces profiles from measurements

full rationale

The paper's central claims are an exact global optimization algorithm that accepts discrete performance and dynamic energy profiles as external inputs, plus a separate system-level measurement methodology to construct those profiles for hybrid applications. No equations, derivations, or steps in the provided text reduce the algorithm output or Pareto solutions to fitted parameters defined by the same data, nor do they rely on self-citation chains or imported uniqueness theorems. The profiles are obtained independently via system measurements and validated experimentally on matrix multiplication and 2D-FFT; the optimization step is therefore not equivalent to its inputs by construction. This is the normal case of a self-contained proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the existence of accurate discrete performance and dynamic energy profiles that can be obtained via the described system-level method; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5798 in / 1084 out tokens · 18620 ms · 2026-05-25T00:06:58.234598+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

[1]

A multi-objective approach for workﬂow scheduling in hetero- geneous environments,

H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, “A multi-objective approach for workﬂow scheduling in hetero- geneous environments,” in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Cc- grid 2012) , ser. CCGRID ’12. IEEE Computer Society, 2012, pp. 300–309

work page 2012
[2]

A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,

Y. Kessaci, N. Melab, and E.-G. Talbi, “A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,” Cluster Computing , vol. 16, no. 3, pp. 451–468, Sep. 2013

work page 2013
[3]

Multi-objective energy- efﬁcient workﬂow scheduling using list-based heuristics,

J. J. Durillo, V . Nae, and R. Prodan, “Multi-objective energy- efﬁcient workﬂow scheduling using list-based heuristics,” Future Generation Computer Systems, vol. 36, pp. 221 – 236, 2014

work page 2014
[4]

E-eco: Performance-aware energy-efﬁcient cloud data center orchestration,

F. D. Rossi, M. G. Xavier, C. A. De Rose, R. N. Calheiros, and R. Buyya, “E-eco: Performance-aware energy-efﬁcient cloud data center orchestration,” Journal of Network and Computer Applications, vol. 78, pp. 83–96, 2017

work page 2017
[5]

Quantitative modeling of power performance tradeoffs on ex- treme scale systems,

“Quantitative modeling of power performance tradeoffs on ex- treme scale systems,” Journal of Parallel and Distributed Computing , vol. 84, pp. 1 – 14, 2015

work page 2015
[6]

Power tuning HPC jobs on power-constrained systems,

N. Gholkar, F. Mueller, and B. Rountree, “Power tuning HPC jobs on power-constrained systems,” in Proceedings of the 2016 International Conference on Parallel Architectures and Compilation . ACM, 2016, pp. 179–191

work page 2016
[7]

Bounding energy consumption in large- scale MPI programs,

B. Rountree, D. K. Lowenthal, S. Funk, V . W. Freeh, B. R. de Supinski, and M. Schulz, “Bounding energy consumption in large- scale MPI programs,” in SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Nov 2007, pp. 1–9

work page 2007
[8]

Energy ef- ﬁcient genetic-based schedulers in computational grids,

J. Kołodziej, S. U. Khan, L. Wang, and A. Y. Zomaya, “Energy ef- ﬁcient genetic-based schedulers in computational grids,” Concurr. Comput. : Pract. Exper., vol. 27, no. 4, pp. 809–829, Mar. 2015

work page 2015
[9]

Statistical power and performance modeling for optimizing the energy efﬁciency of scientiﬁc com- puting,

B. Subramaniam and W. Feng, “Statistical power and performance modeling for optimizing the energy efﬁciency of scientiﬁc com- puting,” in 2010 IEEE/ACM Int’l Conference on Green Computing and Communications and Int’l Conference on Cyber, Physical and Social Computing, Dec 2010, pp. 139–146

work page 2010
[10]

Perfect strong scaling using no additional energy,

J. Demmel, A. Gearhart, B. Lipshitz, and O. Schwartz, “Perfect strong scaling using no additional energy,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing , 2013, pp. 649–660

work page 2013
[11]

An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,

J. Lang and G. Rnger, “An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,” Journal of Parallel and Distributed Computing, vol. 74, no. 9, pp. 2884 – 2897, 2014

work page 2014
[12]

A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,

A. Chakrabarti, S. Parthasarathy, and C. Stewart, “A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,” in Parallel Processing (ICPP), 2017 46th International Conference on. IEEE, 2017, pp. 533– 542

work page 2017
[13]

New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,

A. Lastovetsky and R. Reddy, “New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,” IEEE Transactions on Parallel and Distributed Systems , vol. 28, no. 4, pp. 1119–1133, April 2017

work page 2017
[14]

Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,

R. R. Manumachu and A. Lastovetsky, “Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,” IEEE Transactions on Computers , vol. 67, no. 2, pp. 160–177, 2018

work page 2018
[15]

Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,

R. Reddy Manumachu and A. L. Lastovetsky, “Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,” Concurrency and Computation: Practice and Experience, vol. 0, no. 0, p. e4958

work page
[16]

Data partitioning on multicore and multi-GPU platforms using functional performance models,

Z. Zhong, V . Rychkov, and A. Lastovetsky, “Data partitioning on multicore and multi-GPU platforms using functional performance models,” Computers, IEEE Transactions on, vol. 64, no. 9, pp. 2506– 2518, 2015

work page 2015
[17]

Evaluating the effectiveness of model-based power characterization,

J. C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” in Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference , ser. USENIXATC’11. USENIX Association, 2011

work page 2011
[18]

A survey of power and energy predictive models in HPC systems and applications,

K. O’Brien, I. Pietri, R. Reddy, A. Lastovetsky, and R. Sakellariou, “A survey of power and energy predictive models in HPC systems and applications,” ACM Computing Surveys, vol. 50, no. 3, 2017

work page 2017
[19]

Additivity: A selection criterion for performance events for reliable energy predictive modeling,

A. Shahid, M. Fahad, R. Reddy, and A. Lastovetsky, “Additivity: A selection criterion for performance events for reliable energy predictive modeling,” Supercomputing Frontiers and Innovations , vol. 4, no. 4, 2017

work page 2017
[20]

Out- of-core implementation for accelerator kernels on heterogeneous clouds,

H. Khaleghzadeh, Z. Zhong, R. Reddy, and A. Lastovetsky, “Out- of-core implementation for accelerator kernels on heterogeneous clouds,” The Journal of Supercomputing , vol. 74, no. 2, pp. 551–568, 2018

work page 2018
[21]

A survey and taxonomy of energy efﬁcient resource management techniques in platform as a service cloud,

S. F. Piraghaj, A. V . Dastjerdi, R. N. Calheiros, and R. Buyya, “A survey and taxonomy of energy efﬁcient resource management techniques in platform as a service cloud,” in Handbook of Research on End-to-End Cloud Computing Architecture Design . IGI Global, 2017, pp. 410–454

work page 2017
[22]

Improving energy efﬁ- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,

K. Chen, J. Lenhardt, and W. Schiffmann, “Improving energy efﬁ- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,” in Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on . IEEE, 2015, pp. 745–748

work page 2015
[23]

Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,

A. Benoit, L. Lef `evre, A.-C. Orgerie, and I. Ra ¨ıs, “Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,” The Inter- national Journal of High Performance Computing Applications, vol. 32, no. 1, pp. 176–188, 2018

work page 2018
[24]

Performance analysis and optimization of three-dimensional FDTD on GPU using rooﬂine model,

K.-H. Kim, K. Kim, and Q.-H. Park, “Performance analysis and optimization of three-dimensional FDTD on GPU using rooﬂine model,” Computer Physics Communications, vol. 182, no. 6, pp. 1201– 1207, 2011

work page 2011
[25]

Work- load partitioning for accelerating applications on heterogeneous platforms,

J. Shen, A. L. Varbanescu, Y. Lu, P . Zou, and H. Sips, “Work- load partitioning for accelerating applications on heterogeneous platforms,” IEEE Transactions on Parallel and Distributed Systems , vol. 27, no. 9, pp. 2766–2780, 2016

work page 2016
[26]

Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,

A. Kalinov and A. Lastovetsky, “Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,” Journal of Parallel and Distributed Com- puting, vol. 61, no. 4, pp. 520 – 535, 2001

work page 2001
[27]

Matrix multi- plication on heterogeneous platforms,

O. Beaumont, V . Boudet, F. Rastello, and Y. Robert, “Matrix multi- plication on heterogeneous platforms,” IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 10, Oct. 2001

work page 2001
[28]

Data partitioning with a realistic performance model of networks of heterogeneous computers,

A. L. Lastovetsky and R. Reddy, “Data partitioning with a realistic performance model of networks of heterogeneous computers,” in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 2004, p. 104

work page 2004
[29]

Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,

A. Lastovetsky and R. Reddy, “Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,” Scientiﬁc Programming, vol. 13, no. 2, pp. 93–112, 2005

work page 2005
[30]

Data partitioning with a func- tional performance model of heterogeneous processors,

L. A. Lastovetsky and R. Reddy, “Data partitioning with a func- tional performance model of heterogeneous processors,” Interna- tional Journal of High Performance Computing Applications , vol. 21, no. 1, pp. 76–90, 2007

work page 2007
[31]

Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,

A. Lastovetsky, L. Szustak, and R. Wyrzykowski, “Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 787–797, 2017

work page 2017
[32]

A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,

H. Khaleghzadeh, R. R. Manumachu, and A. Lastovetsky, “A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 10, pp. 2176–2190, 2018

work page 2018
[33]

A comparative study of methods for measurement of energy of computing,

M. Fahad, A. Shahid, R. R. Manumachu, and A. Lastovetsky, “A comparative study of methods for measurement of energy of computing,” Energies, vol. 12, no. 11, p. 2204, 2019

work page 2019
[34]

Energy conservation in heterogeneous server clusters,

T. Heath, B. Diniz, B. Horizonte, E. V . Carrera, and R. Bianchini, “Energy conservation in heterogeneous server clusters,” in 10th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP). ACM, 2005, pp. 186–195

work page 2005
[35]

Full- system power analysis and modeling for server environments,

D. Economou, S. Rivoire, C. Kozyrakis, and P . Ranganathan, “Full- system power analysis and modeling for server environments,” in In Proceedings of Workshop on Modeling, Benchmarking, and Simula- tion, 2006, pp. 70–77

work page 2006
[36]

Complete system power estimation using processor performance events,

W. L. Bircher and L. K. John, “Complete system power estimation using processor performance events,” IEEE Transactions on Com- puters, vol. 61, no. 4, pp. 563–577, Apr. 2012

work page 2012
[37]

A methodology to predict the power consumption of servers in data centres,

R. Basmadjian, N. Ali, F. Niedermeier, H. de Meer, and G. Giuliani, “A methodology to predict the power consumption of servers in data centres,” in 2nd International Conference on Energy-Efﬁcient Computing and Networking. ACM, 2011

work page 2011
[38]

An integrated GPU power and per- formance model,

H. Hong, Sunpyand Kim, “An integrated GPU power and per- formance model,” SIGARCH Comput. Archit. News , vol. 38, no. 3, 2010

work page 2010
[39]

Runtime power monitoring in high- end processors: Methodology and empirical data,

C. Isci and M. Martonosi, “Runtime power monitoring in high- end processors: Methodology and empirical data,” in 36th annual IEEE/ACM International Symposium on Microarchitecture . IEEE Computer Society, 2003, p. 93

work page 2003
[40]

Statistical power modeling of GPU kernels using perfor- mance counters,

H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Mat- suoka, “Statistical power modeling of GPU kernels using perfor- mance counters,” in International Green Computing Conference and Workshops (IGCC). IEEE, 2010

work page 2010
[41]

A simpliﬁed and accurate model of power-performance efﬁciency on emergent GPU architectures,

S. Song, C. Su, B. Rountree, and K. W. Cameron, “A simpliﬁed and accurate model of power-performance efﬁciency on emergent GPU architectures,” in 27th IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS) . IEEE Computer Society, 2013, pp. 673–686

work page 2013
[42]

Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,

Y. S. Shao and D. Brooks, “Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ser. ISLPED ’13. IEEE Press, 2013

work page 2013
[43]

Energy-aware resource allocation heuristics for efﬁcient management of data centers for cloud computing,

A. Beloglazov, J. Abawajy, and R. Buyya, “Energy-aware resource allocation heuristics for efﬁcient management of data centers for cloud computing,” Future Generation Computer Systems , vol. 28, no. 5, pp. 755 – 768, 2012, special Section: Energy efﬁciency in large-scale distributed systems

work page 2012
[44]

Algorithmic time, energy, and power on candidate HPC compute building blocks,

J. Choi, M. Dukhan, X. Liu, and R. Vuduc, “Algorithmic time, energy, and power on candidate HPC compute building blocks,” in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 447–457

work page 2014
[45]

Energy trade-offs analysis using equal-energy maps,

M. Drozdowski, J. M. Marszalkowski, and J. Marszalkowski, “Energy trade-offs analysis using equal-energy maps,” Future Generation Computer Systems, vol. 36, pp. 311–321, 2014

work page 2014
[46]

Time and energy performance of parallel systems with hierarchi- cal memory,

J. M. Marszałkowski, M. Drozdowski, and J. Marszałkowski, “Time and energy performance of parallel systems with hierarchi- cal memory,” Journal of Grid Computing, vol. 14, no. 1, pp. 153–170, 2016

work page 2016
[47]

HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,

Heterogeneous Computing Laboratory, University College Dublin, “HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,” 2016. [Online]. Available: https://csgitlab.ucd.ie/ucd-hcl/hclwattsup ¡¡¡¡¡¡¡ HEAD

work page 2016

[1] [1]

A multi-objective approach for workﬂow scheduling in hetero- geneous environments,

H. M. Fard, R. Prodan, J. J. D. Barrionuevo, and T. Fahringer, “A multi-objective approach for workﬂow scheduling in hetero- geneous environments,” in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Cc- grid 2012) , ser. CCGRID ’12. IEEE Computer Society, 2012, pp. 300–309

work page 2012

[2] [2]

A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,

Y. Kessaci, N. Melab, and E.-G. Talbi, “A pareto-based meta- heuristic for scheduling HPC applications on a geographically distributed cloud federation,” Cluster Computing , vol. 16, no. 3, pp. 451–468, Sep. 2013

work page 2013

[3] [3]

Multi-objective energy- efﬁcient workﬂow scheduling using list-based heuristics,

J. J. Durillo, V . Nae, and R. Prodan, “Multi-objective energy- efﬁcient workﬂow scheduling using list-based heuristics,” Future Generation Computer Systems, vol. 36, pp. 221 – 236, 2014

work page 2014

[4] [4]

E-eco: Performance-aware energy-efﬁcient cloud data center orchestration,

F. D. Rossi, M. G. Xavier, C. A. De Rose, R. N. Calheiros, and R. Buyya, “E-eco: Performance-aware energy-efﬁcient cloud data center orchestration,” Journal of Network and Computer Applications, vol. 78, pp. 83–96, 2017

work page 2017

[5] [5]

Quantitative modeling of power performance tradeoffs on ex- treme scale systems,

“Quantitative modeling of power performance tradeoffs on ex- treme scale systems,” Journal of Parallel and Distributed Computing , vol. 84, pp. 1 – 14, 2015

work page 2015

[6] [6]

Power tuning HPC jobs on power-constrained systems,

N. Gholkar, F. Mueller, and B. Rountree, “Power tuning HPC jobs on power-constrained systems,” in Proceedings of the 2016 International Conference on Parallel Architectures and Compilation . ACM, 2016, pp. 179–191

work page 2016

[7] [7]

Bounding energy consumption in large- scale MPI programs,

B. Rountree, D. K. Lowenthal, S. Funk, V . W. Freeh, B. R. de Supinski, and M. Schulz, “Bounding energy consumption in large- scale MPI programs,” in SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, Nov 2007, pp. 1–9

work page 2007

[8] [8]

Energy ef- ﬁcient genetic-based schedulers in computational grids,

J. Kołodziej, S. U. Khan, L. Wang, and A. Y. Zomaya, “Energy ef- ﬁcient genetic-based schedulers in computational grids,” Concurr. Comput. : Pract. Exper., vol. 27, no. 4, pp. 809–829, Mar. 2015

work page 2015

[9] [9]

Statistical power and performance modeling for optimizing the energy efﬁciency of scientiﬁc com- puting,

B. Subramaniam and W. Feng, “Statistical power and performance modeling for optimizing the energy efﬁciency of scientiﬁc com- puting,” in 2010 IEEE/ACM Int’l Conference on Green Computing and Communications and Int’l Conference on Cyber, Physical and Social Computing, Dec 2010, pp. 139–146

work page 2010

[10] [10]

Perfect strong scaling using no additional energy,

J. Demmel, A. Gearhart, B. Lipshitz, and O. Schwartz, “Perfect strong scaling using no additional energy,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing , 2013, pp. 649–660

work page 2013

[11] [11]

An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,

J. Lang and G. Rnger, “An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration,” Journal of Parallel and Distributed Computing, vol. 74, no. 9, pp. 2884 – 2897, 2014

work page 2014

[12] [12]

A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,

A. Chakrabarti, S. Parthasarathy, and C. Stewart, “A pareto frame- work for data analytics on heterogeneous systems: Implications for green energy usage and performance,” in Parallel Processing (ICPP), 2017 46th International Conference on. IEEE, 2017, pp. 533– 542

work page 2017

[13] [13]

New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,

A. Lastovetsky and R. Reddy, “New model-based methods and algorithms for performance and energy optimization of data parallel applications on homogeneous multicore clusters,” IEEE Transactions on Parallel and Distributed Systems , vol. 28, no. 4, pp. 1119–1133, April 2017

work page 2017

[14] [14]

Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,

R. R. Manumachu and A. Lastovetsky, “Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy,” IEEE Transactions on Computers , vol. 67, no. 2, pp. 160–177, 2018

work page 2018

[15] [15]

Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,

R. Reddy Manumachu and A. L. Lastovetsky, “Design of self- adaptable data parallel applications on multicore clusters auto- matically optimized for performance and energy through load distribution,” Concurrency and Computation: Practice and Experience, vol. 0, no. 0, p. e4958

work page

[16] [16]

Data partitioning on multicore and multi-GPU platforms using functional performance models,

Z. Zhong, V . Rychkov, and A. Lastovetsky, “Data partitioning on multicore and multi-GPU platforms using functional performance models,” Computers, IEEE Transactions on, vol. 64, no. 9, pp. 2506– 2518, 2015

work page 2015

[17] [17]

Evaluating the effectiveness of model-based power characterization,

J. C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” in Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference , ser. USENIXATC’11. USENIX Association, 2011

work page 2011

[18] [18]

A survey of power and energy predictive models in HPC systems and applications,

K. O’Brien, I. Pietri, R. Reddy, A. Lastovetsky, and R. Sakellariou, “A survey of power and energy predictive models in HPC systems and applications,” ACM Computing Surveys, vol. 50, no. 3, 2017

work page 2017

[19] [19]

Additivity: A selection criterion for performance events for reliable energy predictive modeling,

A. Shahid, M. Fahad, R. Reddy, and A. Lastovetsky, “Additivity: A selection criterion for performance events for reliable energy predictive modeling,” Supercomputing Frontiers and Innovations , vol. 4, no. 4, 2017

work page 2017

[20] [20]

Out- of-core implementation for accelerator kernels on heterogeneous clouds,

H. Khaleghzadeh, Z. Zhong, R. Reddy, and A. Lastovetsky, “Out- of-core implementation for accelerator kernels on heterogeneous clouds,” The Journal of Supercomputing , vol. 74, no. 2, pp. 551–568, 2018

work page 2018

[21] [21]

A survey and taxonomy of energy efﬁcient resource management techniques in platform as a service cloud,

S. F. Piraghaj, A. V . Dastjerdi, R. N. Calheiros, and R. Buyya, “A survey and taxonomy of energy efﬁcient resource management techniques in platform as a service cloud,” in Handbook of Research on End-to-End Cloud Computing Architecture Design . IGI Global, 2017, pp. 410–454

work page 2017

[22] [22]

Improving energy efﬁ- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,

K. Chen, J. Lenhardt, and W. Schiffmann, “Improving energy efﬁ- ciency of web servers by using a load distribution algorithm and shutting down idle nodes,” in Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on . IEEE, 2015, pp. 745–748

work page 2015

[23] [23]

Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,

A. Benoit, L. Lef `evre, A.-C. Orgerie, and I. Ra ¨ıs, “Reducing the energy consumption of large-scale computing systems through combined shutdown policies with multiple constraints,” The Inter- national Journal of High Performance Computing Applications, vol. 32, no. 1, pp. 176–188, 2018

work page 2018

[24] [24]

Performance analysis and optimization of three-dimensional FDTD on GPU using rooﬂine model,

K.-H. Kim, K. Kim, and Q.-H. Park, “Performance analysis and optimization of three-dimensional FDTD on GPU using rooﬂine model,” Computer Physics Communications, vol. 182, no. 6, pp. 1201– 1207, 2011

work page 2011

[25] [25]

Work- load partitioning for accelerating applications on heterogeneous platforms,

J. Shen, A. L. Varbanescu, Y. Lu, P . Zou, and H. Sips, “Work- load partitioning for accelerating applications on heterogeneous platforms,” IEEE Transactions on Parallel and Distributed Systems , vol. 27, no. 9, pp. 2766–2780, 2016

work page 2016

[26] [26]

Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,

A. Kalinov and A. Lastovetsky, “Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers,” Journal of Parallel and Distributed Com- puting, vol. 61, no. 4, pp. 520 – 535, 2001

work page 2001

[27] [27]

Matrix multi- plication on heterogeneous platforms,

O. Beaumont, V . Boudet, F. Rastello, and Y. Robert, “Matrix multi- plication on heterogeneous platforms,” IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 10, Oct. 2001

work page 2001

[28] [28]

Data partitioning with a realistic performance model of networks of heterogeneous computers,

A. L. Lastovetsky and R. Reddy, “Data partitioning with a realistic performance model of networks of heterogeneous computers,” in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. IEEE, 2004, p. 104

work page 2004

[29] [29]

Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,

A. Lastovetsky and R. Reddy, “Data partitioning for multipro- cessors with memory heterogeneity and memory constraints,” Scientiﬁc Programming, vol. 13, no. 2, pp. 93–112, 2005

work page 2005

[30] [30]

Data partitioning with a func- tional performance model of heterogeneous processors,

L. A. Lastovetsky and R. Reddy, “Data partitioning with a func- tional performance model of heterogeneous processors,” Interna- tional Journal of High Performance Computing Applications , vol. 21, no. 1, pp. 76–90, 2007

work page 2007

[31] [31]

Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,

A. Lastovetsky, L. Szustak, and R. Wyrzykowski, “Model-based optimization of EULAG kernel on Intel Xeon Phi through load 28 imbalancing,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 787–797, 2017

work page 2017

[32] [32]

A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,

H. Khaleghzadeh, R. R. Manumachu, and A. Lastovetsky, “A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 10, pp. 2176–2190, 2018

work page 2018

[33] [33]

A comparative study of methods for measurement of energy of computing,

M. Fahad, A. Shahid, R. R. Manumachu, and A. Lastovetsky, “A comparative study of methods for measurement of energy of computing,” Energies, vol. 12, no. 11, p. 2204, 2019

work page 2019

[34] [34]

Energy conservation in heterogeneous server clusters,

T. Heath, B. Diniz, B. Horizonte, E. V . Carrera, and R. Bianchini, “Energy conservation in heterogeneous server clusters,” in 10th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP). ACM, 2005, pp. 186–195

work page 2005

[35] [35]

Full- system power analysis and modeling for server environments,

D. Economou, S. Rivoire, C. Kozyrakis, and P . Ranganathan, “Full- system power analysis and modeling for server environments,” in In Proceedings of Workshop on Modeling, Benchmarking, and Simula- tion, 2006, pp. 70–77

work page 2006

[36] [36]

Complete system power estimation using processor performance events,

W. L. Bircher and L. K. John, “Complete system power estimation using processor performance events,” IEEE Transactions on Com- puters, vol. 61, no. 4, pp. 563–577, Apr. 2012

work page 2012

[37] [37]

A methodology to predict the power consumption of servers in data centres,

R. Basmadjian, N. Ali, F. Niedermeier, H. de Meer, and G. Giuliani, “A methodology to predict the power consumption of servers in data centres,” in 2nd International Conference on Energy-Efﬁcient Computing and Networking. ACM, 2011

work page 2011

[38] [38]

An integrated GPU power and per- formance model,

H. Hong, Sunpyand Kim, “An integrated GPU power and per- formance model,” SIGARCH Comput. Archit. News , vol. 38, no. 3, 2010

work page 2010

[39] [39]

Runtime power monitoring in high- end processors: Methodology and empirical data,

C. Isci and M. Martonosi, “Runtime power monitoring in high- end processors: Methodology and empirical data,” in 36th annual IEEE/ACM International Symposium on Microarchitecture . IEEE Computer Society, 2003, p. 93

work page 2003

[40] [40]

Statistical power modeling of GPU kernels using perfor- mance counters,

H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Mat- suoka, “Statistical power modeling of GPU kernels using perfor- mance counters,” in International Green Computing Conference and Workshops (IGCC). IEEE, 2010

work page 2010

[41] [41]

A simpliﬁed and accurate model of power-performance efﬁciency on emergent GPU architectures,

S. Song, C. Su, B. Rountree, and K. W. Cameron, “A simpliﬁed and accurate model of power-performance efﬁciency on emergent GPU architectures,” in 27th IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS) . IEEE Computer Society, 2013, pp. 673–686

work page 2013

[42] [42]

Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,

Y. S. Shao and D. Brooks, “Energy characterization and instruction-level energy model of Intel’s Xeon Phi processor,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ser. ISLPED ’13. IEEE Press, 2013

work page 2013

[43] [43]

Energy-aware resource allocation heuristics for efﬁcient management of data centers for cloud computing,

A. Beloglazov, J. Abawajy, and R. Buyya, “Energy-aware resource allocation heuristics for efﬁcient management of data centers for cloud computing,” Future Generation Computer Systems , vol. 28, no. 5, pp. 755 – 768, 2012, special Section: Energy efﬁciency in large-scale distributed systems

work page 2012

[44] [44]

Algorithmic time, energy, and power on candidate HPC compute building blocks,

J. Choi, M. Dukhan, X. Liu, and R. Vuduc, “Algorithmic time, energy, and power on candidate HPC compute building blocks,” in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 447–457

work page 2014

[45] [45]

Energy trade-offs analysis using equal-energy maps,

M. Drozdowski, J. M. Marszalkowski, and J. Marszalkowski, “Energy trade-offs analysis using equal-energy maps,” Future Generation Computer Systems, vol. 36, pp. 311–321, 2014

work page 2014

[46] [46]

Time and energy performance of parallel systems with hierarchi- cal memory,

J. M. Marszałkowski, M. Drozdowski, and J. Marszałkowski, “Time and energy performance of parallel systems with hierarchi- cal memory,” Journal of Grid Computing, vol. 14, no. 1, pp. 153–170, 2016

work page 2016

[47] [47]

HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,

Heterogeneous Computing Laboratory, University College Dublin, “HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter,” 2016. [Online]. Available: https://csgitlab.ucd.ie/ucd-hcl/hclwattsup ¡¡¡¡¡¡¡ HEAD

work page 2016