On the energy efficiency of sparse matrix computations on multi-GPU clusters

Alessandro Celestini; Giorgio Richelli; Massimo Bernaschi; Pasqua D'Ambra

arxiv: 2510.02878 · v2 · submitted 2025-10-03 · 💻 cs.DC · cs.MS · cs.PF

Pith reviewed 2026-05-18 10:20 UTC · model grok-4.3

classification 💻 cs.DC · cs.MS · cs.PF

keywords energy efficiency · sparse matrix · multi-GPU · HPC · parallel computing · linear systems · sustainability

The pith

Optimizing GPU computations and minimizing data movement across nodes reduces both runtime and energy use for large sparse linear systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how a library for sparse matrix computations on clusters of GPUs achieves energy savings while solving very large linear systems that exceed single-node memory. It extends earlier performance results by adding detailed runtime energy measurements of the library's main parts. The work shows that careful design to increase parallelism and reduce data transfers between memory and nodes cuts both the time needed to finish and the total energy drawn. These gains also appear as clear advantages compared with other available software on common test problems. Readers interested in running large scientific simulations on modern high-performance machines would care because energy use now limits how far such calculations can scale.

Core claim

The library achieves energy-efficient execution of sparse linear system solves on multi-GPU platforms by exposing high parallelism in the algorithms and by optimizing implementations to limit data movement across memory hierarchies and compute nodes. Runtime energy profiles of the core components confirm that these choices lower both time-to-solution and energy consumption relative to less optimized approaches, while delivering measurable improvements over comparable frameworks on standard benchmarks.
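The figure captions on this page name SpMV (sparse matrix-vector product) and the conjugate gradient (CG) method as profiled components. As a rough illustration of what those kernels compute, the following is a minimal single-process sketch in pure Python. It is not the library's multi-GPU implementation: the CSR layout, the function names, and the small 1D Laplacian test matrix are assumptions chosen only for the example.

```python
def spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=1000):
    """Un-preconditioned CG for a symmetric positive definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the zero initial guess
    p = list(r)            # first search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 <= tol:
            return x, it
        ap = spmv(row_ptr, col_idx, vals, p)   # one SpMV per iteration
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


def laplacian_1d(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1) in CSR format."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


n = 8
A = laplacian_1d(n)
b = spmv(*A, [1.0] * n)    # right-hand side whose exact solution is all ones
x, iters = cg(*A, b)
```

Each CG iteration performs one SpMV and two dot products. In a distributed setting, those are typically the steps that require data exchange between processes, which is why data movement matters for both runtime and energy.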

What carries the argument

Methods that expose high parallelism in sparse matrix operations while optimizing data movement for efficient multi-GPU execution, paired with runtime tools for accurate energy measurement of those components.

Load-bearing premise

The energy measurement tools record true consumption without meaningful overhead or bias, and the chosen benchmarks reflect typical large-scale sparse linear system workloads.
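To make this premise concrete: a runtime energy figure is commonly obtained by sampling instantaneous power and integrating it over the execution window. The sketch below is a generic illustration under stated assumptions, not the authors' measurement tool. The function names, the 0.1 s sampling period, the 50 W idle baseline, and the reading of "dynamic energy" as energy above that idle baseline are all hypothetical choices made for the example.

```python
def integrate_energy(samples):
    """Trapezoidal integral of (time_s, power_w) samples, in joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def dynamic_energy(samples, idle_power_w):
    """Energy above an assumed idle baseline over the sampled interval."""
    duration = samples[-1][0] - samples[0][0]
    return integrate_energy(samples) - idle_power_w * duration


# Hypothetical power trace: the device ramps from idle to load and back.
trace = [(0.0, 50.0), (0.1, 150.0), (0.2, 150.0), (0.3, 50.0)]
total = integrate_energy(trace)                 # about 35 J
dyn = dynamic_energy(trace, idle_power_w=50.0)  # about 20 J
```

The sketch also shows where bias can enter: a coarse sampling period misses short power spikes, and the choice of idle baseline directly shifts the dynamic figure.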

What would settle it

Direct comparison of measured energy draw and runtime on the same multi-GPU cluster using a different sparse solver library that does not apply the same data-movement reductions, on the same set of benchmark matrices.
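Such a comparison reduces to a few ratios once runtime and energy are recorded for each solver on the same matrix. The following sketch shows one way to tabulate them. The `Run` record, the solver names, and every number in it are placeholders invented for illustration, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    runtime_s: float   # time-to-solution in seconds
    energy_j: float    # measured energy in joules
    dofs: int          # degrees of freedom of the linear system

    @property
    def energy_per_dof(self):
        return self.energy_j / self.dofs

    @property
    def mean_power_w(self):
        return self.energy_j / self.runtime_s


def compare(baseline, candidate):
    """Speedup and energy ratio of candidate relative to baseline."""
    return {
        "speedup": baseline.runtime_s / candidate.runtime_s,
        "energy_ratio": candidate.energy_j / baseline.energy_j,
    }


# Placeholder values only: same problem size, two hypothetical solvers.
solver_a = Run("solver_a", runtime_s=10.0, energy_j=2000.0, dofs=1_000_000)
solver_b = Run("solver_b", runtime_s=5.0, energy_j=1200.0, dofs=1_000_000)
result = compare(solver_a, solver_b)
```

With these placeholder values, the faster run draws a higher mean power (240 W against 200 W) yet uses less energy overall, because energy is mean power multiplied by runtime. This is why peak power and energy-to-solution need to be reported separately.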

Figures

Figures reproduced from arXiv: 2510.02878 by Alessandro Celestini, Giorgio Richelli, Massimo Bernaschi, Pasqua D'Ambra.

**Figure 2.** Power–time profile of the SpMV kernel measured within the BootC... [PITH_FULL_IMAGE:figures/full_fig_p010_2.png]

**Figure 3.** SpMV execution times under weak and strong scalability scenarios. [PITH_FULL_IMAGE:figures/full_fig_p013_3.png]

**Figure 4.** Dynamic energy consumption breakdown of the SpMV computa... [PITH_FULL_IMAGE:figures/full_fig_p014_4.png]

**Figure 5.** GPU power peak of the SpMV computation under weak and strong... [PITH_FULL_IMAGE:figures/full_fig_p015_5.png]

**Figure 6.** Dynamic energy consumption per DOF breakdown of the SpMV... [PITH_FULL_IMAGE:figures/full_fig_p016_6.png]

**Figure 7.** Un-preconditioned CG execution times under weak and strong... [PITH_FULL_IMAGE:figures/full_fig_p017_7.png]

**Figure 8.** Dynamic energy consumption per iteration breakdown of the un... [PITH_FULL_IMAGE:figures/full_fig_p018_8.png]

**Figure 9.** Dynamic energy consumption per DOF breakdown of the un... [PITH_FULL_IMAGE:figures/full_fig_p019_9.png]

**Figure 10.** GPU power peak of the CG computation under weak and strong... [PITH_FULL_IMAGE:figures/full_fig_p020_10.png]

**Figure 11.** Execution times breakdown of the PCG method of solve and... [PITH_FULL_IMAGE:figures/full_fig_p021_11.png]

**Figure 12.** Solve time per iteration of the PCG method under weak and... [PITH_FULL_IMAGE:figures/full_fig_p022_12.png]

**Figure 13.** Dynamic energy consumption breakdown of the PCG computa... [PITH_FULL_IMAGE:figures/full_fig_p023_13.png]

**Figure 14.** Dynamic energy consumption per DOF breakdown of the PCG... [PITH_FULL_IMAGE:figures/full_fig_p024_14.png]

**Figure 15.** Dynamic energy consumption per iteration breakdown of the PCG... [PITH_FULL_IMAGE:figures/full_fig_p025_15.png]

**Figure 16.** GPU power peak of the PCG computation under weak and strong... [PITH_FULL_IMAGE:figures/full_fig_p026_16.png]

Original abstract

We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific applications. Our primary development objective was to maximize parallel performance and scalability in solving sparse linear systems whose dimensions far exceed the memory capacity of a single node. To this end, we devised methods that expose a high degree of parallelism while optimizing algorithmic implementations for efficient multi-GPU usage. Previous work has already demonstrated the library's performance efficiency on large-scale systems comprising thousands of NVIDIA GPUs, achieving improvements over state-of-the-art solutions. In this paper, we extend those results by providing energy profiles that address the growing sustainability requirements of modern HPC platforms. We present our methodology and tools for accurate runtime energy measurements of the library's core components and discuss the findings. Our results confirm that optimizing GPU computations and minimizing data movement across memory and computing nodes reduces both time-to-solution and energy consumption. Moreover, we show that the library delivers substantial advantages over comparable software frameworks on standard benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adds energy profiles to the authors' prior multi-GPU sparse matrix library and shows that the same optimizations cut both runtime and energy use.

The main point is that this work extends the authors' existing library for large sparse linear systems on multi-GPU clusters by measuring energy consumption alongside performance. They report that minimizing data movement and optimizing GPU parallelism reduces both time-to-solution and energy, with clear advantages over comparable frameworks on standard benchmarks. The performance results on thousands of GPUs were already published in earlier work, so this is mainly an incremental step focused on sustainability metrics, which now carry more weight in HPC.

Referee Report

1 major / 1 minor

Summary. The paper investigates the energy efficiency of a library for parallel sparse matrix computations on multi-GPU clusters. It describes algorithmic optimizations to expose high parallelism and minimize data movement for solving large sparse linear systems exceeding single-node memory capacity. Building on prior performance results, the authors present a methodology and tools for runtime energy measurements of core components, reporting that these optimizations reduce both time-to-solution and energy consumption while delivering advantages over comparable frameworks on standard benchmarks.

Significance. If the energy reductions are substantiated by unbiased and calibrated measurements that properly isolate GPU and system-level consumption, the work would provide valuable empirical evidence for energy-aware design in large-scale HPC sparse linear algebra, addressing sustainability concerns in multi-thousand GPU deployments.

major comments (1)

[Methodology for energy measurements] The central claim that optimizations reduce energy consumption rests on runtime energy profiles, yet the description of the measurement methodology (referenced in the abstract as addressing 'accurate runtime energy measurements of the library's core components') provides no details on calibration against external meters, quantification of monitoring overhead, or accounting for non-GPU power draw from host CPUs, interconnects, and memory in the multi-node cluster. This omission is load-bearing, as unaccounted bias or incomplete isolation could artificially inflate reported savings.

minor comments (1)

[Abstract] The abstract refers to 'standard benchmarks' without naming them or providing quantitative results, error bars, or exclusion criteria; this should be expanded in the main text for reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for identifying an area where the manuscript can be strengthened. We address the major comment below and will revise the paper to provide greater transparency on the energy measurement approach.

Referee: The central claim that optimizations reduce energy consumption rests on runtime energy profiles, yet the description of the measurement methodology (referenced in the abstract as addressing 'accurate runtime energy measurements of the library's core components') provides no details on calibration against external meters, quantification of monitoring overhead, or accounting for non-GPU power draw from host CPUs, interconnects, and memory in the multi-node cluster. This omission is load-bearing, as unaccounted bias or incomplete isolation could artificially inflate reported savings.

Authors: We agree that a more detailed exposition of the measurement methodology is warranted to support the energy-efficiency claims. The current manuscript outlines the tools employed for runtime profiling of the library components but does not fully elaborate on calibration procedures, overhead assessment, or separation of GPU versus host-system power. In the revised version we will expand the relevant section to include: an explicit description of any calibration steps performed against external meters; a quantitative assessment of monitoring overhead obtained through dedicated experiments; and clarification of how non-GPU contributions (host CPUs, interconnects, memory) were either measured separately or accounted for in the reported figures. These additions will allow readers to evaluate potential biases and will strengthen the empirical basis for the reported energy reductions.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical measurements of energy and performance

The paper presents a methodology for runtime energy measurements on multi-GPU sparse matrix computations and reports benchmark results showing reduced time-to-solution and energy via optimizations and minimized data movement. No derivation chain, first-principles predictions, or fitted parameters are claimed; results rest on direct experimental profiling and comparisons to other frameworks. Prior work is cited only for established performance baselines, not as a load-bearing uniqueness theorem or self-referential definition for the energy claims. The analysis is self-contained against external benchmarks and does not reduce any output to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the accuracy of runtime energy measurement tools and the assumption that benchmark results generalize to production large-scale workloads; no free parameters or invented entities are described.

axioms (1)

Domain assumption: Runtime energy measurement tools provide accurate consumption data for the library's core components without introducing significant overhead.
The paper's methodology and findings depend on these tools delivering reliable profiles.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We present our methodology and tools for accurate runtime energy measurements of the library's core components... optimizing GPU computations and minimizing data movement across memory and computing nodes reduces both time-to-solution and energy consumption."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "BootCMatchGX... Algebraic MultiGrid (AMG) preconditioners... communication-reduction strategies"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1] “Electricity 2025. Analysis and forecast to 2027.” https://www.iea.org/reports/electricity-2025
[2] E. Suarez, H. Bockelmann, N. Eicker, J. Eitzinger, S. El Sayed, T. Fieseler, M. Frank, P. Frech, P. Giesselmann, D. Hackenberg, G. Hager, A. Herten, T. Ilsche, B. Koller, E. Laure, C. Manzano, S. Oeste, M. Ott, K. Reuter, R. Schneider, K. Thust, and B. von St. Vieth, “Energy-aware operation of HPC systems in Germany,” Frontiers in High Performance Comput...
[3] “Green500.” https://top500.org/lists/green500/2024/11/ [Accessed: 13 May 2025]
[4] R. A. Bridges, N. Imam, and T. M. Mintz, “Understanding GPU power: A survey of profiling, modeling, and simulation methods,” ACM Computing Surveys (CSUR), vol. 49, no. 3, pp. 1–27, 2016
[5] H. Anzt, S. Tomov, and J. Dongarra, “On the performance and energy efficiency of sparse linear algebra on GPUs,” The International Journal of High Performance Computing Applications, vol. 31, no. 5, pp. 375–390, 2017
[6] “Top500.” https://top500.org/lists/top500/2024/11/ [Accessed: 13 May 2025]
[7] G. Agosta et al., “Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale: The TEXTAROSSA approach,” Microprocessors and Microsystems, vol. 95, p. 104679, 2022
[8] A. Filgueras et al., “The TEXTAROSSA project: Cool all the way down to the hardware,” in 2024 27th Euromicro Conference on Digital System Design (DSD), pp. 526–533, IEEE, 2024
[9] H. Owen, O. Lehmkuhl, P. D’Ambra, F. Durastante, and S. Filippone, “Alya toward exascale: algorithmic scalability using PSCToolkit,” The Journal of Supercomputing, vol. 80, pp. 13533–13556, 2024
[10] S. Zampini, U. Zerbinati, G. Turkyyiah, and D. Keyes, “PETScML: Second-order solvers for training regression problems in scientific machine learning,” in Proceedings of the Platform for Advanced Scientific Computing Conference, PASC ’24, (New York, NY, USA), Association for Computing Machinery, 2024
[11] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, “Quantifying the energy cost of data movement in scientific applications,” in 2013 IEEE International Symposium on Workload Characterization (IISWC), pp. 56–65, 2013
[12] J. Dongarra, “The evolution of mathematical software,” Commun. ACM, vol. 65, pp. 66–72, Nov. 2022
[13] P. Delestrac, J. Miquel, D. Bhattacharjee, D. Moolchandani, F. Catthoor, L. Torres, and D. Novo, “Analyzing GPU energy consumption in data movement and storage,” in 2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 143–151, 2024
[14] E. Dufrechou, P. Ezzatti, and E. S. Quintana-Ortí, “Selecting optimal SpMV realizations for GPUs via machine learning,” The International Journal of High Performance Computing Applications, vol. 35, no. 3, pp. 254–267, 2021
[15] H. Anzt, M. Castillo, J. C. Fernández, J. Dongarra, and S. Tomov, “Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors,” Computer Science - Research and Development, vol. 27, no. 4, pp. 299–307, 2012
[16] P. Luszczek, A. Abdelfattah, H. Anzt, A. Suzuki, and S. Tomov, “Batched sparse and mixed-precision linear algebra interface for efficient use of GPU hardware accelerators in scientific applications,” Future Generation Computer Systems, vol. 160, pp. 359–374, 2024
[17] K. Świrydowicz, J. Firoz, J. Manzano, M. Halappanavar, S. Thomas, and K. Barker, “A performance and energy study of GPU-resident preconditioners for conjugate gradient solvers: In the context of existing and novel approaches,” in 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 70–80, 2024
[18] P. D’Ambra, S. Filippone, and P. S. Vassilevski, “BootCMatch: A software package for bootstrap AMG based on graph weighted matching,” ACM Trans. Math. Softw., vol. 44, June 2018
[19] M. Bernaschi, P. D’Ambra, and D. Pasquini, “AMG based on compatible weighted matching for GPUs,” Parallel Computing, vol. 92, p. 102599, 2020
[20] M. Bernaschi, P. D’Ambra, and D. Pasquini, “BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,” Software Impacts, vol. 6, p. 100041, 2020
[21] M. Bernaschi, A. Celestini, F. Vella, and P. D’Ambra, “A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,” IEEE Transactions on Parallel & Distributed Systems, vol. 34, pp. 2365–2376, Aug. 2023
[22] M. Bernaschi, M. G. Carrozzo, A. Celestini, G. Piperno, and P. D’Ambra, “Communication-reduced conjugate gradient variants for GPU-accelerated clusters,” in 2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pp. 178–186, 2025
[23] M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Standards, vol. 49, pp. 409–436, 1952
[24] Y. Notay and A. Napov, “A massively parallel solver for discrete Poisson-like problems,” Journal of Computational Physics, vol. 281, pp. 237–250, 2015
[25] A. Chronopoulos and C. Gear, “On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel Computing, vol. 11, no. 1, pp. 37–53, 1989
[26] M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, V. Sellappan, and R. Strzodka, “AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S602–S626, 2015
[27] K. N. Khan, M. Hirki, T. Niemi, J. K. Nurminen, and Z. Ou, “RAPL in action: Experiences in using RAPL for power measurements,” ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), vol. 3, no. 2, pp. 1–26, 2018
[28] J. Treibig, G. Hager, and G. Wellein, “Likwid: A lightweight performance-oriented tool suite for x86 multicore environments,” in 2010 39th International Conference on Parallel Processing Workshops, pp. 207–216, IEEE, 2010
[29] “Likwid.” https://github.com/RRZE-HPC/likwid [Accessed: 31 March 2025]
[30] “NVIDIA management library: NVML API reference guide.” https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference [Accessed: 31 March 2025]
[31] “powerMonitor.” https://github.com/alecel/powerMonitor
[32] “GPowerU.” https://github.com/crrossi/GPowerU
[33] A. Lastovetsky and R. R. Manumachu, “Energy-efficient parallel computing: Challenges to scaling,” Information, vol. 14, no. 4, p. 248, 2023
[34] L. Tan, S. Kothapalli, L. Chen, O. Hussaini, R. Bissiri, and Z. Chen, “A survey of power and energy efficient techniques for high performance numerical linear algebra operations,” Parallel Computing, vol. 40, no. 10, pp. 559–573, 2014
[35] “Ginkgo.” https://ginkgo-project.github.io
[36] H. Anzt, T. Cojean, G. Flegar, F. Göbel, T. Grützmacher, P. Nayak, T. Ribizel, Y. M. Tsai, and E. S. Quintana-Ortí, “Ginkgo: A modern linear operator algebra framework for high performance computing,” ACM Transactions on Mathematical Software, vol. 48, pp. 2:1–2:33, Feb. 2022
[37] M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, et al., “AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S602–S626, 2015
[38] “NVIDIA, Algebraic multigrid solver (AmgX) library version 2.5.0, 2025.” https://github.com/NVIDIA/AMGX
[39] J. Dongarra, M. Heroux, and P. Luszczek, “High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems,” The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 3–10, 2016

[1] [1]

Electricity 2025. Analysis and forecast to 2027

“Electricity 2025. Analysis and forecast to 2027.”https://www.iea. org/reports/electricity-2025

work page 2025

[2] [2]

Energy-aware operation of HPC systems in Germany,

E. Suarez, H. Bockelmann, N. Eicker, J. Eitzinger, S. El Sayed, T. Fieseler, M. Frank, P. Frech, P. Giesselmann, D. Hackenberg, G. Hager, A. Herten, T. Ilsche, B. Koller, E. Laure, C. Manzano, S. Oeste, M. Ott, K. Reuter, R. Schneider, K. Thust, and B. von St. Vi- eth, “Energy-aware operation of HPC systems in Germany,”Frontiers in High Performance Comput...

work page 2025

[3] [3]

Green500

“Green500.”https://top500.org/lists/green500/2024/11/[Ac- cessed: (13 May 2025)]

work page 2024

[4] [4]

Understanding GPU power: A survey of profiling, modeling, and simulation methods,

R. A. Bridges, N. Imam, and T. M. Mintz, “Understanding GPU power: A survey of profiling, modeling, and simulation methods,”ACM Com- puting Surveys (CSUR), vol. 49, no. 3, pp. 1–27, 2016

work page 2016

[5] [5]

On the performance and energy efficiency of sparse linear algebra on GPUs,

H. Anzt, S. Tomov, and J. Dongarra, “On the performance and energy efficiency of sparse linear algebra on GPUs,”The International Journal of High Performance Computing Applications, vol. 31, no. 5, pp. 375– 390, 2017

work page 2017

[6] [6]

“Top500.”https://top500.org/lists/top500/2024/11/[Accessed: (13 May 2025)]. 25 1 2 4 8 16 32 64 #GPUs 170 175 180 185 190 195GPU power peak (W) weak scaling: 3703 dofs/GPU, up to 3.2 billion dofs AMGX BootCMatchGX (a) 7-points stencil matrix with 370 3 DOFs per GPU under weak scalability. 1 2 4 8 16 32 64 #GPUs 100 120 140 160 180GPU power peak (W) strong...

work page 2024

[7] [7]

Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale: The TEXTAROSSA approach,

G. Agostaet al., “Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale: The TEXTAROSSA approach,”Microprocessors and Microsystems, vol. 95, p. 104679, 2022

work page 2022

[8] [8]

The TEXTAROSSA project: Cool all the way down to the hardware,

A. Filgueraset al., “The TEXTAROSSA project: Cool all the way down to the hardware,” in2024 27th Euromicro Conference on Digital System Design (DSD), pp. 526–533, IEEE, 2024

work page 2024

[9] [9]

Alya toward exascale: algorithmic scalability using PSCToolkit,

H. Owen, O. Lehmkuhl, P. D’Ambra, F. Durastante, and S. Filippone, “Alya toward exascale: algorithmic scalability using PSCToolkit,”The Journal of Supercomputing, vol. 80, pp. 13533–13556, 2024

work page 2024

[10] [10]

PETScML: Second-order solvers for training regression problems in scientific ma- chine learning,

S. Zampini, U. Zerbinati, G. Turkyyiah, and D. Keyes, “PETScML: Second-order solvers for training regression problems in scientific ma- chine learning,” inProceedings of the Platform for Advanced Scientific Computing Conference, PASC ’24, (New York, NY, USA), Association for Computing Machinery, 2024

work page 2024

[11] [11]

Quantifying the energy cost of data movement in scientific applications,

G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, “Quantifying the energy cost of data movement in scientific applications,” in2013 IEEE International Symposium on Workload Characterization (IISWC), pp. 56–65, 2013

work page 2013

[12] [12]

The evolution of mathematical software,

J. Dongarra, “The evolution of mathematical software,”Commun. ACM, vol. 65, p. 66–72, nov 2022. 26

work page 2022

[13] [13]

Analyzing GPU energy consump- tion in data movement and storage,

P. Delestrac, J. Miquel, D. Bhattacharjee, D. Moolchandani, F. Catthoor, L. Torres, and D. Novo, “Analyzing GPU energy consump- tion in data movement and storage,” in2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Proces- sors (ASAP), pp. 143–151, 2024

work page 2024

[14] [14]

Selecting optimal SpMV realizations for GPUs via machine learning,

E. Dufrechou, P. Ezzatti, and E. S. Quintana-Ort´ ı, “Selecting optimal SpMV realizations for GPUs via machine learning,”The International Journal of High Performance Computing Applications, vol. 35, no. 3, pp. 254–267, 2021

work page 2021

[15] [15]

Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors,

H. Anzt, M. Castillo, J. C. Fern´ andez, J. Dongarra, and S. Tomov, “Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors,”Computer Science - Research and Development, vol. 27, no. 4, pp. 299–307, 2012

work page 2012

[16] [16]

Batched sparse and mixed-precision linear algebra interface for effi- cient use of GPU hardware accelerators in scientific applications,

P. Luszczek, A. Abdelfattah, H. Anzt, A. Suzuki, and S. Tomov, “Batched sparse and mixed-precision linear algebra interface for effi- cient use of GPU hardware accelerators in scientific applications,”Fu- ture Generation Computer Systems, vol. 160, pp. 359–374, 2024

work page 2024

[17] [17]

A performance and energy study of GPU-resident pre- conditioners for conjugate gradient solvers: In the context of existing and novel approaches,

K. ´Swirydowicz, J. Firoz, J. Manzano, M. Halappanavar, S. Thomas, and K. Barker, “A performance and energy study of GPU-resident pre- conditioners for conjugate gradient solvers: In the context of existing and novel approaches,” in2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC- PAD), pp. 70–80, 2024

work page 2024

[18] [18]

BootCMatch: A soft- ware package for bootstrap AMG based on graph weighted matching,

P. D’Ambra, S. Filippone, and P. S. Vassilevski, “BootCMatch: A soft- ware package for bootstrap AMG based on graph weighted matching,” ACM Trans. Math. Softw., vol. 44, June 2018

work page 2018

[19] [19]

AMG based on compatible weighted matching for GPUs,

M. Bernaschi, P. D’Ambra, and D. Pasquini, “AMG based on compatible weighted matching for GPUs,”Parallel Computing, vol. 92, p. 102599, 2020

[20] M. Bernaschi, P. D’Ambra, and D. Pasquini, “BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,” Software Impacts, vol. 6, p. 100041, 2020

[21] M. Bernaschi, A. Celestini, F. Vella, and P. D’Ambra, “A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, pp. 2365–2376, Aug. 2023

[22] M. Bernaschi, M. G. Carrozzo, A. Celestini, G. Piperno, and P. D’Ambra, “Communication-reduced conjugate gradient variants for GPU-accelerated clusters,” in 2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pp. 178–186, 2025

[23] M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Standards, vol. 49, pp. 409–436, 1952

[24] Y. Notay and A. Napov, “A massively parallel solver for discrete Poisson-like problems,” Journal of Computational Physics, vol. 281, pp. 237–250, 2015

[25] A. Chronopoulos and C. Gear, “On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel Computing, vol. 11, no. 1, pp. 37–53, 1989

[26] M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, V. Sellappan, and R. Strzodka, “AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S602–S626, 2015

[27] K. N. Khan, M. Hirki, T. Niemi, J. K. Nurminen, and Z. Ou, “RAPL in action: Experiences in using RAPL for power measurements,” ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), vol. 3, no. 2, pp. 1–26, 2018

[28] J. Treibig, G. Hager, and G. Wellein, “Likwid: A lightweight performance-oriented tool suite for x86 multicore environments,” in 2010 39th International Conference on Parallel Processing Workshops, pp. 207–216, IEEE, 2010

[29] “Likwid.” https://github.com/RRZE-HPC/likwid [Accessed: 31 March 2025]

[30] “NVIDIA management library: NVML API reference guide.” https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference [Accessed: 31 March 2025]

[31] “powerMonitor.” https://github.com/alecel/powerMonitor

[32] “GPowerU.” https://github.com/crrossi/GPowerU

[33] A. Lastovetsky and R. R. Manumachu, “Energy-efficient parallel computing: Challenges to scaling,” Information, vol. 14, no. 4, p. 248, 2023

[34] L. Tan, S. Kothapalli, L. Chen, O. Hussaini, R. Bissiri, and Z. Chen, “A survey of power and energy efficient techniques for high performance numerical linear algebra operations,” Parallel Computing, vol. 40, no. 10, pp. 559–573, 2014

[35] “Ginkgo.” https://ginkgo-project.github.io

[36] H. Anzt, T. Cojean, G. Flegar, F. Göbel, T. Grützmacher, P. Nayak, T. Ribizel, Y. M. Tsai, and E. S. Quintana-Ortí, “Ginkgo: A modern linear operator algebra framework for high performance computing,” ACM Transactions on Mathematical Software, vol. 48, pp. 2:1–2:33, Feb. 2022

[37] M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, et al., “AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S602–S626, 2015

[38] “NVIDIA, Algebraic multigrid solver (AmgX) library version 2.5.0, 2025.” https://github.com/NVIDIA/AMGX

[39] J. Dongarra, M. Heroux, and P. Luszczek, “High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems,” The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 3–10, 2016
