Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Alessandro Celestini; Giacomo Piperno; Massimo Bernaschi; Mauro G. Carrozzo; Pasqua D'Ambra

arxiv: 2501.03743 · v1 · submitted 2025-01-07 · 🧮 math.NA · cs.NA

Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Massimo Bernaschi , Mauro G. Carrozzo , Alessandro Celestini , Giacomo Piperno , Pasqua D'Ambra This is my paper

Pith reviewed 2026-05-23 06:24 UTC · model grok-4.3

classification 🧮 math.NA cs.NA

keywords conjugate gradients-step methodsGPU computingcommunication reductionparallel linear solverssparse matrix-vector productPoisson equationpreconditioned iteration

0 comments

The pith

A preconditioned s-step Conjugate Gradient solver reduces global synchronizations and overlaps communication with computation on GPU clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that the s-step CG method can be implemented efficiently on clusters of Nvidia GPU nodes by aggregating low-granularity operations to match GPU throughput and by overlapping data movement with arithmetic in the multi-GPU sparse matrix-vector product. This approach cuts the number of global synchronizations relative to classical CG, which should improve both strong and weak scaling for large sparse systems. A reader would care because linear solvers remain a dominant cost in PDE-based simulations and data-driven scientific workflows. The claim is supported by experiments on Poisson-discretized benchmarks that show the method preserves the expected convergence behavior.

Core claim

The central claim is that a parallel implementation of the preconditioned s-step CG method, originally proposed by Chronopoulos and Gear, fully exploits the aggregation of operations inherent to the s-step formulation to leverage GPU accelerators while also overlapping communication and computation inside the distributed sparse matrix-vector product, thereby reducing global synchronizations and data movement compared with standard CG on GPU-accelerated clusters.

What carries the argument

The s-step Conjugate Gradient iteration, which groups multiple inner products and updates into a single synchronization point to reduce communication frequency.

If this is right

Strong and weak scalability improve on GPU clusters because fewer global reductions are required per iteration.
The same aggregation technique can be applied inside the multi-GPU sparse matrix-vector product to hide latency.
Preconditioned versions remain practical for the linear systems that arise from second-order elliptic PDEs.
The approach is directly usable inside existing scientific computing platforms that already rely on CG.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same communication-reduction pattern could be tested on other Krylov methods such as BiCGStab or GMRES once the s-step formulation is available.
Performance gains may become larger on systems with higher communication-to-computation ratios, such as those with slower interconnects.
Stability questions left open by the Poisson tests could be probed by repeating the experiments on increasingly ill-conditioned matrices.

Load-bearing premise

The s-step variant keeps numerical stability and convergence rates comparable to ordinary CG on the tested Poisson problems without extra stabilization steps.

What would settle it

Running the same Poisson test cases with the s-step implementation and observing either a clear loss of convergence or a sharp rise in iteration count relative to standard CG would falsify the claim.

Figures

Figures reproduced from arXiv: 2501.03743 by Alessandro Celestini, Giacomo Piperno, Massimo Bernaschi, Mauro G. Carrozzo, Pasqua D'Ambra.

**Figure 2.** Figure 2: Strong scalability: solve time per iteration when no preconditioner is [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Scaled Speedup of the solve time when no preconditioner is applied. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Weak Scalability: breakdown of solve time when AMG preconditioner [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Scaled Speedup of the solve time when AMG preconditioner is [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in datadriven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This applies the 1989 s-step CG to GPU clusters with overlap tricks but brings no new algorithm and skips stability validation.

read the letter

The main takeaway is that this paper ports the communication-reduced s-step CG from Chronopoulos and Gear 1989 to Nvidia GPU clusters. The new pieces are the engineering choices: aggregating low-granularity operations to match GPU throughput and overlapping communication with computation inside the multi-GPU sparse matrix-vector product. Those are practical tweaks for reducing synchronizations on distributed GPU nodes, and the abstract positions the Poisson PDE tests as evidence that the approach can improve strong and weak scaling. That part is straightforward and could be useful to people who already run large sparse solvers on similar hardware. The implementation focus is the real content here rather than any mathematical advance. The soft spot is the missing numerical checks. s-step methods are known to lose orthogonality fast in finite precision, which can cause stagnation or divergence, yet the abstract gives no sign that residual histories, iteration counts, or final accuracy were compared to classical CG, and no stabilization is mentioned. If the full paper does not supply those comparisons, the performance claims sit on an untested premise. The work is aimed at engineers and researchers who build or tune iterative solvers for GPU-accelerated PDE simulations and scientific computing codes. A reader looking for concrete GPU implementation patterns might extract value from the overlap and aggregation details. It deserves peer review because the topic is relevant to current hardware trends and the implementation could be reproducible, though any referee would need to see quantitative scaling numbers and at least basic stability evidence before accepting the central claims.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an implementation of the preconditioned s-step Conjugate Gradient method (Chronopoulos-Gear, 1989) for multi-node Nvidia GPU clusters. The central contribution is a parallel solver design that aggregates low-granularity operations to exploit GPU throughput and overlaps communication with computation in the multi-GPU sparse matrix-vector product. Experiments on Poisson PDE discretizations are used to demonstrate potential performance gains in strong and weak scalability.

Significance. If the reported implementation achieves the claimed communication reductions while preserving convergence, it would provide a practical route to improved scalability for sparse linear solvers on GPU-accelerated clusters, addressing a recognized bottleneck in PDE-based simulations. The explicit focus on GPU-specific optimizations such as operation aggregation and SpMV overlap is a concrete engineering contribution that builds directly on the established s-step framework.

major comments (2)

[Numerical experiments] Numerical experiments section: No residual histories, iteration counts, or final accuracy metrics are reported in comparison to classical CG, leaving unaddressed the known finite-precision sensitivity of the monomial basis in s-step CG (rapid loss of orthogonality leading to stagnation). This directly undermines the claim that the GPU implementation 'preserves' the method's behavior for the Poisson benchmarks.
[§3] §3 (Implementation details): The description of the aggregated operations and overlap strategy does not specify whether any reorthogonalization or basis change (e.g., to Newton or Chebyshev) was applied to mitigate rounding-error accumulation; without this, the central performance claim rests on an unverified assumption that the standard Chronopoulos-Gear formulation remains stable at the tested problem sizes.

minor comments (2)

[Abstract] Abstract: 'datadriven' is missing a hyphen; should read 'data-driven'.
[Title] The title refers to 'Variants' (plural) but the text and experiments focus exclusively on the single s-step formulation; clarify scope or adjust title.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the numerical validation and clarify implementation details.

read point-by-point responses

Referee: [Numerical experiments] Numerical experiments section: No residual histories, iteration counts, or final accuracy metrics are reported in comparison to classical CG, leaving unaddressed the known finite-precision sensitivity of the monomial basis in s-step CG (rapid loss of orthogonality leading to stagnation). This directly undermines the claim that the GPU implementation 'preserves' the method's behavior for the Poisson benchmarks.

Authors: We agree that direct comparisons of convergence behavior are necessary to support any implicit assumption of equivalence. The original experiments emphasized performance and scalability on the Poisson benchmarks, but we will add a new subsection with residual histories, iteration counts, and final residual norms comparing the s-step implementation to classical CG. This will document that no stagnation occurred for the tested problem sizes and iteration counts. The revision will be included in the next version of the manuscript. revision: yes
Referee: [§3] §3 (Implementation details): The description of the aggregated operations and overlap strategy does not specify whether any reorthogonalization or basis change (e.g., to Newton or Chebyshev) was applied to mitigate rounding-error accumulation; without this, the central performance claim rests on an unverified assumption that the standard Chronopoulos-Gear formulation remains stable at the tested problem sizes.

Authors: The implementation uses the standard monomial basis from the original Chronopoulos-Gear formulation without reorthogonalization or basis change. For the Poisson discretizations and problem sizes reported, the method achieved the expected residuals without observable instability. To address the concern, we will revise §3 to explicitly state that the standard formulation was used and add a short discussion of the stability conditions observed in our experiments. No changes to the code or method are required. revision: yes

Circularity Check

0 steps flagged

No circularity; implementation of externally cited algorithm with no derivation chain

full rationale

The paper implements the s-step CG method originally proposed by Chronopoulos and Gear (1989) for GPU clusters, focusing on communication reduction and overlap in SpMV. It cites the prior algorithm without deriving it or claiming new first-principles results. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. Experiments report performance on Poisson benchmarks but do not present 'predictions' derived from the implementation itself. This is a standard engineering report relying on external literature; the derivation chain is absent by design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the established mathematical properties of the Conjugate Gradient method and its s-step variant; no new free parameters, axioms, or invented entities are introduced beyond standard numerical linear algebra assumptions.

axioms (1)

standard math The s-step CG algorithm maintains the mathematical equivalence and convergence properties of standard CG under exact arithmetic.
Invoked implicitly by building on the 1989 Chronopoulos and Gear method without re-derivation.

pith-pipeline@v0.9.0 · 5740 in / 1199 out tokens · 58174 ms · 2026-05-23T06:24:26.227892+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,

A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, and S. Filip- pone, “SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,” Comput. Math. Appl. , vol. 70, no. 11, pp. 2688–2700, 2015

work page 2015
[2]

Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser

Y . Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser. Other Titles in Applied Mathematics. SIAM, 2003. [Online]. Available: http://www-users.cs.umn.edu/∼{}saad/IterMethBook 2ndEd.pdf

work page 2003
[3]

High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,

J. Dongarra, M. Heroux, and P. Luszczek, “High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,” The International Journal of High Performance Com- puting Applications, vol. 30, no. 1, pp. 3–10, 2016

work page 2016
[4]

The evolution of mathematical software,

J. Dongarra, “The evolution of mathematical software,” Commun. ACM, vol. 65, no. 12, p. 66–72, nov 2022. [Online]. Available: https://doi.org/10.1145/3554977

work page doi:10.1145/3554977 2022
[5]

Methods of conjugate gradients for solving linear systems,

M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Stan- dards, vol. 49, pp. 409–436, 1952

work page 1952
[6]

The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,

E. Carson, M. Rozlo ˇzn´ık, Z. Strako ˇs, P. Tich ´y, and M. T ˚uma, “The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,” SIAM Journal on Scientific Computing, vol. 40, no. 5, pp. A3549–A3580, 2018. [Online]. Available: https://doi.org/10.1137/16M1103361

work page doi:10.1137/16m1103361 2018
[7]

On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,

A. Greenbaum, H. Liu, and T. Chen, “On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,” SIAM Journal on Scientific Computing , vol. 43, no. 5, pp. S496–S515,

work page
[8]

Available: https://doi.org/10.1137/20M1346249

[Online]. Available: https://doi.org/10.1137/20M1346249

work page doi:10.1137/20m1346249
[9]

Practical use of polynomial preconditionings for the conjugate gradient method,

Y . Saad, “Practical use of polynomial preconditionings for the conjugate gradient method,” SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 4, pp. 865–881, 1985. [Online]. Available: https://doi.org/10.1137/0906059

work page doi:10.1137/0906059 1985
[10]

Multitasking the conjugate gradient method on the CRAY X-MP/48,

G. Meurant, “Multitasking the conjugate gradient method on the CRAY X-MP/48,” Parallel Computing , vol. 5, no. 3, pp. 267–280, 1987. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0167819187900378

work page 1987
[11]

Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,

E. D’Azevedo, V . Eijkhout, and C. Romine, “Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,” ORNL TM/12192, Tech. Rep., 1992

work page 1992
[12]

A massively parallel solver for discrete Poisson-like problems,

Y . Notay and A. Napov, “A massively parallel solver for discrete Poisson-like problems,” Journal of Computational Physics , vol. 281, pp. 237–250, 2015. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0021999114007256

work page 2015
[13]

AMG based on compatible weighted matching for GPUs,

M. Bernaschi, P. D’Ambra, and D. Pasquini, “AMG based on compatible weighted matching for GPUs,” Parallel Computing, vol. 92, p. 102599, 2020. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0167819119301905

work page 2020
[14]

BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,

——, “BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,” Software Impacts, vol. 6, p. 100041, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2665963820300324

work page 2020
[15]

A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,

M. Bernaschi, A. Celestini, F. Vella, and P. D’Ambra, “A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,” IEEE Transactions on Parallel & Distributed Systems, vol. 34, no. 08, pp. 2365–2376, aug 2023

work page 2023
[16]

On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,

A. Chronopoulos and C. Gear, “On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel Computing, vol. 11, no. 1, pp. 37– 53, 1989. [Online]. Available: https://www.sciencedirect.com/science/ article/pii/0167819189900628

work page arXiv 1989
[17]

s-step iterative methods for symmetric linear systems,

——, “s-step iterative methods for symmetric linear systems,” J. Comput. Appl. Math. , vol. 25, no. 2, pp. 153–168, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0377042789900459 9

work page 1989
[18]

Communication lower bounds and optimal algorithms for numerical linear algebra,

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, p. 1–155, 2014

work page 2014
[19]

Communication-avoiding Krylov subspace methods in the- ory and practice,

E. Carson, “Communication-avoiding Krylov subspace methods in the- ory and practice,” PhD thesis, University of California, Berkeley, CA, 2015

work page 2015
[20]

Mixed precision s-step lanczos and conjugate gradient algorithms,

E. Carson, T. Gergelits, and I. Yamazaki, “Mixed precision s-step lanczos and conjugate gradient algorithms,” Numerical Linear Algebra with Applications , vol. 29, no. 3, p. e2425, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2425

work page doi:10.1002/nla.2425 2022
[21]

Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,

P. Ghysels and W. Vanroose, “Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,” Parallel Computing, vol. 40, no. 7, pp. 224–238, 2014, 7th Workshop on Parallel Matrix Algorithms and Applications. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S0167819113000719

work page 2014
[22]

Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,

M. Tiwari and S. Vadhiyar, “Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,” in 2021 IEEE International Conference on Cluster Computing (CLUSTER) , 2021, pp. 215–225

work page 2021
[23]

Communication avoiding ILU0 preconditioner,

L. Grigori and S. Moufawad, “Communication avoiding ILU0 preconditioner,” SIAM Journal on Scientific Computing , vol. 37, no. 2, pp. C217–C246, 2015. [Online]. Available: https://doi.org/10.1137/ 130930376

work page 2015
[24]

In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

I. Yamazaki, S. Rajamanickam, E. Boman, M. Hoemmen, M. Heroux, and S. Tomov, “Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , ser. SC ’14. IEEE Press, 2014, p. 933–944. [Online]. Ava...

work page doi:10.1109/sc.2014.81 2014
[25]

Parallel performance of algebraic multigrid domain decomposition,

W. Mitchell, R. Strzodka, and R. Falgout, “Parallel performance of algebraic multigrid domain decomposition,” Numerical Linear Algebra with Applications , vol. 28, no. 3, p. e2342, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2342

work page doi:10.1002/nla.2342 2021
[26]

Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,

A. Mayumi, Y . Idomura, T. Ina, S. Yamada, and T. Imamura, “Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,” in 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2016, pp. 17–24

work page 2016

[1] [1]

SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,

A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, and S. Filip- pone, “SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,” Comput. Math. Appl. , vol. 70, no. 11, pp. 2688–2700, 2015

work page 2015

[2] [2]

Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser

Y . Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser. Other Titles in Applied Mathematics. SIAM, 2003. [Online]. Available: http://www-users.cs.umn.edu/∼{}saad/IterMethBook 2ndEd.pdf

work page 2003

[3] [3]

High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,

J. Dongarra, M. Heroux, and P. Luszczek, “High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,” The International Journal of High Performance Com- puting Applications, vol. 30, no. 1, pp. 3–10, 2016

work page 2016

[4] [4]

The evolution of mathematical software,

J. Dongarra, “The evolution of mathematical software,” Commun. ACM, vol. 65, no. 12, p. 66–72, nov 2022. [Online]. Available: https://doi.org/10.1145/3554977

work page doi:10.1145/3554977 2022

[5] [5]

Methods of conjugate gradients for solving linear systems,

M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Stan- dards, vol. 49, pp. 409–436, 1952

work page 1952

[6] [6]

The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,

E. Carson, M. Rozlo ˇzn´ık, Z. Strako ˇs, P. Tich ´y, and M. T ˚uma, “The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,” SIAM Journal on Scientific Computing, vol. 40, no. 5, pp. A3549–A3580, 2018. [Online]. Available: https://doi.org/10.1137/16M1103361

work page doi:10.1137/16m1103361 2018

[7] [7]

On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,

A. Greenbaum, H. Liu, and T. Chen, “On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,” SIAM Journal on Scientific Computing , vol. 43, no. 5, pp. S496–S515,

work page

[8] [8]

Available: https://doi.org/10.1137/20M1346249

[Online]. Available: https://doi.org/10.1137/20M1346249

work page doi:10.1137/20m1346249

[9] [9]

Practical use of polynomial preconditionings for the conjugate gradient method,

Y . Saad, “Practical use of polynomial preconditionings for the conjugate gradient method,” SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 4, pp. 865–881, 1985. [Online]. Available: https://doi.org/10.1137/0906059

work page doi:10.1137/0906059 1985

[10] [10]

Multitasking the conjugate gradient method on the CRAY X-MP/48,

G. Meurant, “Multitasking the conjugate gradient method on the CRAY X-MP/48,” Parallel Computing , vol. 5, no. 3, pp. 267–280, 1987. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0167819187900378

work page 1987

[11] [11]

Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,

E. D’Azevedo, V . Eijkhout, and C. Romine, “Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,” ORNL TM/12192, Tech. Rep., 1992

work page 1992

[12] [12]

A massively parallel solver for discrete Poisson-like problems,

Y . Notay and A. Napov, “A massively parallel solver for discrete Poisson-like problems,” Journal of Computational Physics , vol. 281, pp. 237–250, 2015. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0021999114007256

work page 2015

[13] [13]

AMG based on compatible weighted matching for GPUs,

M. Bernaschi, P. D’Ambra, and D. Pasquini, “AMG based on compatible weighted matching for GPUs,” Parallel Computing, vol. 92, p. 102599, 2020. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0167819119301905

work page 2020

[14] [14]

BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,

——, “BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,” Software Impacts, vol. 6, p. 100041, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2665963820300324

work page 2020

[15] [15]

A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,

M. Bernaschi, A. Celestini, F. Vella, and P. D’Ambra, “A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,” IEEE Transactions on Parallel & Distributed Systems, vol. 34, no. 08, pp. 2365–2376, aug 2023

work page 2023

[16] [16]

On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,

A. Chronopoulos and C. Gear, “On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel Computing, vol. 11, no. 1, pp. 37– 53, 1989. [Online]. Available: https://www.sciencedirect.com/science/ article/pii/0167819189900628

work page arXiv 1989

[17] [17]

s-step iterative methods for symmetric linear systems,

——, “s-step iterative methods for symmetric linear systems,” J. Comput. Appl. Math. , vol. 25, no. 2, pp. 153–168, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0377042789900459 9

work page 1989

[18] [18]

Communication lower bounds and optimal algorithms for numerical linear algebra,

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, p. 1–155, 2014

work page 2014

[19] [19]

Communication-avoiding Krylov subspace methods in the- ory and practice,

E. Carson, “Communication-avoiding Krylov subspace methods in the- ory and practice,” PhD thesis, University of California, Berkeley, CA, 2015

work page 2015

[20] [20]

Mixed precision s-step lanczos and conjugate gradient algorithms,

E. Carson, T. Gergelits, and I. Yamazaki, “Mixed precision s-step lanczos and conjugate gradient algorithms,” Numerical Linear Algebra with Applications , vol. 29, no. 3, p. e2425, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2425

work page doi:10.1002/nla.2425 2022

[21] [21]

Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,

P. Ghysels and W. Vanroose, “Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,” Parallel Computing, vol. 40, no. 7, pp. 224–238, 2014, 7th Workshop on Parallel Matrix Algorithms and Applications. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S0167819113000719

work page 2014

[22] [22]

Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,

M. Tiwari and S. Vadhiyar, “Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,” in 2021 IEEE International Conference on Cluster Computing (CLUSTER) , 2021, pp. 215–225

work page 2021

[23] [23]

Communication avoiding ILU0 preconditioner,

L. Grigori and S. Moufawad, “Communication avoiding ILU0 preconditioner,” SIAM Journal on Scientific Computing , vol. 37, no. 2, pp. C217–C246, 2015. [Online]. Available: https://doi.org/10.1137/ 130930376

work page 2015

[24] [24]

In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

I. Yamazaki, S. Rajamanickam, E. Boman, M. Hoemmen, M. Heroux, and S. Tomov, “Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , ser. SC ’14. IEEE Press, 2014, p. 933–944. [Online]. Ava...

work page doi:10.1109/sc.2014.81 2014

[25] [25]

Parallel performance of algebraic multigrid domain decomposition,

W. Mitchell, R. Strzodka, and R. Falgout, “Parallel performance of algebraic multigrid domain decomposition,” Numerical Linear Algebra with Applications , vol. 28, no. 3, p. e2342, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2342

work page doi:10.1002/nla.2342 2021

[26] [26]

Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,

A. Mayumi, Y . Idomura, T. Ina, S. Yamada, and T. Imamura, “Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,” in 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2016, pp. 17–24

work page 2016