pith. sign in

arxiv: 2501.03743 · v1 · submitted 2025-01-07 · 🧮 math.NA · cs.NA

Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Pith reviewed 2026-05-23 06:24 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords conjugate gradients-step methodsGPU computingcommunication reductionparallel linear solverssparse matrix-vector productPoisson equationpreconditioned iteration
0
0 comments X

The pith

A preconditioned s-step Conjugate Gradient solver reduces global synchronizations and overlaps communication with computation on GPU clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that the s-step CG method can be implemented efficiently on clusters of Nvidia GPU nodes by aggregating low-granularity operations to match GPU throughput and by overlapping data movement with arithmetic in the multi-GPU sparse matrix-vector product. This approach cuts the number of global synchronizations relative to classical CG, which should improve both strong and weak scaling for large sparse systems. A reader would care because linear solvers remain a dominant cost in PDE-based simulations and data-driven scientific workflows. The claim is supported by experiments on Poisson-discretized benchmarks that show the method preserves the expected convergence behavior.

Core claim

The central claim is that a parallel implementation of the preconditioned s-step CG method, originally proposed by Chronopoulos and Gear, fully exploits the aggregation of operations inherent to the s-step formulation to leverage GPU accelerators while also overlapping communication and computation inside the distributed sparse matrix-vector product, thereby reducing global synchronizations and data movement compared with standard CG on GPU-accelerated clusters.

What carries the argument

The s-step Conjugate Gradient iteration, which groups multiple inner products and updates into a single synchronization point to reduce communication frequency.

If this is right

  • Strong and weak scalability improve on GPU clusters because fewer global reductions are required per iteration.
  • The same aggregation technique can be applied inside the multi-GPU sparse matrix-vector product to hide latency.
  • Preconditioned versions remain practical for the linear systems that arise from second-order elliptic PDEs.
  • The approach is directly usable inside existing scientific computing platforms that already rely on CG.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same communication-reduction pattern could be tested on other Krylov methods such as BiCGStab or GMRES once the s-step formulation is available.
  • Performance gains may become larger on systems with higher communication-to-computation ratios, such as those with slower interconnects.
  • Stability questions left open by the Poisson tests could be probed by repeating the experiments on increasingly ill-conditioned matrices.

Load-bearing premise

The s-step variant keeps numerical stability and convergence rates comparable to ordinary CG on the tested Poisson problems without extra stabilization steps.

What would settle it

Running the same Poisson test cases with the s-step implementation and observing either a clear loss of convergence or a sharp rise in iteration count relative to standard CG would falsify the claim.

Figures

Figures reproduced from arXiv: 2501.03743 by Alessandro Celestini, Giacomo Piperno, Massimo Bernaschi, Mauro G. Carrozzo, Pasqua D'Ambra.

Figure 1
Figure 1. Figure 1: Strong scalability: breakdown of solve time when no preconditioner [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Strong scalability: solve time per iteration when no preconditioner is [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Scaled Speedup of the solve time when no preconditioner is applied. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Weak Scalability: breakdown of solve time when AMG preconditioner [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Scaled Speedup of the solve time when AMG preconditioner is [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in datadriven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an implementation of the preconditioned s-step Conjugate Gradient method (Chronopoulos-Gear, 1989) for multi-node Nvidia GPU clusters. The central contribution is a parallel solver design that aggregates low-granularity operations to exploit GPU throughput and overlaps communication with computation in the multi-GPU sparse matrix-vector product. Experiments on Poisson PDE discretizations are used to demonstrate potential performance gains in strong and weak scalability.

Significance. If the reported implementation achieves the claimed communication reductions while preserving convergence, it would provide a practical route to improved scalability for sparse linear solvers on GPU-accelerated clusters, addressing a recognized bottleneck in PDE-based simulations. The explicit focus on GPU-specific optimizations such as operation aggregation and SpMV overlap is a concrete engineering contribution that builds directly on the established s-step framework.

major comments (2)
  1. [Numerical experiments] Numerical experiments section: No residual histories, iteration counts, or final accuracy metrics are reported in comparison to classical CG, leaving unaddressed the known finite-precision sensitivity of the monomial basis in s-step CG (rapid loss of orthogonality leading to stagnation). This directly undermines the claim that the GPU implementation 'preserves' the method's behavior for the Poisson benchmarks.
  2. [§3] §3 (Implementation details): The description of the aggregated operations and overlap strategy does not specify whether any reorthogonalization or basis change (e.g., to Newton or Chebyshev) was applied to mitigate rounding-error accumulation; without this, the central performance claim rests on an unverified assumption that the standard Chronopoulos-Gear formulation remains stable at the tested problem sizes.
minor comments (2)
  1. [Abstract] Abstract: 'datadriven' is missing a hyphen; should read 'data-driven'.
  2. [Title] The title refers to 'Variants' (plural) but the text and experiments focus exclusively on the single s-step formulation; clarify scope or adjust title.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the numerical validation and clarify implementation details.

read point-by-point responses
  1. Referee: [Numerical experiments] Numerical experiments section: No residual histories, iteration counts, or final accuracy metrics are reported in comparison to classical CG, leaving unaddressed the known finite-precision sensitivity of the monomial basis in s-step CG (rapid loss of orthogonality leading to stagnation). This directly undermines the claim that the GPU implementation 'preserves' the method's behavior for the Poisson benchmarks.

    Authors: We agree that direct comparisons of convergence behavior are necessary to support any implicit assumption of equivalence. The original experiments emphasized performance and scalability on the Poisson benchmarks, but we will add a new subsection with residual histories, iteration counts, and final residual norms comparing the s-step implementation to classical CG. This will document that no stagnation occurred for the tested problem sizes and iteration counts. The revision will be included in the next version of the manuscript. revision: yes

  2. Referee: [§3] §3 (Implementation details): The description of the aggregated operations and overlap strategy does not specify whether any reorthogonalization or basis change (e.g., to Newton or Chebyshev) was applied to mitigate rounding-error accumulation; without this, the central performance claim rests on an unverified assumption that the standard Chronopoulos-Gear formulation remains stable at the tested problem sizes.

    Authors: The implementation uses the standard monomial basis from the original Chronopoulos-Gear formulation without reorthogonalization or basis change. For the Poisson discretizations and problem sizes reported, the method achieved the expected residuals without observable instability. To address the concern, we will revise §3 to explicitly state that the standard formulation was used and add a short discussion of the stability conditions observed in our experiments. No changes to the code or method are required. revision: yes

Circularity Check

0 steps flagged

No circularity; implementation of externally cited algorithm with no derivation chain

full rationale

The paper implements the s-step CG method originally proposed by Chronopoulos and Gear (1989) for GPU clusters, focusing on communication reduction and overlap in SpMV. It cites the prior algorithm without deriving it or claiming new first-principles results. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. Experiments report performance on Poisson benchmarks but do not present 'predictions' derived from the implementation itself. This is a standard engineering report relying on external literature; the derivation chain is absent by design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the established mathematical properties of the Conjugate Gradient method and its s-step variant; no new free parameters, axioms, or invented entities are introduced beyond standard numerical linear algebra assumptions.

axioms (1)
  • standard math The s-step CG algorithm maintains the mathematical equivalence and convergence properties of standard CG under exact arithmetic.
    Invoked implicitly by building on the 1989 Chronopoulos and Gear method without re-derivation.

pith-pipeline@v0.9.0 · 5740 in / 1199 out tokens · 58174 ms · 2026-05-23T06:24:26.227892+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,

    A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, and S. Filip- pone, “SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,” Comput. Math. Appl. , vol. 70, no. 11, pp. 2688–2700, 2015

  2. [2]

    Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser

    Y . Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser. Other Titles in Applied Mathematics. SIAM, 2003. [Online]. Available: http://www-users.cs.umn.edu/∼{}saad/IterMethBook 2ndEd.pdf

  3. [3]

    High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,

    J. Dongarra, M. Heroux, and P. Luszczek, “High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,” The International Journal of High Performance Com- puting Applications, vol. 30, no. 1, pp. 3–10, 2016

  4. [4]

    The evolution of mathematical software,

    J. Dongarra, “The evolution of mathematical software,” Commun. ACM, vol. 65, no. 12, p. 66–72, nov 2022. [Online]. Available: https://doi.org/10.1145/3554977

  5. [5]

    Methods of conjugate gradients for solving linear systems,

    M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Stan- dards, vol. 49, pp. 409–436, 1952

  6. [6]

    The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,

    E. Carson, M. Rozlo ˇzn´ık, Z. Strako ˇs, P. Tich ´y, and M. T ˚uma, “The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,” SIAM Journal on Scientific Computing, vol. 40, no. 5, pp. A3549–A3580, 2018. [Online]. Available: https://doi.org/10.1137/16M1103361

  7. [7]

    On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,

    A. Greenbaum, H. Liu, and T. Chen, “On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,” SIAM Journal on Scientific Computing , vol. 43, no. 5, pp. S496–S515,

  8. [8]

    Available: https://doi.org/10.1137/20M1346249

    [Online]. Available: https://doi.org/10.1137/20M1346249

  9. [9]

    Practical use of polynomial preconditionings for the conjugate gradient method,

    Y . Saad, “Practical use of polynomial preconditionings for the conjugate gradient method,” SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 4, pp. 865–881, 1985. [Online]. Available: https://doi.org/10.1137/0906059

  10. [10]

    Multitasking the conjugate gradient method on the CRAY X-MP/48,

    G. Meurant, “Multitasking the conjugate gradient method on the CRAY X-MP/48,” Parallel Computing , vol. 5, no. 3, pp. 267–280, 1987. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0167819187900378

  11. [11]

    Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,

    E. D’Azevedo, V . Eijkhout, and C. Romine, “Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,” ORNL TM/12192, Tech. Rep., 1992

  12. [12]

    A massively parallel solver for discrete Poisson-like problems,

    Y . Notay and A. Napov, “A massively parallel solver for discrete Poisson-like problems,” Journal of Computational Physics , vol. 281, pp. 237–250, 2015. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0021999114007256

  13. [13]

    AMG based on compatible weighted matching for GPUs,

    M. Bernaschi, P. D’Ambra, and D. Pasquini, “AMG based on compatible weighted matching for GPUs,” Parallel Computing, vol. 92, p. 102599, 2020. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0167819119301905

  14. [14]

    BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,

    ——, “BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,” Software Impacts, vol. 6, p. 100041, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2665963820300324

  15. [15]

    A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,

    M. Bernaschi, A. Celestini, F. Vella, and P. D’Ambra, “A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,” IEEE Transactions on Parallel & Distributed Systems, vol. 34, no. 08, pp. 2365–2376, aug 2023

  16. [16]

    On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,

    A. Chronopoulos and C. Gear, “On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel Computing, vol. 11, no. 1, pp. 37– 53, 1989. [Online]. Available: https://www.sciencedirect.com/science/ article/pii/0167819189900628

  17. [17]

    s-step iterative methods for symmetric linear systems,

    ——, “s-step iterative methods for symmetric linear systems,” J. Comput. Appl. Math. , vol. 25, no. 2, pp. 153–168, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0377042789900459 9

  18. [18]

    Communication lower bounds and optimal algorithms for numerical linear algebra,

    G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, p. 1–155, 2014

  19. [19]

    Communication-avoiding Krylov subspace methods in the- ory and practice,

    E. Carson, “Communication-avoiding Krylov subspace methods in the- ory and practice,” PhD thesis, University of California, Berkeley, CA, 2015

  20. [20]

    Mixed precision s-step lanczos and conjugate gradient algorithms,

    E. Carson, T. Gergelits, and I. Yamazaki, “Mixed precision s-step lanczos and conjugate gradient algorithms,” Numerical Linear Algebra with Applications , vol. 29, no. 3, p. e2425, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2425

  21. [21]

    Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,

    P. Ghysels and W. Vanroose, “Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,” Parallel Computing, vol. 40, no. 7, pp. 224–238, 2014, 7th Workshop on Parallel Matrix Algorithms and Applications. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S0167819113000719

  22. [22]

    Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,

    M. Tiwari and S. Vadhiyar, “Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,” in 2021 IEEE International Conference on Cluster Computing (CLUSTER) , 2021, pp. 215–225

  23. [23]

    Communication avoiding ILU0 preconditioner,

    L. Grigori and S. Moufawad, “Communication avoiding ILU0 preconditioner,” SIAM Journal on Scientific Computing , vol. 37, no. 2, pp. C217–C246, 2015. [Online]. Available: https://doi.org/10.1137/ 130930376

  24. [24]

    In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

    I. Yamazaki, S. Rajamanickam, E. Boman, M. Hoemmen, M. Heroux, and S. Tomov, “Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , ser. SC ’14. IEEE Press, 2014, p. 933–944. [Online]. Ava...

  25. [25]

    Parallel performance of algebraic multigrid domain decomposition,

    W. Mitchell, R. Strzodka, and R. Falgout, “Parallel performance of algebraic multigrid domain decomposition,” Numerical Linear Algebra with Applications , vol. 28, no. 3, p. e2342, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2342

  26. [26]

    Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,

    A. Mayumi, Y . Idomura, T. Ina, S. Yamada, and T. Imamura, “Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,” in 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2016, pp. 17–24