Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters
Pith reviewed 2026-05-23 06:24 UTC · model grok-4.3
The pith
A preconditioned s-step Conjugate Gradient solver reduces global synchronizations and overlaps communication with computation on GPU clusters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a parallel implementation of the preconditioned s-step CG method, originally proposed by Chronopoulos and Gear, fully exploits the aggregation of operations inherent to the s-step formulation to leverage GPU accelerators while also overlapping communication and computation inside the distributed sparse matrix-vector product, thereby reducing global synchronizations and data movement compared with standard CG on GPU-accelerated clusters.
What carries the argument
The s-step Conjugate Gradient iteration, which groups multiple inner products and updates into a single synchronization point to reduce communication frequency.
If this is right
- Strong and weak scalability improve on GPU clusters because fewer global reductions are required per iteration.
- The same aggregation technique can be applied inside the multi-GPU sparse matrix-vector product to hide latency.
- Preconditioned versions remain practical for the linear systems that arise from second-order elliptic PDEs.
- The approach is directly usable inside existing scientific computing platforms that already rely on CG.
Where Pith is reading between the lines
- The same communication-reduction pattern could be tested on other Krylov methods such as BiCGStab or GMRES once the s-step formulation is available.
- Performance gains may become larger on systems with higher communication-to-computation ratios, such as those with slower interconnects.
- Stability questions left open by the Poisson tests could be probed by repeating the experiments on increasingly ill-conditioned matrices.
Load-bearing premise
The s-step variant keeps numerical stability and convergence rates comparable to ordinary CG on the tested Poisson problems without extra stabilization steps.
What would settle it
Running the same Poisson test cases with the s-step implementation and observing either a clear loss of convergence or a sharp rise in iteration count relative to standard CG would falsify the claim.
Figures
read the original abstract
Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in datadriven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an implementation of the preconditioned s-step Conjugate Gradient method (Chronopoulos-Gear, 1989) for multi-node Nvidia GPU clusters. The central contribution is a parallel solver design that aggregates low-granularity operations to exploit GPU throughput and overlaps communication with computation in the multi-GPU sparse matrix-vector product. Experiments on Poisson PDE discretizations are used to demonstrate potential performance gains in strong and weak scalability.
Significance. If the reported implementation achieves the claimed communication reductions while preserving convergence, it would provide a practical route to improved scalability for sparse linear solvers on GPU-accelerated clusters, addressing a recognized bottleneck in PDE-based simulations. The explicit focus on GPU-specific optimizations such as operation aggregation and SpMV overlap is a concrete engineering contribution that builds directly on the established s-step framework.
major comments (2)
- [Numerical experiments] Numerical experiments section: No residual histories, iteration counts, or final accuracy metrics are reported in comparison to classical CG, leaving unaddressed the known finite-precision sensitivity of the monomial basis in s-step CG (rapid loss of orthogonality leading to stagnation). This directly undermines the claim that the GPU implementation 'preserves' the method's behavior for the Poisson benchmarks.
- [§3] §3 (Implementation details): The description of the aggregated operations and overlap strategy does not specify whether any reorthogonalization or basis change (e.g., to Newton or Chebyshev) was applied to mitigate rounding-error accumulation; without this, the central performance claim rests on an unverified assumption that the standard Chronopoulos-Gear formulation remains stable at the tested problem sizes.
minor comments (2)
- [Abstract] Abstract: 'datadriven' is missing a hyphen; should read 'data-driven'.
- [Title] The title refers to 'Variants' (plural) but the text and experiments focus exclusively on the single s-step formulation; clarify scope or adjust title.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the numerical validation and clarify implementation details.
read point-by-point responses
-
Referee: [Numerical experiments] Numerical experiments section: No residual histories, iteration counts, or final accuracy metrics are reported in comparison to classical CG, leaving unaddressed the known finite-precision sensitivity of the monomial basis in s-step CG (rapid loss of orthogonality leading to stagnation). This directly undermines the claim that the GPU implementation 'preserves' the method's behavior for the Poisson benchmarks.
Authors: We agree that direct comparisons of convergence behavior are necessary to support any implicit assumption of equivalence. The original experiments emphasized performance and scalability on the Poisson benchmarks, but we will add a new subsection with residual histories, iteration counts, and final residual norms comparing the s-step implementation to classical CG. This will document that no stagnation occurred for the tested problem sizes and iteration counts. The revision will be included in the next version of the manuscript. revision: yes
-
Referee: [§3] §3 (Implementation details): The description of the aggregated operations and overlap strategy does not specify whether any reorthogonalization or basis change (e.g., to Newton or Chebyshev) was applied to mitigate rounding-error accumulation; without this, the central performance claim rests on an unverified assumption that the standard Chronopoulos-Gear formulation remains stable at the tested problem sizes.
Authors: The implementation uses the standard monomial basis from the original Chronopoulos-Gear formulation without reorthogonalization or basis change. For the Poisson discretizations and problem sizes reported, the method achieved the expected residuals without observable instability. To address the concern, we will revise §3 to explicitly state that the standard formulation was used and add a short discussion of the stability conditions observed in our experiments. No changes to the code or method are required. revision: yes
Circularity Check
No circularity; implementation of externally cited algorithm with no derivation chain
full rationale
The paper implements the s-step CG method originally proposed by Chronopoulos and Gear (1989) for GPU clusters, focusing on communication reduction and overlap in SpMV. It cites the prior algorithm without deriving it or claiming new first-principles results. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. Experiments report performance on Poisson benchmarks but do not present 'predictions' derived from the implementation itself. This is a standard engineering report relying on external literature; the derivation chain is absent by design.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math The s-step CG algorithm maintains the mathematical equivalence and convergence properties of standard CG under exact arithmetic.
Reference graph
Works this paper leans on
-
[1]
SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,
A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, and S. Filip- pone, “SParC-LES: enabling large eddy simulations with parallel sparse matrix computation tools,” Comput. Math. Appl. , vol. 70, no. 11, pp. 2688–2700, 2015
work page 2015
-
[2]
Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser
Y . Saad, Iterative Methods for Sparse Linear Systems , 2nd ed., ser. Other Titles in Applied Mathematics. SIAM, 2003. [Online]. Available: http://www-users.cs.umn.edu/∼{}saad/IterMethBook 2ndEd.pdf
work page 2003
-
[3]
J. Dongarra, M. Heroux, and P. Luszczek, “High-performance conjugate- gradient benchmark: A new metric for ranking high-performance com- puting systems,” The International Journal of High Performance Com- puting Applications, vol. 30, no. 1, pp. 3–10, 2016
work page 2016
-
[4]
The evolution of mathematical software,
J. Dongarra, “The evolution of mathematical software,” Commun. ACM, vol. 65, no. 12, p. 66–72, nov 2022. [Online]. Available: https://doi.org/10.1145/3554977
-
[5]
Methods of conjugate gradients for solving linear systems,
M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Stan- dards, vol. 49, pp. 409–436, 1952
work page 1952
-
[6]
E. Carson, M. Rozlo ˇzn´ık, Z. Strako ˇs, P. Tich ´y, and M. T ˚uma, “The numerical stability analysis of pipelined conjugate gradient methods: Historical context and methodology,” SIAM Journal on Scientific Computing, vol. 40, no. 5, pp. A3549–A3580, 2018. [Online]. Available: https://doi.org/10.1137/16M1103361
-
[7]
A. Greenbaum, H. Liu, and T. Chen, “On the convergence rate of variants of the conjugate gradient algorithm in finite precision arithmetic,” SIAM Journal on Scientific Computing , vol. 43, no. 5, pp. S496–S515,
-
[8]
Available: https://doi.org/10.1137/20M1346249
[Online]. Available: https://doi.org/10.1137/20M1346249
-
[9]
Practical use of polynomial preconditionings for the conjugate gradient method,
Y . Saad, “Practical use of polynomial preconditionings for the conjugate gradient method,” SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 4, pp. 865–881, 1985. [Online]. Available: https://doi.org/10.1137/0906059
-
[10]
Multitasking the conjugate gradient method on the CRAY X-MP/48,
G. Meurant, “Multitasking the conjugate gradient method on the CRAY X-MP/48,” Parallel Computing , vol. 5, no. 3, pp. 267–280, 1987. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0167819187900378
work page 1987
-
[11]
E. D’Azevedo, V . Eijkhout, and C. Romine, “Reducing communication costs in the conjugate gradient algorithm on distributed memory multi- processors,” ORNL TM/12192, Tech. Rep., 1992
work page 1992
-
[12]
A massively parallel solver for discrete Poisson-like problems,
Y . Notay and A. Napov, “A massively parallel solver for discrete Poisson-like problems,” Journal of Computational Physics , vol. 281, pp. 237–250, 2015. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0021999114007256
work page 2015
-
[13]
AMG based on compatible weighted matching for GPUs,
M. Bernaschi, P. D’Ambra, and D. Pasquini, “AMG based on compatible weighted matching for GPUs,” Parallel Computing, vol. 92, p. 102599, 2020. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0167819119301905
work page 2020
-
[14]
BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,
——, “BootCMatchG: An adaptive algebraic multigrid linear solver for GPUs,” Software Impacts, vol. 6, p. 100041, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2665963820300324
work page 2020
-
[15]
A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,
M. Bernaschi, A. Celestini, F. Vella, and P. D’Ambra, “A multi-GPU aggregation-based AMG preconditioner for iterative linear solvers,” IEEE Transactions on Parallel & Distributed Systems, vol. 34, no. 08, pp. 2365–2376, aug 2023
work page 2023
-
[16]
A. Chronopoulos and C. Gear, “On the efficient implementation of preconditioned s-step conjugate gradient methods on multiprocessors with memory hierarchy,” Parallel Computing, vol. 11, no. 1, pp. 37– 53, 1989. [Online]. Available: https://www.sciencedirect.com/science/ article/pii/0167819189900628
-
[17]
s-step iterative methods for symmetric linear systems,
——, “s-step iterative methods for symmetric linear systems,” J. Comput. Appl. Math. , vol. 25, no. 2, pp. 153–168, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ 0377042789900459 9
work page 1989
-
[18]
Communication lower bounds and optimal algorithms for numerical linear algebra,
G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz, “Communication lower bounds and optimal algorithms for numerical linear algebra,” Acta Numerica, vol. 23, p. 1–155, 2014
work page 2014
-
[19]
Communication-avoiding Krylov subspace methods in the- ory and practice,
E. Carson, “Communication-avoiding Krylov subspace methods in the- ory and practice,” PhD thesis, University of California, Berkeley, CA, 2015
work page 2015
-
[20]
Mixed precision s-step lanczos and conjugate gradient algorithms,
E. Carson, T. Gergelits, and I. Yamazaki, “Mixed precision s-step lanczos and conjugate gradient algorithms,” Numerical Linear Algebra with Applications , vol. 29, no. 3, p. e2425, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2425
-
[21]
Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,
P. Ghysels and W. Vanroose, “Hiding global synchronization latency in the preconditioned conjugate gradient algorithm,” Parallel Computing, vol. 40, no. 7, pp. 224–238, 2014, 7th Workshop on Parallel Matrix Algorithms and Applications. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S0167819113000719
work page 2014
-
[22]
Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,
M. Tiwari and S. Vadhiyar, “Pipelined preconditioned s-step conjugate gradient methods for distributed memory systems,” in 2021 IEEE International Conference on Cluster Computing (CLUSTER) , 2021, pp. 215–225
work page 2021
-
[23]
Communication avoiding ILU0 preconditioner,
L. Grigori and S. Moufawad, “Communication avoiding ILU0 preconditioner,” SIAM Journal on Scientific Computing , vol. 37, no. 2, pp. C217–C246, 2015. [Online]. Available: https://doi.org/10.1137/ 130930376
work page 2015
-
[24]
I. Yamazaki, S. Rajamanickam, E. Boman, M. Hoemmen, M. Heroux, and S. Tomov, “Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , ser. SC ’14. IEEE Press, 2014, p. 933–944. [Online]. Ava...
-
[25]
Parallel performance of algebraic multigrid domain decomposition,
W. Mitchell, R. Strzodka, and R. Falgout, “Parallel performance of algebraic multigrid domain decomposition,” Numerical Linear Algebra with Applications , vol. 28, no. 3, p. e2342, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2342
-
[26]
A. Mayumi, Y . Idomura, T. Ina, S. Yamada, and T. Imamura, “Left- preconditioned communication-avoiding conjugate gradient methods for multiphase CFD simulations on the K computer,” in 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2016, pp. 17–24
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.