Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

Christie Alappat; Georg Hager; Gerhard Wellein; Holger Fehske; Jonas Thies

arxiv: 2309.02228 · v1 · submitted 2023-09-05 · 🧮 math.NA · cs.DC · cs.NA

Pith reviewed 2026-05-24 06:30 UTC · model grok-4.3

keywords: sparse iterative solvers · matrix power kernel · temporal cache blocking · multi-core performance · s-step methods · polynomial preconditioners · algebraic multigrid · performance optimization

The pith

Algebraic temporal blocking of the matrix power kernel speeds up MPK-dominated sparse iterative solvers by up to 3x on multi-core CPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that temporal cache blocking can be applied to the matrix power kernel that evaluates polynomials through repeated sparse matrix-vector products in iterative solvers. This algebraic formulation improves data locality during the kernel without changing the underlying mathematics or the solver's convergence behavior. A sympathetic reader would care because these kernels often account for most of the runtime in large-scale linear systems, so shortening them directly reduces total simulation time on current hardware. The work integrates the blocking into several standard solver types and reports measured gains when the kernel dominates execution.

Core claim

The central claim is that the level-based formulation of sparse matrix-vector multiplication enables temporal cache blocking of the matrix power kernel. When this optimized kernel is used inside preconditioned s-step GMRES, polynomial preconditioners, and algebraic multigrid, the overall solver runtime drops by up to a factor of three on modern multi-core nodes whenever the kernel dominates. Gains shrink when orthogonalization or other phases contribute a moderate share of the runtime, often because those routines remain unoptimized.
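For orientation, the unblocked baseline that the claim refers to can be sketched as a plain sequence of sparse matrix-vector products. This is a minimal illustration in pure Python, not code from the paper or from RACE; the function names `spmv` and `naive_mpk` and the small tridiagonal test matrix are assumptions made for the example.

```python
def spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def naive_mpk(row_ptr, col_idx, values, x, power):
    """Return [x, A x, A^2 x, ..., A^power x] using back-to-back SpMVs.

    Every SpMV sweeps the whole matrix, so a matrix that does not fit in
    cache is streamed from main memory `power` times.
    """
    vectors = [list(x)]
    for _ in range(power):
        vectors.append(spmv(row_ptr, col_idx, values, vectors[-1]))
    return vectors


# Hypothetical test matrix: 4 x 4 tridiagonal (-1, 2, -1) in CSR form.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

powers = naive_mpk(row_ptr, col_idx, values, [1.0, 0.0, 0.0, 0.0], power=2)
print(powers[2])
```

Temporal blocking leaves these results unchanged and only reorders the row updates so that matrix entries are reused across several powers while they are still in cache.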

What carries the argument

Level-based formulation of sparse matrix-vector multiplications that permits temporal cache blocking inside the matrix power kernel.
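The idea behind a level-based formulation can be illustrated with a small serial sketch. It assumes a connected matrix graph with a structurally symmetric sparsity pattern, builds breadth-first levels, and then visits (level, power) pairs in a wavefront order so that each power is computed on a level soon after the previous power was computed on the neighboring levels. All names here (`bfs_levels`, `level_blocked_mpk`, `tridiagonal_csr`) are hypothetical, and the sketch does not attempt to reproduce RACE's parallelization or its cache-aware grouping of levels.

```python
from collections import deque


def tridiagonal_csr(n):
    """Build the n x n tridiagonal (-1, 2, -1) matrix in CSR form."""
    row_ptr, col_idx, values = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


def bfs_levels(row_ptr, col_idx, start=0):
    """Group rows by breadth-first distance from `start`.

    With a symmetric pattern, a row in level i only references columns
    in levels i-1, i, and i+1.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[start] = 0
    queue = deque([start])
    while queue:
        row = queue.popleft()
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            if level[col] < 0:
                level[col] = level[row] + 1
                queue.append(col)
    assert all(lvl >= 0 for lvl in level), "sketch assumes a connected graph"
    groups = [[] for _ in range(max(level) + 1)]
    for row, lvl in enumerate(level):
        groups[lvl].append(row)
    return groups


def level_blocked_mpk(row_ptr, col_idx, values, x, power):
    """Return [x, A x, ..., A^power x], computed in wavefront order."""
    groups = bfs_levels(row_ptr, col_idx)
    vectors = [list(x)] + [[0.0] * len(x) for _ in range(power)]
    for step in range(len(groups) + power - 1):
        for p in range(1, power + 1):
            lvl = step - (p - 1)  # power p trails power p-1 by one level
            if not 0 <= lvl < len(groups):
                continue
            # Power p-1 is already complete on levels lvl-1, lvl, lvl+1.
            for row in groups[lvl]:
                acc = 0.0
                for k in range(row_ptr[row], row_ptr[row + 1]):
                    acc += values[k] * vectors[p - 1][col_idx[k]]
                vectors[p][row] = acc
    return vectors


row_ptr, col_idx, values = tridiagonal_csr(6)
blocked = level_blocked_mpk(row_ptr, col_idx, values, [1.0] * 6, power=3)
print(blocked[3])
```

The wavefront order touches each level `power` times in quick succession instead of once per full matrix sweep, which is what makes cache reuse possible when a few consecutive levels fit in cache.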

If this is right

Up to 3x speedups on modern multi-core compute nodes for MPK-dominated algorithms.
Reduced gains when subspace orthogonalization contributes moderately to runtime.
Successful application of the blocked kernel inside preconditioned s-step GMRES, polynomial preconditioners, and algebraic multigrid.
Demonstration of the optimized solvers inside a real-world large-scale simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving orthogonalization routines would make the reported speedups more consistent across different solver configurations.
The same blocking approach could apply to other iterative methods that rely on explicit matrix-polynomial evaluation.
On hardware with different cache sizes the blocking depth that maximizes performance would likely change and require re-selection.
Solver libraries could expose explicit matrix-power interfaces so that cache-blocking optimizations become easier to apply.

Load-bearing premise

The matrix power kernel must dominate runtime so that optimizing it produces overall gains without other phases becoming new bottlenecks.

What would settle it

Profiling an optimized solver run and finding that the matrix power kernel no longer accounts for the majority of time or that total speedup falls well below 3x because orthogonalization or communication now limits performance.
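The dependence of the solver-level gain on the MPK share of runtime follows Amdahl's law. The sketch below is a generic calculation, not data from the paper: `mpk_fraction` is the share of baseline runtime spent in the kernel, `kernel_speedup` is the factor by which the kernel alone gets faster, and the sample values are purely illustrative.

```python
def solver_speedup(mpk_fraction, kernel_speedup):
    """Overall speedup when only the MPK share of the runtime is accelerated."""
    if not 0.0 <= mpk_fraction <= 1.0:
        raise ValueError("mpk_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - mpk_fraction) + mpk_fraction / kernel_speedup)


def mpk_fraction_after(mpk_fraction, kernel_speedup):
    """Share of the optimized runtime that is still spent in the MPK."""
    optimized = mpk_fraction / kernel_speedup
    return optimized / ((1.0 - mpk_fraction) + optimized)


# Illustrative inputs only: a 3x faster kernel at different MPK shares.
for fraction in (1.0, 0.9, 0.5):
    print(fraction,
          round(solver_speedup(fraction, 3.0), 3),
          round(mpk_fraction_after(fraction, 3.0), 3))
```

Under these assumed numbers, a kernel that takes 90% of the baseline runtime yields a 2.5x solver speedup, while a 50% share yields only 1.5x and leaves the kernel at a quarter of the optimized runtime.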

Original abstract

Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow to evaluate the polynomial using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, or a related operation. Recently we have shown that using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we can perform temporal cache blocking of MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain reduces significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers.
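The abstract's remark about approximating triangular solves by a matrix polynomial can be made concrete with Jacobi sweeps, one well-known way to do this. For a lower-triangular L = D + S with diagonal D and strictly lower part S, the iteration x_{k+1} = D^{-1}(b - S x_k) started from x_0 = D^{-1} b yields a polynomial in D^{-1}S applied to D^{-1} b, and each sweep is a matrix-vector product with S. The sketch below uses small dense lists for readability; the function names and the 3 x 3 example are assumptions for illustration, not the paper's implementation.

```python
def forward_substitution(L, b):
    """Exact sequential solve of L x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        partial = sum(L[i][j] * x[j] for j in range(i))
        x.append((b[i] - partial) / L[i][i])
    return x


def jacobi_lower_solve(L, b, sweeps):
    """Approximate L^{-1} b with `sweeps` Jacobi iterations.

    All rows within a sweep are independent of each other, so a sweep
    parallelizes like a sparse matrix-vector product.
    """
    n = len(b)
    x = [b[i] / L[i][i] for i in range(n)]  # x_0 = D^{-1} b
    for _ in range(sweeps):
        x = [
            (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
            for i in range(n)
        ]
    return x


# Hypothetical 3 x 3 lower-triangular system.
L = [[2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 1.0, 2.0]]
b = [2.0, 4.0, 6.0]

print(jacobi_lower_solve(L, b, sweeps=0))
print(jacobi_lower_solve(L, b, sweeps=2))
print(forward_substitution(L, b))
```

Because D^{-1}S is strictly lower triangular and therefore nilpotent, at most n-1 sweeps reproduce the exact solution of an n x n system; in practice far fewer sweeps are used and the result is treated as an approximation.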

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript integrates the RACE library for algebraic temporal blocking of matrix-power kernels (MPK) into Trilinos and applies it to preconditioned s-step GMRES, polynomial preconditioners, and AMG. It reports empirical speedups of up to 3x on multi-core nodes for MPK-dominated cases, reduced gains when orthogonalization contributes, and a demonstration on a Nalu-Wind wind-turbine simulation.

Significance. If the performance claims are substantiated with phase-resolved timings, the work offers a practical route to accelerate MPK-based solvers that are already used in production codes. The Trilinos integration and end-to-end Nalu-Wind example provide concrete evidence of applicability beyond micro-benchmarks.

major comments (2)

[Section 5] Section 5 (performance results): the headline claim of up to 3x solver speedup for MPK-dominated algorithms is not accompanied by per-phase wall-clock breakdowns (MPK vs. orthogonalization vs. other) for the exact matrix sizes and solver configurations shown in the tables and figures. Without these fractions it is impossible to verify that MPK remains dominant after the optimization, which is required for the solver-level speedup to follow from the kernel improvement.
[Section 4.2] Section 4.2 (Trilinos integration): the description of how the RACE-accelerated MPK replaces the original SpMV sequence inside s-step GMRES and polynomial preconditioners lacks sufficient detail on data-layout changes and synchronization points, making it difficult to assess whether the reported speedups are portable or specific to the tested Trilinos build.
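The per-phase wall-clock breakdown requested in the first major comment could be collected with a small accumulating timer. The sketch below is generic Python, not the instrumentation used in the paper or in Trilinos; the class `PhaseTimer`, the phase names, and the placeholder workloads are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named solver phase."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add the elapsed time even if the timed code raises.
            self.seconds[name] += time.perf_counter() - start

    def fractions(self):
        """Return each phase's share of the total measured time."""
        total = sum(self.seconds.values())
        if total <= 0.0:
            return {}
        return {name: t / total for name, t in self.seconds.items()}


timer = PhaseTimer()
for _ in range(3):  # stand-in for solver iterations
    with timer.phase("mpk"):
        sum(i * i for i in range(200_000))  # placeholder workload
    with timer.phase("orthogonalization"):
        sum(i for i in range(50_000))  # placeholder workload

for name, share in timer.fractions().items():
    print(f"{name}: {share:.1%}")
```

Reporting these fractions before and after the kernel optimization, for the same matrix and solver configuration, would show directly whether the MPK still dominates.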

minor comments (3)

[Figure 3] Figure 3 and Table 2: the legend and caption do not explicitly state whether the reported times include the full solver iteration or only the MPK phase.
[Abstract] Abstract and Section 1: the phrase 'insufficient quality of the orthogonalization routines' is used without a quantitative definition or reference to the specific orthogonalization implementation.
[Section 6] Section 6 (Nalu-Wind): the problem size and number of cores used in the wind-turbine run should be stated explicitly so that the 1.8x overall speedup can be placed in context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and substantiation of the performance claims.

Point-by-point responses

Referee: [Section 5] Section 5 (performance results): the headline claim of up to 3x solver speedup for MPK-dominated algorithms is not accompanied by per-phase wall-clock breakdowns (MPK vs. orthogonalization vs. other) for the exact matrix sizes and solver configurations shown in the tables and figures. Without these fractions it is impossible to verify that MPK remains dominant after the optimization, which is required for the solver-level speedup to follow from the kernel improvement.

Authors: We agree that per-phase breakdowns are necessary to fully substantiate the claims. In the revised manuscript we will add explicit wall-clock time fractions (MPK, orthogonalization, and remaining operations) for the precise matrix sizes, solver parameters, and configurations already shown in the tables and figures of Section 5. These additions will confirm MPK dominance in the cases where the 3x solver-level speedup is reported.
revision: yes

Referee: [Section 4.2] Section 4.2 (Trilinos integration): the description of how the RACE-accelerated MPK replaces the original SpMV sequence inside s-step GMRES and polynomial preconditioners lacks sufficient detail on data-layout changes and synchronization points, making it difficult to assess whether the reported speedups are portable or specific to the tested Trilinos build.

Authors: We will expand Section 4.2 with additional technical detail on the integration. Specifically, we will describe that RACE operates on the existing Trilinos Epetra/Tpetra matrix and vector data layouts without requiring reformatting or copies, and we will enumerate the exact synchronization points (only at the start and end of each MPK call) that are introduced. This clarification will demonstrate that the approach is portable across standard Trilinos builds.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical speedups measured directly from library integration and benchmarks

This is a performance-engineering paper that integrates the existing RACE library into Trilinos and reports wall-clock speedups on concrete test cases (s-step GMRES, polynomial preconditioners, AMG, Nalu-Wind). The central claims rest on measured runtimes, not on any derivation, fitted parameter, or self-citation that reduces to the target result by construction. The abstract and skeptic notes correctly identify that dominance of the MPK phase is an empirical premise, but that premise is external to any circular chain; it is simply a condition under which the measured kernel improvement translates to solver improvement. No equations, uniqueness theorems, or ansatzes are invoked that would trigger the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claims rest on empirical testing under standard assumptions of modern CPU architectures and the dominance of MPK in certain solver phases.

axioms (1)

Domain assumption: cache behavior on multi-core CPUs allows for effective temporal blocking via level-based sparse matrix formulations.
The optimization relies on predictable memory access patterns in the RACE framework.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
cs.DC · 2024-05 · unverdicted · novelty 7.0

Introduces Distributed Level-Blocked MPK combining RACE cache blocking with MPI, reporting substantial speedups up to 4x on 832 cores for matrix power kernels across scientific sparse matrices.

[1] [1]

[Online]

The Trilinos Project Team, The Trilinos Project Website , 2021 (acccessed Aug 6, 2021). [Online]. Available: https://trilinos.github.io

work page 2021

[2] [2]

Preconditioning,

A. J. Wathen, “Preconditioning,” Acta Numerica, vol. 24, p. 329–376, 2015

work page 2015

[3] [3]

A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,

M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, “A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,” SIAM Journal on Scientific Computing , vol. 36, no. 5, pp. C401–C423, 2014. [Online]. Available: https://doi.org/10.1137/130930352

work page doi:10.1137/130930352 2014

[4] [4]

A parallel GMRES version for general sparse matrices,

J. Erhel, “A parallel GMRES version for general sparse matrices,” Electronic Transactions on Numerical Analysis, vol. 3, pp. 160–176, 1995

work page 1995

[5] [5]

s-step iterative methods for symmetric linear systems,

A. Chronopoulos and C. Gear, “s-step iterative methods for symmetric linear systems,” Journal of Computational and Applied Mathematics , vol. 25, no. 2, pp. 153–168, 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0377042789900459

work page arXiv 1989

[6] [6]

s-step iterative methods for (non)symmetric (in)definite linear systems,

A. T. Chronopoulos, “s-step iterative methods for (non)symmetric (in)definite linear systems,” SIAM Journal on Numerical Analysis , vol. 28, no. 6, pp. 1776–1789, 1991. [Online]. Available: https://doi.org/10.1137/0728088

work page doi:10.1137/0728088 1991

[7] [7]

s-step orthomin and gmres implemented on parallel computers,

A. T. Chronopoulos and S. K. Kim, “s-step orthomin and gmres implemented on parallel computers,” 2020. [Online]. Available: https://arxiv.org/abs/2001.04886

work page arXiv 2020

[8] [8]

Avoiding communication in sparse matrix computations,

J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick, “Avoiding communication in sparse matrix computations,” in 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1–12

work page 2008

[9] [9]

Communication-avoiding krylov subspace methods,

M. Hoemmen, “Communication-avoiding krylov subspace methods,” Ph.D. dissertation, USA, 2010, aAI3413388

work page 2010

[10] [10]

Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster,

I. Yamazaki, S. Rajamanickam, E. G. Boman, M. Hoemmen, M. A. Heroux, and S. Tomov, “Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster,” in SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , 2014, pp. 933–944

work page 2014

[11] [11]

With extreme computing, the rules have changed,

J. Dongarra, S. Tomov, P. Luszczek, J. Kurzak, M. Gates, I. Yamazaki, H. Anzt, A. Haidar, and A. Abdelfattah, “With extreme computing, the rules have changed,” Computing in Science Engineering, vol. 19, no. 3, pp. 52–62, 2017

work page 2017

[12] [12]

Improving performance of GMRES by reducing communication and pipelining global collectives,

I. Yamazaki, M. Hoemmen, P. Luszczek, and J. Dongarra, “Improving performance of GMRES by reducing communication and pipelining global collectives,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) , 2017, pp. 1118– 1127. 3AOCL-BLIS was compiled with gcc v10.2.0 as the library did not support our de-facto Intel c...

work page 2017

[13] [13]

Auto-tuning stencil codes for cache-based multicore platforms,

K. Datta, “Auto-tuning stencil codes for cache-based multicore platforms,” Ph.D. dissertation, USA, 2009, aAI3411221

work page 2009

[14] [14]

Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication,

C. L. Alappat, G. Hager, O. Schenk, and G. Wellein, “Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication,” 2022. [Online]. Available: https://arxiv.org/abs/2205.01598

work page arXiv 2022

[15] [15]

Recursive Algebraic Coloring Engine library,

C. Alappat, Recursive Algebraic Coloring Engine library, 2019 (accessed May 2, 2022). [Online]. Available: https://github.com/RRZE-HPC/RACE

work page 2019

[16] [16]

Exawind: A multifidelity modeling and simulation environment for wind energy,

M. A. Sprague, S. Ananthan, G. Vijayakumar, and M. Robinson, “Exawind: A multifidelity modeling and simulation environment for wind energy,” Journal of Physics: Conference Series, vol. 1452, no. 1, p. 012071, Jan. 2020. [Online]. Available: https://dx.doi.org/10.1088/1742-6596/1452/1/012071

work page doi:10.1088/1742-6596/1452/1/012071 2020

[17] [17]

Top 500: June 2022 list

“Top 500: June 2022 list.” [Online]. Available: https://top500.org/lists/top500/2022/06/

work page 2022

[18] [18]

The University of Florida Sparse Matrix Collection,

T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011, website: http://suitesparse-collection-website.herokuapp.com. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663

work page doi:10.1145/2049662.2049663 2011

[19] [19]

Understanding HPC benchmark performance on Intel Broadwell and Cascade Lake processors,

C. L. Alappat, J. Hofmann, G. Hager, H. Fehske, A. R. Bishop, and G. Wellein, “Understanding HPC benchmark performance on Intel Broadwell and Cascade Lake processors,” in High Performance Computing, P. Sadayappan, B. L. Chamberlain, G. Juckeland, and H. Ltaief, Eds. Cham: Springer International Publishing, 2020, pp. 412–433

work page 2020

[20] [20]

RACE version used for experiments

“RACE version used for experiments.” [Online]. Available: https://github.com/RRZE-HPC/RACE/tree/v0.8.0

work page

[21] [21]

Modified Trilinos version used for experiments

“Modified Trilinos version used for experiments.” [Online]. Available: https://github.com/christiealappatt/TrilRACE/commit/119adc404d5c5d7f965970d86ec8a91205ab247a

work page

[22] [22]

Intel Math Kernel Library,

Intel, “Intel Math Kernel Library,” 2022. [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html

work page 2022

[23] [23]

MKL hack for AMD CPUs,

“MKL hack for AMD CPUs,” accessed on 27.03.2023. [Online]. Available: https://doc.zih.tu-dresden.de/jobs_and_resources/rome_nodes/

work page 2023

[24] [24]

AOCL-BLIS,

AMD, “AOCL-BLIS,” 2022. [Online]. Available: https://developer.amd.com/amd-aocl/blas-library/

work page 2022

[25] [25]

BLIS: A framework for rapidly instantiating BLAS functionality,

F. G. Van Zee and R. A. van de Geijn, “BLIS: A framework for rapidly instantiating BLAS functionality,” ACM Transactions on Mathematical Software, vol. 41, no. 3, pp. 14:1–14:33, June 2015. [Online]. Available: http://doi.acm.org/10.1145/2764454

work page doi:10.1145/2764454 2015

[26] [26]

J. A. Loe, H. K. Thornquist, and E. G. Boman, Polynomial Preconditioned GMRES in Trilinos: Practical Considerations for High-Performance Computing, pp. 35–45. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611976137.4

work page doi:10.1137/1.9781611976137.4

[27] [27]

Two-stage Gauss-Seidel preconditioners and smoothers for Krylov solvers on a GPU cluster,

L. Berger-Vergiat, B. Kelley, S. Rajamanickam, J. J. Hu, K. Swirydowicz, P. Mullowney, S. J. Thomas, and I. Yamazaki, “Two-stage Gauss-Seidel preconditioners and smoothers for Krylov solvers on a GPU cluster,” ArXiv, vol. abs/2104.01196, 2021

work page arXiv 2021

[28] [28]

OpenMP: An industry-standard API for shared-memory programming,

L. Dagum and R. Menon, “OpenMP: An industry-standard API for shared-memory programming,” IEEE Comput. Sci. Eng., vol. 5, no. 1, pp. 46–55, Jan. 1998. [Online]. Available: https://doi.org/10.1109/99.660313

work page doi:10.1109/99.660313 1998

[29] [29]

GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems,

Y. Saad and M. H. Schultz, “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 3, pp. 856–869, 1986. [Online]. Available: https://doi.org/10.1137/0907058

work page doi:10.1137/0907058 1986

[30] [30]

Improving the performance of CA-GMRES on multicores with multiple GPUs,

I. Yamazaki, H. Anzt, S. Tomov, M. Hoemmen, and J. Dongarra, “Improving the performance of CA-GMRES on multicores with multiple GPUs,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 382–391

work page 2014

[31] [31]

Minimizing communication in sparse matrix solvers,

M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick, “Minimizing communication in sparse matrix solvers,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC ’09. New York, NY, USA: Association for Computing Machinery, 2009. [Online]. Available: https://doi.org/10.1145/1654059.1654096

work page doi:10.1145/1654059.1654096 2009

[32] [32]

Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems,

E. Bavier, M. Hoemmen, S. Rajamanickam, and H. Thornquist, “Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems,” Sci. Program., vol. 20, pp. 241–255, 2012

work page 2012

[33] [33]

Parallel S.O.R. iterative methods,

D. Evans, “Parallel S.O.R. iterative methods,” Parallel Computing, vol. 1, no. 1, pp. 3–18, 1984. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819184903806

work page 1984

[34] [34]

Solving sparse triangular linear systems on parallel computers,

E. Anderson and Y. Saad, “Solving sparse triangular linear systems on parallel computers,” Int. J. High Speed Comput., vol. 1, no. 1, pp. 73–95, Apr. 1989. [Online]. Available: https://doi.org/10.1142/S0129053389000056

work page doi:10.1142/s0129053389000056 1989

[35] [35]

Convergence of nested classical iterative methods for linear systems,

P. J. Lanzkron, D. J. Rose, and D. B. Szyld, “Convergence of nested classical iterative methods for linear systems,” Numerische Mathematik, vol. 58, no. 1, pp. 685–702, 1990. [Online]. Available: https://doi.org/10.1007/BF01385649

work page doi:10.1007/bf01385649 1990

[36] [36]

Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning,

E. Chow, H. Anzt, J. Scott, and J. Dongarra, “Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning,” Journal of Parallel and Distributed Computing, vol. 119, pp. 219–230, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731518303034

work page 2018

[37] [37]

Ifpack2 User’s Guide 1.0,

A. Prokopenko, C. M. Siefert, J. J. Hu, M. Hoemmen, and A. Klinvex, “Ifpack2 User’s Guide 1.0,” Sandia National Labs, Tech. Rep. SAND2016-5338, 2016

work page 2016

[38] [38]

Polynomial preconditioners for conjugate gradient calculations,

O. G. Johnson, C. A. Micchelli, and G. Paul, “Polynomial preconditioners for conjugate gradient calculations,” SIAM Journal on Numerical Analysis, vol. 20, no. 2, pp. 362–376, 1983. [Online]. Available: https://doi.org/10.1137/0720025

work page doi:10.1137/0720025 1983

[40] [40]

Least squares polynomials in the complex plane and their use for solving nonsymmetric linear systems,

Y. Saad, “Least squares polynomials in the complex plane and their use for solving nonsymmetric linear systems,” SIAM Journal on Numerical Analysis, vol. 24, no. 1, pp. 155–169, 1987. [Online]. Available: http://www.jstor.org/stable/2157392

work page 1987

[41] [41]

Toward efficient polynomial preconditioning for GMRES,

J. A. Loe and R. B. Morgan, “Toward efficient polynomial preconditioning for GMRES,” Numerical Linear Algebra with Applications, vol. 29, no. 4, p. e2427, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.2427

work page doi:10.1002/nla.2427 2022

[42] [42]

Proxy-GMRES: Preconditioning via GMRES in polynomial space,

X. Ye, Y. Xi, and Y. Saad, “Proxy-GMRES: Preconditioning via GMRES in polynomial space,” SIAM Journal on Matrix Analysis and Applications, vol. 42, no. 3, pp. 1248–1267, 2021. [Online]. Available: https://doi.org/10.1137/20M1342562

work page doi:10.1137/20m1342562 2021

[43] [43]

Improved seed methods for symmetric positive definite linear equations with multiple right-hand sides,

A. M. Abdel-Rehim, R. B. Morgan, and W. Wilcox, “Improved seed methods for symmetric positive definite linear equations with multiple right-hand sides,” Numerical Linear Algebra with Applications, vol. 21, no. 3, pp. 453–471, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.1892

work page doi:10.1002/nla.1892 2014

[44] [44]

Multi-level adaptive solutions to boundary-value problems,

A. Brandt, “Multi-level adaptive solutions to boundary-value problems,” Mathematics of Computation, vol. 31, no. 138, pp. 333–390, 1977. [Online]. Available: http://www.jstor.org/stable/2006422

work page 1977

[45] [45]

An introduction to algebraic multigrid,

R. Falgout, “An introduction to algebraic multigrid,” Computing in Science & Engineering, vol. 8, no. 6, pp. 24–33, 2006

work page 2006

[46] [46]

A comparison of classical and aggregation-based algebraic multigrid preconditioners for high-fidelity simulation of wind turbine incompressible flows,

S. J. Thomas, S. Ananthan, S. Yellapantula, J. J. Hu, M. Lawson, and M. A. Sprague, “A comparison of classical and aggregation-based algebraic multigrid preconditioners for high-fidelity simulation of wind turbine incompressible flows,” SIAM Journal on Scientific Computing, vol. 41, no. 5, pp. S196–S219, 2019. [Online]. Available: https://doi.org/10.1137...

work page doi:10.1137/18m1179018 2019

[47] [47]

Acceleration of convergence of a two-level algebraic algorithm by aggregation in smoothing process,

S. Míka and P. Vaněk, “Acceleration of convergence of a two-level algebraic algorithm by aggregation in smoothing process,” Applications of Mathematics, vol. 37, no. 5, pp. 343–356, 1992. [Online]. Available: http://eudml.org/doc/15720

work page 1992

[48] [48]

MueLu user’s guide,

L. Berger-Vergiat, C. A. Glusa, J. J. Hu, M. Mayr, A. Prokopenko, C. M. Siefert, R. S. Tuminaro, and T. A. Wiesner, “MueLu user’s guide,” Sandia National Laboratories, Tech. Rep. SAND2019-0537, 2019

work page 2019

[49] [49]

Parallel multigrid smoothing: polynomial versus Gauss–Seidel,

M. Adams, M. Brezina, J. Hu, and R. Tuminaro, “Parallel multigrid smoothing: polynomial versus Gauss–Seidel,” Journal of Computational Physics, vol. 188, no. 2, pp. 593–610, 2003. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0021999103001943

work page 2003

[50] [50]

N.-W. D. Team, Nalu-Wind Documentation, Release 1.2.0, November 2022. [Online]. Available: https://nalu-wind.readthedocs.io/_/downloads/en/latest/pdf/

work page 2022

[51] [51]

Performance portability of an SpMV kernel across scientific computing and data science applications,

S. L. Olivier, N. D. Ellingwood, J. Berry, and D. M. Dunlavy, “Performance portability of an SpMV kernel across scientific computing and data science applications,” in 2021 IEEE High Performance Extreme Computing Conference (HPEC), 2021, pp. 1–8

work page 2021

[52] [52]

Kokkos Kernels: Performance portable sparse/dense linear algebra and graph kernels,

S. Rajamanickam, S. Acer, L. Berger-Vergiat, V. Dang, N. Ellingwood, E. Harvey, B. Kelley, C. R. Trott, J. Wilke, and I. Yamazaki, “Kokkos Kernels: Performance portable sparse/dense linear algebra and graph kernels,” 2021. [Online]. Available: https://arxiv.org/abs/2103.11991

work page arXiv 2021

[53] [53]

A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication,

C. Alappat, A. Basermann, A. R. Bishop, H. Fehske, G. Hager, O. Schenk, J. Thies, and G. Wellein, “A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication,” ACM Trans. Parallel Comput., vol. 7, no. 3, Jun. 2020. [Online]. Available: https://doi.org/10.1145/3399732

work page doi:10.1145/3399732 2020