Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
Pith reviewed 2026-05-24 01:18 UTC · model grok-4.3
The pith
Extending RACE cache blocking to distributed memory by interleaving it with MPI communication yields up to 4x speed-up for matrix power kernels on large clusters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose and implement a Distributed Level-Blocked MPK that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme fulfilling all data dependencies among processes. This removes the prior restriction of RACE to single nodes and produces substantial speed-ups relative to traditional distributed-memory MPK on Intel and AMD architectures for a wide range of sparse matrices from scientific applications, including a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster in a quantum physics example.
What carries the argument
Interleaving of RACE level-based blocking with an explicit MPI communication and synchronization scheme for neighboring levels in the graph-based SpMV formulation.
If this is right
- Substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from scientific applications.
- Speed-up of up to 4x achieved on 832 cores of an Intel Sapphire Rapids cluster.
- Applicability shown for a modern quantum physics problem.
- Flexible handling of data dependencies that previously restricted cache blocking to shared memory.
Where Pith is reading between the lines
- The interleaving approach could extend RACE-style reuse to other distributed graph algorithms that rely on level traversals.
- Performance gains may reduce overall runtime and energy use in large-scale simulations that repeatedly call matrix power kernels.
- Scalability tests on clusters with different interconnects would clarify where communication overhead begins to dominate.
Load-bearing premise
Data dependencies among neighboring levels can be satisfied by explicit MPI communication interleaved with RACE blocking without prohibitive overhead.
What would settle it
A run of the new method on a matrix with high inter-process communication volume where total time exceeds that of the traditional distributed MPK due to synchronization costs.
Figures
read the original abstract
Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph based formulation of the SpMV and employs s level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a "traditional" distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the Recursive Algebraic Coloring Engine (RACE) cache-blocking approach for matrix power kernels (MPK) from shared-memory to distributed-memory settings. It proposes interleaving RACE's level-based SpMV formulation with an explicit MPI communication and synchronization scheme to resolve data dependencies among neighboring levels in the graph representation of the sparse matrix. The central claim is that this Distributed Level-Blocked MPK delivers substantial speed-ups (up to 4x on 832 cores of an Intel Sapphire Rapids cluster) versus a traditional distributed-memory parallel MPK across a range of sparse matrices from scientific applications, including a quantum-physics demonstration.
Significance. If the performance claims hold after verification that communication overhead remains sub-dominant, the work would meaningfully broaden the applicability of RACE-style cache blocking to large-scale distributed systems. This addresses a previously stated limitation of the technique and could benefit iterative solvers and eigenvalue methods that rely on repeated SpMV operations.
major comments (1)
- [implementation of the MPI communication scheme] The description of the interleaved MPI scheme (in the section presenting the Distributed Level-Blocked MPK) does not specify the number of synchronization points or halo-exchange volume per power iteration, nor does it report the fraction of runtime spent in MPI on the 832-core experiments. Without this breakdown, it is impossible to confirm that the explicit communication required to satisfy neighbor-level dependencies adds only negligible overhead relative to the cache-reuse gains, which is load-bearing for the 4x speedup attribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the single major comment below and will revise the manuscript to incorporate the requested details on the MPI scheme.
read point-by-point responses
-
Referee: [implementation of the MPI communication scheme] The description of the interleaved MPI scheme (in the section presenting the Distributed Level-Blocked MPK) does not specify the number of synchronization points or halo-exchange volume per power iteration, nor does it report the fraction of runtime spent in MPI on the 832-core experiments. Without this breakdown, it is impossible to confirm that the explicit communication required to satisfy neighbor-level dependencies adds only negligible overhead relative to the cache-reuse gains, which is load-bearing for the 4x speedup attribution.
Authors: We agree that quantifying the communication overhead is essential to substantiate the 4x speedup claims. The current manuscript focuses on the algorithmic interleaving of RACE levels with MPI but omits the requested metrics. In the revised version we will add a dedicated subsection (or table) in the experimental evaluation reporting: (i) the exact number of synchronization points per power iteration, (ii) the halo-exchange volume in bytes per iteration for the tested matrices, and (iii) the measured MPI time fraction on the 832-core Sapphire Rapids runs. These additions will allow direct verification that communication remains sub-dominant relative to the cache-reuse benefits. revision: yes
Circularity Check
No circularity in empirical implementation and benchmarking paper
full rationale
The manuscript presents an engineering contribution: an implementation that interleaves RACE level blocking with explicit MPI communication to enable distributed-memory MPK. All performance claims (speed-ups up to 4x) rest on direct wall-clock timing measurements across a suite of sparse matrices and two hardware platforms. No equations, first-principles derivations, fitted parameters, or analytic predictions appear in the provided text. Consequently no self-definitional, fitted-input, or self-citation-load-bearing steps exist that could reduce the reported results to their own inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The graph formulation of SpMV permits level-based blocking whose cross-process data dependencies can be resolved by explicit communication without destroying cache reuse benefits.
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...
-
[3]
Available: https://doi.org/10.1145/3399732
Alappat C, Basermann A, Bishop AR, Fehske H, Hager G, Schenk O, Thies J and Wellein G (2020 a ) A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication. ACM Trans. Parallel Comput. 7(3). doi:10.1145/3399732
-
[4]
IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18
Alappat C, Hager G, Schenk O and Wellein G (2022) Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication. IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18. doi:10.1109/TPDS.2022.3223512
-
[5]
Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs
Alappat C, Thies J, Hager G, Fehske H and Wellein G (2023) Algebraic temporal blocking for sparse iterative solvers on multi-core CPUs . arXiv:2309.02228, Submitted
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing
Alappat CL, Hofmann J, Hager G, Fehske H, Bishop AR and Wellein G (2020 b ) Understanding HPC benchmark performance on I ntel B roadwell and C ascade L ake processors. In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing. Cham: Springer International Publishing. ISBN 978-3-030-50743-5, pp. 412--433
work page 2020
-
[7]
Anderson PW (1958) Absence of diffusion in certain random lattices. Phys. Rev. 109: 1492--1505. doi:10.1103/PhysRev.109.1492
-
[8]
10 Almut Demel, Dominik Dürrschnabel, Tamara Mchedlidze, Marcel Radermacher, and Lasse Wulf
Davis TA and Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1). doi:10.1145/2049662.2049663
-
[9]
In: 2008 IEEE International Symposium on Parallel and Distributed Processing
Demmel J, Hoemmen M, Mohiyuddin M and Yelick K (2008) Avoiding communication in sparse matrix computations. In: 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1--12. doi:10.1109/IPDPS.2008.4536305
-
[10]
Physics Letters A 373(25): 2182--2188
Fehske H, Schleede J, Schubert G, Wellein G, Filinov VS and Bishop AR (2009) Numerical approaches to time evolution of complex quantum systems. Physics Letters A 373(25): 2182--2188. doi:10.1016/j.physleta.2009.04.022
-
[11]
Janarek J, Delande D, Cherroret N and Zakrzewski J (2020) Quantum boomerang effect for interacting particles. Phys. Rev. A 102: 013303. doi:10.1103/PhysRevA.102.013303
-
[12]
Janarek J, Gr\'emaud B, Zakrzewski J and Delande D (2022) Quantum boomerang effect in systems without time-reversal symmetry. Phys. Rev. B 105: L180202. doi:10.1103/PhysRevB.105.L180202
-
[13]
Karypis G and Kumar V (1998) METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices
work page 1998
-
[14]
Kreutzer M, Hager G, Wellein G, Fehske H and Bishop AR (2014) A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36(5): C401--C423. doi:10.1137/130930352
-
[15]
In: 2019 Computing in Cardiology (CinC)
Langguth J, Arevalo H, Hustad KG and Cai X (2019) Towards detailed real-time simulations of cardiac arrhythmia. In: 2019 Computing in Cardiology (CinC). pp. Page 1--Page 4. doi:10.22489/CinC.2019.301
-
[16]
Langguth J, Sourouri M, Lines GT, Baden SB and Cai X (2015) Scalable heterogeneous CPU - GPU computations for unstructured tetrahedral meshes. IEEE Micro 35(4): 6--15. doi:10.1109/MM.2015.70
-
[17]
In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP)
Loe JA, Thornquist HK and Boman EG (2020) Polynomial preconditioned gmres in trilinos: Practical considerations for high-performance computing. In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP). SIAM, pp. 35--45. doi:10.1137/1.9781611976137.4
-
[18]
Statistics and Computing 17: 395--416
Luxburg U (2004) A tutorial on spectral clustering. Statistics and Computing 17: 395--416. doi:10.1007/s11222-007-9033-z
-
[19]
Journal of Machine Learning Research 17(148): 1--5
McQueen J, Meil a M, VanderPlas J and Zhang Z (2016) Megaman: Scalable manifold learning in P ython. Journal of Machine Learning Research 17(148): 1--5. ://jmlr.org/papers/v17/16-109.html
work page 2016
-
[20]
In: 32th USENIX Security Symposium (USENIX Security 2023)
Moghimi D (2023) Downfall : Exploiting speculative data gathering. In: 32th USENIX Security Symposium (USENIX Security 2023)
work page 2023
-
[21]
Minimizing communication in sparse matrix solvers,
Mohiyuddin M, Hoemmen M, Demmel J and Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09. New York, NY, USA: Association for Computing Machinery. ISBN 9781605587448, pp. 1--12. doi:10.1145/1654059.1654096
-
[22]
Prat T, Delande D and Cherroret N (2019) Quantum boomeranglike effect of wave packets in random media. Phys. Rev. A 99: 023629. doi:10.1103/PhysRevA.99.023629
-
[23]
In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18
Simpson T, Pasadakis D, Kourounis D, Fujita K, Yamaguchi T, Ichimura T and Schenk O (2018) Balanced graph partition refinement using the graph p-laplacian. In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18. New York, NY, USA: Association for Computing Machinery. ISBN 9781450358910, pp. 1--11. doi:10.1145/3218176.3218232
-
[24]
The Journal of Chemical Physics 81(9): 3967--3971
Tal‐Ezer H and Kosloff R (1984) An accurate and efficient scheme for propagating the time dependent S chr \"o dinger equation. The Journal of Chemical Physics 81(9): 3967--3971. doi:10.1063/1.448136
-
[25]
Treibig J, Hager G and Wellein G (2010) LIKWID : A lightweight performance-oriented tool suite for x86 multicore environments. In: 2010 39th International Conference on Parallel Processing Workshops. pp. 207--216. doi:10.1109/ICPPW.2010.38
-
[26]
Vatai E, Singhal U and Suda R (2020) Diamond matrix powers kernels. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020. New York, NY, USA: Association for Computing Machinery. ISBN 9781450372367, p. 102–113. doi:10.1145/3368474.3368494
-
[27]
Vuduc RW and Demmel JW (2003) Automatic performance tuning of sparse matrix kernels. PhD Thesis. AAI3121741
work page 2003
-
[28]
Williams S, Waterman A and Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. doi:10.1145/1498765.1498785
-
[29]
In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium
Yamazaki I, Anzt H, Tomov S, Hoemmen M and Dongarra J (2014 a ) Improving the performance of CA - GMRES on multicores with multiple GPUs . In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium. pp. 382--391. doi:10.1109/IPDPS.2014.48
-
[30]
Yamazaki I, Rajamanickam S, Boman EG, Hoemmen M, Heroux MA and Tomov S (2014 b ) Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU / GPU cluster. In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 933--944. doi:10.1109/SC.2014.81
-
[31]
Zambetaki I, Li Q, Economou EN and Soukoulis CM (1997) Localization in weakly coupled planes and weakly coupled wires. Phys. Rev. B 56: 12221--12231. doi:10.1103/PhysRevB.56.12221
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.