pith. sign in

arxiv: 2405.12525 · v2 · submitted 2024-05-21 · 💻 cs.DC · cs.PF

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Pith reviewed 2026-05-24 01:18 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords matrix power kernelcache blockingdistributed memoryMPI communicationsparse matrix-vector productRACEperformance optimization
0
0 comments X

The pith

Extending RACE cache blocking to distributed memory by interleaving it with MPI communication yields up to 4x speed-up for matrix power kernels on large clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for cache blocking in matrix power kernels that span multiple compute nodes. It extends the Recursive Algebraic Coloring Engine, previously limited to shared memory, by pairing its level-based data reuse with an MPI scheme that meets dependencies between processes. This matters because repeated sparse matrix-vector products strain memory bandwidth in many scientific codes, and reuse opportunities had been blocked by the distributed setting. The resulting Distributed Level-Blocked MPK delivers consistent gains on modern Intel and AMD systems across matrices from varied applications. One demonstration reaches a 4x improvement on 832 cores for a quantum physics calculation.

Core claim

The authors propose and implement a Distributed Level-Blocked MPK that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme fulfilling all data dependencies among processes. This removes the prior restriction of RACE to single nodes and produces substantial speed-ups relative to traditional distributed-memory MPK on Intel and AMD architectures for a wide range of sparse matrices from scientific applications, including a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster in a quantum physics example.

What carries the argument

Interleaving of RACE level-based blocking with an explicit MPI communication and synchronization scheme for neighboring levels in the graph-based SpMV formulation.

If this is right

  • Substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from scientific applications.
  • Speed-up of up to 4x achieved on 832 cores of an Intel Sapphire Rapids cluster.
  • Applicability shown for a modern quantum physics problem.
  • Flexible handling of data dependencies that previously restricted cache blocking to shared memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The interleaving approach could extend RACE-style reuse to other distributed graph algorithms that rely on level traversals.
  • Performance gains may reduce overall runtime and energy use in large-scale simulations that repeatedly call matrix power kernels.
  • Scalability tests on clusters with different interconnects would clarify where communication overhead begins to dominate.

Load-bearing premise

Data dependencies among neighboring levels can be satisfied by explicit MPI communication interleaved with RACE blocking without prohibitive overhead.

What would settle it

A run of the new method on a matrix with high inter-process communication volume where total time exceeds that of the traditional distributed MPK due to synchronization costs.

Figures

Figures reproduced from arXiv: 2405.12525 by Christie L. Alappat, Dane C. Lacey, Florian Lange, Georg Hager, Gerhard Wellein, Holger Fehske.

Figure 1
Figure 1. Figure 1: Graph (a) and sparsity pattern (b) of the matrix associated with a modified 5-point stencil. Graph (c) shows the permuted graph and (d) the sparsity pattern of the matrix after applying Breadth First Search (BFS) reordering. The vertices (rows) of the graph (matrix) that belong to a level are represented with the same color. 0 2 5 9 14 1 4 8 13 19 3 7 12 18 24 6 11 17 23 29 10 16 22 28 34 15 21 27 33 39 20… view at source ↗
Figure 2
Figure 2. Figure 2: Lp diagram with 10 levels (L(0), . . . , L(9)) and a maximum power of pm = 5. Level colors are the same as in Figure 1c. Each node in the Lp diagram is numbered according to the execution order. For p = 4 and level L(6), the explicit dependencies to levels at p = 3 are indicated with red arrows. The nodes highlighted in orange fulfill i + p = 6 (“diagonal”). indices of the non-zero elements in row v have a… view at source ↗
Figure 3
Figure 3. Figure 3: The global matrix A from Figure 1b and some RHS vector x are partitioned in a row-wise manner over two MPI processes in (a). The gray boxed-out regions show, on each MPI process, which elements are “local” (inside the gray region). The edge corresponding to the remote data dependency, i.e., the edge crossing the MPI boundary, is highlighted in blue in (b). The rows at global indices 8 and 12 are highlighte… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of three MPK implementations for the computation of A 3 x on a 1D tri-diagonal stencil matrix, distributed across two MPI processes, where the execution order is written in each node. The traditional MPK implementation of back-to-back SpMVs is shown in (a), the “Communication Avoiding” MPK with redundant SpMVs in (b), and our implementation of DLB-MPK is shown in (c). Each dot represents a verte… view at source ↗
Figure 5
Figure 5. Figure 5: Overheads in CA-MPK associated with the Serena matrix, partitioned over 10 and 15 MPI processes for powers p ∈ {1, 2, . . . , 12}. Left: additional halo elements relative to the total number of rows Nr. Right: recomputed elements relative to the total number of non-zero elements Nnz. one step. This remainder phase is repeated for a total of pm − 1 times to ensure all internal vertices reach power pm. Figur… view at source ↗
Figure 7
Figure 7. Figure 7: Full-node measured load bandwidths in GB/s (y-axis) vs. data set size. The higher solid horizontal line represents the estimated L3 cache bandwidths and the lower one represents the estimated bandwidth from main memory. The dashed red line marks the overall L2 cache size for the entire node, while the dotted blue line represents the aggregate L2+L3 cache size for the entire node. The widest SIMD registers … view at source ↗
Figure 8
Figure 8. Figure 8: Parameter study with ML Geer on one ICL node, scanning p ∈ {1, 2, . . . , 10} and C ∈ {30, 35, . . . , 75}. maximum performance of DLB-MPK with and without recursion as the representative performance metric. All numerical results are validated against Intel’s Math Kernel Library6 . Benchmarks are repeated several times, and the median performance is taken as the representative performance metric. Error bar… view at source ↗
Figure 9
Figure 9. Figure 9: Node-level performance summary for benchmark matrices in [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Single- (left) and multi-node (right) strong scaling performance and overhead results for Lynx1151 and nlpkkt240 on SPR nodes. Prepared using sagej.cls [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Weak scaling investigation using the Anderson matrices in [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
Figure 11
Figure 11. Figure 11: Time evolution of a wave packet (Eq. (9)) with width σ = 20 and momentum k0 = π 2 ex. Panel (a) shows the time-dependent density distribution in the localized regime with parameters t⊥/t = 0.001 and W/t = 1. The center-of-mass motion is displayed in panel (b), which also includes data for a delocalized system with t⊥/t = 0.1. We used a finite rectangular system with dimensions Ly = Lz = 100 and Lx = 3000 … view at source ↗
read the original abstract

Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph based formulation of the SpMV and employs s level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a "traditional" distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper extends the Recursive Algebraic Coloring Engine (RACE) cache-blocking approach for matrix power kernels (MPK) from shared-memory to distributed-memory settings. It proposes interleaving RACE's level-based SpMV formulation with an explicit MPI communication and synchronization scheme to resolve data dependencies among neighboring levels in the graph representation of the sparse matrix. The central claim is that this Distributed Level-Blocked MPK delivers substantial speed-ups (up to 4x on 832 cores of an Intel Sapphire Rapids cluster) versus a traditional distributed-memory parallel MPK across a range of sparse matrices from scientific applications, including a quantum-physics demonstration.

Significance. If the performance claims hold after verification that communication overhead remains sub-dominant, the work would meaningfully broaden the applicability of RACE-style cache blocking to large-scale distributed systems. This addresses a previously stated limitation of the technique and could benefit iterative solvers and eigenvalue methods that rely on repeated SpMV operations.

major comments (1)
  1. [implementation of the MPI communication scheme] The description of the interleaved MPI scheme (in the section presenting the Distributed Level-Blocked MPK) does not specify the number of synchronization points or halo-exchange volume per power iteration, nor does it report the fraction of runtime spent in MPI on the 832-core experiments. Without this breakdown, it is impossible to confirm that the explicit communication required to satisfy neighbor-level dependencies adds only negligible overhead relative to the cache-reuse gains, which is load-bearing for the 4x speedup attribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the single major comment below and will revise the manuscript to incorporate the requested details on the MPI scheme.

read point-by-point responses
  1. Referee: [implementation of the MPI communication scheme] The description of the interleaved MPI scheme (in the section presenting the Distributed Level-Blocked MPK) does not specify the number of synchronization points or halo-exchange volume per power iteration, nor does it report the fraction of runtime spent in MPI on the 832-core experiments. Without this breakdown, it is impossible to confirm that the explicit communication required to satisfy neighbor-level dependencies adds only negligible overhead relative to the cache-reuse gains, which is load-bearing for the 4x speedup attribution.

    Authors: We agree that quantifying the communication overhead is essential to substantiate the 4x speedup claims. The current manuscript focuses on the algorithmic interleaving of RACE levels with MPI but omits the requested metrics. In the revised version we will add a dedicated subsection (or table) in the experimental evaluation reporting: (i) the exact number of synchronization points per power iteration, (ii) the halo-exchange volume in bytes per iteration for the tested matrices, and (iii) the measured MPI time fraction on the 832-core Sapphire Rapids runs. These additions will allow direct verification that communication remains sub-dominant relative to the cache-reuse benefits. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical implementation and benchmarking paper

full rationale

The manuscript presents an engineering contribution: an implementation that interleaves RACE level blocking with explicit MPI communication to enable distributed-memory MPK. All performance claims (speed-ups up to 4x) rest on direct wall-clock timing measurements across a suite of sparse matrices and two hardware platforms. No equations, first-principles derivations, fitted parameters, or analytic predictions appear in the provided text. Consequently no self-definitional, fitted-input, or self-citation-load-bearing steps exist that could reduce the reported results to their own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms are identifiable from the abstract; the method relies on existing RACE graph formulation and standard MPI primitives.

axioms (1)
  • domain assumption The graph formulation of SpMV permits level-based blocking whose cross-process data dependencies can be resolved by explicit communication without destroying cache reuse benefits.
    This assumption enables the interleaving strategy described as the key technical contribution.

pith-pipeline@v0.9.0 · 5829 in / 1247 out tokens · 33865 ms · 2026-05-24T01:18:45.628213+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

  3. [3]

    Available: https://doi.org/10.1145/3399732

    Alappat C, Basermann A, Bishop AR, Fehske H, Hager G, Schenk O, Thies J and Wellein G (2020 a ) A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication. ACM Trans. Parallel Comput. 7(3). doi:10.1145/3399732

  4. [4]

    IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18

    Alappat C, Hager G, Schenk O and Wellein G (2022) Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication. IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18. doi:10.1109/TPDS.2022.3223512

  5. [5]

    Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

    Alappat C, Thies J, Hager G, Fehske H and Wellein G (2023) Algebraic temporal blocking for sparse iterative solvers on multi-core CPUs . arXiv:2309.02228, Submitted

  6. [6]

    In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing

    Alappat CL, Hofmann J, Hager G, Fehske H, Bishop AR and Wellein G (2020 b ) Understanding HPC benchmark performance on I ntel B roadwell and C ascade L ake processors. In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing. Cham: Springer International Publishing. ISBN 978-3-030-50743-5, pp. 412--433

  7. [7]

    Anderson PW (1958) Absence of diffusion in certain random lattices. Phys. Rev. 109: 1492--1505. doi:10.1103/PhysRev.109.1492

  8. [8]

    10 Almut Demel, Dominik Dürrschnabel, Tamara Mchedlidze, Marcel Radermacher, and Lasse Wulf

    Davis TA and Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1). doi:10.1145/2049662.2049663

  9. [9]

    In: 2008 IEEE International Symposium on Parallel and Distributed Processing

    Demmel J, Hoemmen M, Mohiyuddin M and Yelick K (2008) Avoiding communication in sparse matrix computations. In: 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1--12. doi:10.1109/IPDPS.2008.4536305

  10. [10]

    Physics Letters A 373(25): 2182--2188

    Fehske H, Schleede J, Schubert G, Wellein G, Filinov VS and Bishop AR (2009) Numerical approaches to time evolution of complex quantum systems. Physics Letters A 373(25): 2182--2188. doi:10.1016/j.physleta.2009.04.022

  11. [11]

    Janarek J, Delande D, Cherroret N and Zakrzewski J (2020) Quantum boomerang effect for interacting particles. Phys. Rev. A 102: 013303. doi:10.1103/PhysRevA.102.013303

  12. [12]

    Janarek J, Gr\'emaud B, Zakrzewski J and Delande D (2022) Quantum boomerang effect in systems without time-reversal symmetry. Phys. Rev. B 105: L180202. doi:10.1103/PhysRevB.105.L180202

  13. [13]

    Karypis G and Kumar V (1998) METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices

  14. [14]

    A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,

    Kreutzer M, Hager G, Wellein G, Fehske H and Bishop AR (2014) A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36(5): C401--C423. doi:10.1137/130930352

  15. [15]

    In: 2019 Computing in Cardiology (CinC)

    Langguth J, Arevalo H, Hustad KG and Cai X (2019) Towards detailed real-time simulations of cardiac arrhythmia. In: 2019 Computing in Cardiology (CinC). pp. Page 1--Page 4. doi:10.22489/CinC.2019.301

  16. [16]

    IEEE Micro 35(4): 6--15

    Langguth J, Sourouri M, Lines GT, Baden SB and Cai X (2015) Scalable heterogeneous CPU - GPU computations for unstructured tetrahedral meshes. IEEE Micro 35(4): 6--15. doi:10.1109/MM.2015.70

  17. [17]

    In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP)

    Loe JA, Thornquist HK and Boman EG (2020) Polynomial preconditioned gmres in trilinos: Practical considerations for high-performance computing. In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP). SIAM, pp. 35--45. doi:10.1137/1.9781611976137.4

  18. [18]

    Statistics and Computing 17: 395--416

    Luxburg U (2004) A tutorial on spectral clustering. Statistics and Computing 17: 395--416. doi:10.1007/s11222-007-9033-z

  19. [19]

    Journal of Machine Learning Research 17(148): 1--5

    McQueen J, Meil a M, VanderPlas J and Zhang Z (2016) Megaman: Scalable manifold learning in P ython. Journal of Machine Learning Research 17(148): 1--5. ://jmlr.org/papers/v17/16-109.html

  20. [20]

    In: 32th USENIX Security Symposium (USENIX Security 2023)

    Moghimi D (2023) Downfall : Exploiting speculative data gathering. In: 32th USENIX Security Symposium (USENIX Security 2023)

  21. [21]

    Minimizing communication in sparse matrix solvers,

    Mohiyuddin M, Hoemmen M, Demmel J and Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09. New York, NY, USA: Association for Computing Machinery. ISBN 9781605587448, pp. 1--12. doi:10.1145/1654059.1654096

  22. [22]

    Prat T, Delande D and Cherroret N (2019) Quantum boomeranglike effect of wave packets in random media. Phys. Rev. A 99: 023629. doi:10.1103/PhysRevA.99.023629

  23. [23]

    In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18

    Simpson T, Pasadakis D, Kourounis D, Fujita K, Yamaguchi T, Ichimura T and Schenk O (2018) Balanced graph partition refinement using the graph p-laplacian. In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18. New York, NY, USA: Association for Computing Machinery. ISBN 9781450358910, pp. 1--11. doi:10.1145/3218176.3218232

  24. [24]

    The Journal of Chemical Physics 81(9): 3967--3971

    Tal‐Ezer H and Kosloff R (1984) An accurate and efficient scheme for propagating the time dependent S chr \"o dinger equation. The Journal of Chemical Physics 81(9): 3967--3971. doi:10.1063/1.448136

  25. [25]

    Treibig, G

    Treibig J, Hager G and Wellein G (2010) LIKWID : A lightweight performance-oriented tool suite for x86 multicore environments. In: 2010 39th International Conference on Parallel Processing Workshops. pp. 207--216. doi:10.1109/ICPPW.2010.38

  26. [26]

    In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020

    Vatai E, Singhal U and Suda R (2020) Diamond matrix powers kernels. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020. New York, NY, USA: Association for Computing Machinery. ISBN 9781450372367, p. 102–113. doi:10.1145/3368474.3368494

  27. [27]

    PhD Thesis

    Vuduc RW and Demmel JW (2003) Automatic performance tuning of sparse matrix kernels. PhD Thesis. AAI3121741

  28. [28]

    Williams S, Waterman A and Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. doi:10.1145/1498765.1498785

  29. [29]

    In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium

    Yamazaki I, Anzt H, Tomov S, Hoemmen M and Dongarra J (2014 a ) Improving the performance of CA - GMRES on multicores with multiple GPUs . In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium. pp. 382--391. doi:10.1109/IPDPS.2014.48

  30. [30]

    In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

    Yamazaki I, Rajamanickam S, Boman EG, Hoemmen M, Heroux MA and Tomov S (2014 b ) Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU / GPU cluster. In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 933--944. doi:10.1109/SC.2014.81

  31. [31]

    Zambetaki I, Li Q, Economou EN and Soukoulis CM (1997) Localization in weakly coupled planes and weakly coupled wires. Phys. Rev. B 56: 12221--12231. doi:10.1103/PhysRevB.56.12221