Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Christie L. Alappat; Dane C. Lacey; Florian Lange; Georg Hager; Gerhard Wellein; Holger Fehske

arxiv: 2405.12525 · v2 · submitted 2024-05-21 · 💻 cs.DC · cs.PF

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Dane C. Lacey , Christie L. Alappat , Florian Lange , Georg Hager , Holger Fehske , Gerhard Wellein This is my paper

Pith reviewed 2026-05-24 01:18 UTC · model grok-4.3

classification 💻 cs.DC cs.PF

keywords matrix power kernelcache blockingdistributed memoryMPI communicationsparse matrix-vector productRACEperformance optimization

0 comments

The pith

Extending RACE cache blocking to distributed memory by interleaving it with MPI communication yields up to 4x speed-up for matrix power kernels on large clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for cache blocking in matrix power kernels that span multiple compute nodes. It extends the Recursive Algebraic Coloring Engine, previously limited to shared memory, by pairing its level-based data reuse with an MPI scheme that meets dependencies between processes. This matters because repeated sparse matrix-vector products strain memory bandwidth in many scientific codes, and reuse opportunities had been blocked by the distributed setting. The resulting Distributed Level-Blocked MPK delivers consistent gains on modern Intel and AMD systems across matrices from varied applications. One demonstration reaches a 4x improvement on 832 cores for a quantum physics calculation.

Core claim

The authors propose and implement a Distributed Level-Blocked MPK that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme fulfilling all data dependencies among processes. This removes the prior restriction of RACE to single nodes and produces substantial speed-ups relative to traditional distributed-memory MPK on Intel and AMD architectures for a wide range of sparse matrices from scientific applications, including a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster in a quantum physics example.

What carries the argument

Interleaving of RACE level-based blocking with an explicit MPI communication and synchronization scheme for neighboring levels in the graph-based SpMV formulation.

If this is right

Substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from scientific applications.
Speed-up of up to 4x achieved on 832 cores of an Intel Sapphire Rapids cluster.
Applicability shown for a modern quantum physics problem.
Flexible handling of data dependencies that previously restricted cache blocking to shared memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The interleaving approach could extend RACE-style reuse to other distributed graph algorithms that rely on level traversals.
Performance gains may reduce overall runtime and energy use in large-scale simulations that repeatedly call matrix power kernels.
Scalability tests on clusters with different interconnects would clarify where communication overhead begins to dominate.

Load-bearing premise

Data dependencies among neighboring levels can be satisfied by explicit MPI communication interleaved with RACE blocking without prohibitive overhead.

What would settle it

A run of the new method on a matrix with high inter-process communication volume where total time exceeds that of the traditional distributed MPK due to synchronization costs.

Figures

Figures reproduced from arXiv: 2405.12525 by Christie L. Alappat, Dane C. Lacey, Florian Lange, Georg Hager, Gerhard Wellein, Holger Fehske.

**Figure 1.** Figure 1: Graph (a) and sparsity pattern (b) of the matrix associated with a modified 5-point stencil. Graph (c) shows the permuted graph and (d) the sparsity pattern of the matrix after applying Breadth First Search (BFS) reordering. The vertices (rows) of the graph (matrix) that belong to a level are represented with the same color. 0 2 5 9 14 1 4 8 13 19 3 7 12 18 24 6 11 17 23 29 10 16 22 28 34 15 21 27 33 39 20… view at source ↗

**Figure 2.** Figure 2: Lp diagram with 10 levels (L(0), . . . , L(9)) and a maximum power of pm = 5. Level colors are the same as in Figure 1c. Each node in the Lp diagram is numbered according to the execution order. For p = 4 and level L(6), the explicit dependencies to levels at p = 3 are indicated with red arrows. The nodes highlighted in orange fulfill i + p = 6 (“diagonal”). indices of the non-zero elements in row v have a… view at source ↗

**Figure 3.** Figure 3: The global matrix A from Figure 1b and some RHS vector x are partitioned in a row-wise manner over two MPI processes in (a). The gray boxed-out regions show, on each MPI process, which elements are “local” (inside the gray region). The edge corresponding to the remote data dependency, i.e., the edge crossing the MPI boundary, is highlighted in blue in (b). The rows at global indices 8 and 12 are highlighte… view at source ↗

**Figure 4.** Figure 4: Comparison of three MPK implementations for the computation of A 3 x on a 1D tri-diagonal stencil matrix, distributed across two MPI processes, where the execution order is written in each node. The traditional MPK implementation of back-to-back SpMVs is shown in (a), the “Communication Avoiding” MPK with redundant SpMVs in (b), and our implementation of DLB-MPK is shown in (c). Each dot represents a verte… view at source ↗

**Figure 5.** Figure 5: Overheads in CA-MPK associated with the Serena matrix, partitioned over 10 and 15 MPI processes for powers p ∈ {1, 2, . . . , 12}. Left: additional halo elements relative to the total number of rows Nr. Right: recomputed elements relative to the total number of non-zero elements Nnz. one step. This remainder phase is repeated for a total of pm − 1 times to ensure all internal vertices reach power pm. Figur… view at source ↗

**Figure 7.** Figure 7: Full-node measured load bandwidths in GB/s (y-axis) vs. data set size. The higher solid horizontal line represents the estimated L3 cache bandwidths and the lower one represents the estimated bandwidth from main memory. The dashed red line marks the overall L2 cache size for the entire node, while the dotted blue line represents the aggregate L2+L3 cache size for the entire node. The widest SIMD registers … view at source ↗

**Figure 8.** Figure 8: Parameter study with ML Geer on one ICL node, scanning p ∈ {1, 2, . . . , 10} and C ∈ {30, 35, . . . , 75}. maximum performance of DLB-MPK with and without recursion as the representative performance metric. All numerical results are validated against Intel’s Math Kernel Library6 . Benchmarks are repeated several times, and the median performance is taken as the representative performance metric. Error bar… view at source ↗

**Figure 9.** Figure 9: Node-level performance summary for benchmark matrices in [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

**Figure 10.** Figure 10: Single- (left) and multi-node (right) strong scaling performance and overhead results for Lynx1151 and nlpkkt240 on SPR nodes. Prepared using sagej.cls [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 12.** Figure 12: Weak scaling investigation using the Anderson matrices in [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗

**Figure 11.** Figure 11: Time evolution of a wave packet (Eq. (9)) with width σ = 20 and momentum k0 = π 2 ex. Panel (a) shows the time-dependent density distribution in the localized regime with parameters t⊥/t = 0.001 and W/t = 1. The center-of-mass motion is displayed in panel (b), which also includes data for a delocalized system with t⊥/t = 0.1. We used a finite rectangular system with dimensions Ly = Lz = 100 and Lx = 3000 … view at source ↗

read the original abstract

Sparse matrix-vector products (SpMVs) are a bottleneck in many scientific codes. Due to the heavy strain on the main memory interface from loading the sparse matrix and the possibly irregular memory access pattern, SpMV typically exhibits low arithmetic intensity. Repeating these products multiple times with the same matrix is required in many algorithms. This so-called matrix power kernel (MPK) provides an opportunity for data reuse since the same matrix data is loaded from main memory multiple times, an opportunity that has only recently been exploited successfully with the Recursive Algebraic Coloring Engine (RACE). Using RACE, one considers a graph based formulation of the SpMV and employs s level-based implementation of SpMV for reuse of relevant matrix data. However, the underlying data dependencies have restricted the use of this concept to shared memory parallelization and thus to single compute nodes. Enabling cache blocking for distributed-memory parallelization of MPK is challenging due to the need for explicit communication and synchronization of data in neighboring levels. In this work, we propose and implement a flexible method that interleaves the cache-blocking capabilities of RACE with an MPI communication scheme that fulfills all data dependencies among processes. Compared to a "traditional" distributed memory parallel MPK, our new Distributed Level-Blocked MPK yields substantial speed-ups on modern Intel and AMD architectures across a wide range of sparse matrices from various scientific applications. Finally, we address a modern quantum physics problem to demonstrate the applicability of our method, achieving a speed-up of up to 4x on 832 cores of an Intel Sapphire Rapids cluster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The distributed extension of RACE via interleaved MPI looks workable on paper but the overhead numbers are still the key missing piece for the 4x claim.

read the letter

The main point here is that the authors have taken the shared-memory RACE level blocking for matrix powers and made it work across MPI processes by interleaving the level traversals with explicit communication to satisfy neighbor dependencies. That interleaving step is the actual new piece; earlier RACE papers stayed inside one node. They test on a range of sparse matrices and show the method on both Intel Sapphire Rapids and AMD hardware, with a real quantum-physics application reaching 4x over a conventional distributed MPK on 832 cores. That is useful engineering if the numbers hold up under scrutiny. The baseline comparison to a traditional distributed MPK is the right one, and applying the technique to an actual simulation code is a step in the right direction. The implementation appears to be described at a level that other groups could try to reproduce. The soft spot is exactly the one the stress-test note flags. The abstract says the MPI scheme adds no prohibitive overhead, yet it gives no counts of synchronization points per power step, no breakdown of time spent in halo exchanges versus computation, and no indication of how message volume scales with the number of levels or the graph cut size. Without those measurements it is hard to know whether the reported speedups come from the cache reuse or simply from a communication pattern that happens to be cheap on the tested matrices and core counts. If the full paper includes per-run MPI time fractions and shows they stay low even when the level count or process count increases, the central claim strengthens; if those fractions are missing or high, the 4x figure needs re-examination. This is targeted at people who already run distributed sparse kernels and are looking for cache-blocking options. A performance engineer or someone maintaining an MPK inside a larger code would get practical value from the approach. It is worth sending to referees because the idea is concrete and the scaling results, once the communication details are checked, could be adopted. I would not desk-reject it.

Referee Report

1 major / 0 minor

Summary. The paper extends the Recursive Algebraic Coloring Engine (RACE) cache-blocking approach for matrix power kernels (MPK) from shared-memory to distributed-memory settings. It proposes interleaving RACE's level-based SpMV formulation with an explicit MPI communication and synchronization scheme to resolve data dependencies among neighboring levels in the graph representation of the sparse matrix. The central claim is that this Distributed Level-Blocked MPK delivers substantial speed-ups (up to 4x on 832 cores of an Intel Sapphire Rapids cluster) versus a traditional distributed-memory parallel MPK across a range of sparse matrices from scientific applications, including a quantum-physics demonstration.

Significance. If the performance claims hold after verification that communication overhead remains sub-dominant, the work would meaningfully broaden the applicability of RACE-style cache blocking to large-scale distributed systems. This addresses a previously stated limitation of the technique and could benefit iterative solvers and eigenvalue methods that rely on repeated SpMV operations.

major comments (1)

[implementation of the MPI communication scheme] The description of the interleaved MPI scheme (in the section presenting the Distributed Level-Blocked MPK) does not specify the number of synchronization points or halo-exchange volume per power iteration, nor does it report the fraction of runtime spent in MPI on the 832-core experiments. Without this breakdown, it is impossible to confirm that the explicit communication required to satisfy neighbor-level dependencies adds only negligible overhead relative to the cache-reuse gains, which is load-bearing for the 4x speedup attribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the single major comment below and will revise the manuscript to incorporate the requested details on the MPI scheme.

read point-by-point responses

Referee: [implementation of the MPI communication scheme] The description of the interleaved MPI scheme (in the section presenting the Distributed Level-Blocked MPK) does not specify the number of synchronization points or halo-exchange volume per power iteration, nor does it report the fraction of runtime spent in MPI on the 832-core experiments. Without this breakdown, it is impossible to confirm that the explicit communication required to satisfy neighbor-level dependencies adds only negligible overhead relative to the cache-reuse gains, which is load-bearing for the 4x speedup attribution.

Authors: We agree that quantifying the communication overhead is essential to substantiate the 4x speedup claims. The current manuscript focuses on the algorithmic interleaving of RACE levels with MPI but omits the requested metrics. In the revised version we will add a dedicated subsection (or table) in the experimental evaluation reporting: (i) the exact number of synchronization points per power iteration, (ii) the halo-exchange volume in bytes per iteration for the tested matrices, and (iii) the measured MPI time fraction on the 832-core Sapphire Rapids runs. These additions will allow direct verification that communication remains sub-dominant relative to the cache-reuse benefits. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical implementation and benchmarking paper

full rationale

The manuscript presents an engineering contribution: an implementation that interleaves RACE level blocking with explicit MPI communication to enable distributed-memory MPK. All performance claims (speed-ups up to 4x) rest on direct wall-clock timing measurements across a suite of sparse matrices and two hardware platforms. No equations, first-principles derivations, fitted parameters, or analytic predictions appear in the provided text. Consequently no self-definitional, fitted-input, or self-citation-load-bearing steps exist that could reduce the reported results to their own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or non-standard axioms are identifiable from the abstract; the method relies on existing RACE graph formulation and standard MPI primitives.

axioms (1)

domain assumption The graph formulation of SpMV permits level-based blocking whose cross-process data dependencies can be resolved by explicit communication without destroying cache reuse benefits.
This assumption enables the interleaving strategy described as the key technical contribution.

pith-pipeline@v0.9.0 · 5829 in / 1247 out tokens · 33865 ms · 2026-05-24T01:18:45.628213+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1]

, " * write output.state after.block = add.period write newline

ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

work page
[3]

Available: https://doi.org/10.1145/3399732

Alappat C, Basermann A, Bishop AR, Fehske H, Hager G, Schenk O, Thies J and Wellein G (2020 a ) A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication. ACM Trans. Parallel Comput. 7(3). doi:10.1145/3399732

work page doi:10.1145/3399732 2020
[4]

IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18

Alappat C, Hager G, Schenk O and Wellein G (2022) Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication. IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18. doi:10.1109/TPDS.2022.3223512

work page doi:10.1109/tpds.2022.3223512 2022
[5]

Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

Alappat C, Thies J, Hager G, Fehske H and Wellein G (2023) Algebraic temporal blocking for sparse iterative solvers on multi-core CPUs . arXiv:2309.02228, Submitted

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing

Alappat CL, Hofmann J, Hager G, Fehske H, Bishop AR and Wellein G (2020 b ) Understanding HPC benchmark performance on I ntel B roadwell and C ascade L ake processors. In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing. Cham: Springer International Publishing. ISBN 978-3-030-50743-5, pp. 412--433

work page 2020
[7]

Anderson PW (1958) Absence of diffusion in certain random lattices. Phys. Rev. 109: 1492--1505. doi:10.1103/PhysRev.109.1492

work page doi:10.1103/physrev.109.1492 1958
[8]

10 Almut Demel, Dominik Dürrschnabel, Tamara Mchedlidze, Marcel Radermacher, and Lasse Wulf

Davis TA and Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1). doi:10.1145/2049662.2049663

work page doi:10.1145/2049662.2049663 2011
[9]

In: 2008 IEEE International Symposium on Parallel and Distributed Processing

Demmel J, Hoemmen M, Mohiyuddin M and Yelick K (2008) Avoiding communication in sparse matrix computations. In: 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1--12. doi:10.1109/IPDPS.2008.4536305

work page doi:10.1109/ipdps.2008.4536305 2008
[10]

Physics Letters A 373(25): 2182--2188

Fehske H, Schleede J, Schubert G, Wellein G, Filinov VS and Bishop AR (2009) Numerical approaches to time evolution of complex quantum systems. Physics Letters A 373(25): 2182--2188. doi:10.1016/j.physleta.2009.04.022

work page doi:10.1016/j.physleta.2009.04.022 2009
[11]

Janarek J, Delande D, Cherroret N and Zakrzewski J (2020) Quantum boomerang effect for interacting particles. Phys. Rev. A 102: 013303. doi:10.1103/PhysRevA.102.013303

work page doi:10.1103/physreva.102.013303 2020
[12]

Janarek J, Gr\'emaud B, Zakrzewski J and Delande D (2022) Quantum boomerang effect in systems without time-reversal symmetry. Phys. Rev. B 105: L180202. doi:10.1103/PhysRevB.105.L180202

work page doi:10.1103/physrevb.105.l180202 2022
[13]

Karypis G and Kumar V (1998) METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices

work page 1998
[14]

A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,

Kreutzer M, Hager G, Wellein G, Fehske H and Bishop AR (2014) A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36(5): C401--C423. doi:10.1137/130930352

work page doi:10.1137/130930352 2014
[15]

In: 2019 Computing in Cardiology (CinC)

Langguth J, Arevalo H, Hustad KG and Cai X (2019) Towards detailed real-time simulations of cardiac arrhythmia. In: 2019 Computing in Cardiology (CinC). pp. Page 1--Page 4. doi:10.22489/CinC.2019.301

work page doi:10.22489/cinc.2019.301 2019
[16]

IEEE Micro 35(4): 6--15

Langguth J, Sourouri M, Lines GT, Baden SB and Cai X (2015) Scalable heterogeneous CPU - GPU computations for unstructured tetrahedral meshes. IEEE Micro 35(4): 6--15. doi:10.1109/MM.2015.70

work page doi:10.1109/mm.2015.70 2015
[17]

In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP)

Loe JA, Thornquist HK and Boman EG (2020) Polynomial preconditioned gmres in trilinos: Practical considerations for high-performance computing. In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP). SIAM, pp. 35--45. doi:10.1137/1.9781611976137.4

work page doi:10.1137/1.9781611976137.4 2020
[18]

Statistics and Computing 17: 395--416

Luxburg U (2004) A tutorial on spectral clustering. Statistics and Computing 17: 395--416. doi:10.1007/s11222-007-9033-z

work page doi:10.1007/s11222-007-9033-z 2004
[19]

Journal of Machine Learning Research 17(148): 1--5

McQueen J, Meil a M, VanderPlas J and Zhang Z (2016) Megaman: Scalable manifold learning in P ython. Journal of Machine Learning Research 17(148): 1--5. ://jmlr.org/papers/v17/16-109.html

work page 2016
[20]

In: 32th USENIX Security Symposium (USENIX Security 2023)

Moghimi D (2023) Downfall : Exploiting speculative data gathering. In: 32th USENIX Security Symposium (USENIX Security 2023)

work page 2023
[21]

Minimizing communication in sparse matrix solvers,

Mohiyuddin M, Hoemmen M, Demmel J and Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09. New York, NY, USA: Association for Computing Machinery. ISBN 9781605587448, pp. 1--12. doi:10.1145/1654059.1654096

work page doi:10.1145/1654059.1654096 2009
[22]

Prat T, Delande D and Cherroret N (2019) Quantum boomeranglike effect of wave packets in random media. Phys. Rev. A 99: 023629. doi:10.1103/PhysRevA.99.023629

work page doi:10.1103/physreva.99.023629 2019
[23]

In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18

Simpson T, Pasadakis D, Kourounis D, Fujita K, Yamaguchi T, Ichimura T and Schenk O (2018) Balanced graph partition refinement using the graph p-laplacian. In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18. New York, NY, USA: Association for Computing Machinery. ISBN 9781450358910, pp. 1--11. doi:10.1145/3218176.3218232

work page doi:10.1145/3218176.3218232 2018
[24]

The Journal of Chemical Physics 81(9): 3967--3971

Tal‐Ezer H and Kosloff R (1984) An accurate and efficient scheme for propagating the time dependent S chr \"o dinger equation. The Journal of Chemical Physics 81(9): 3967--3971. doi:10.1063/1.448136

work page doi:10.1063/1.448136 1984
[25]

Treibig, G

Treibig J, Hager G and Wellein G (2010) LIKWID : A lightweight performance-oriented tool suite for x86 multicore environments. In: 2010 39th International Conference on Parallel Processing Workshops. pp. 207--216. doi:10.1109/ICPPW.2010.38

work page doi:10.1109/icppw.2010.38 2010
[26]

In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020

Vatai E, Singhal U and Suda R (2020) Diamond matrix powers kernels. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020. New York, NY, USA: Association for Computing Machinery. ISBN 9781450372367, p. 102–113. doi:10.1145/3368474.3368494

work page doi:10.1145/3368474.3368494 2020
[27]

PhD Thesis

Vuduc RW and Demmel JW (2003) Automatic performance tuning of sparse matrix kernels. PhD Thesis. AAI3121741

work page 2003
[28]

Williams S, Waterman A and Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. doi:10.1145/1498765.1498785

work page doi:10.1145/1498765.1498785 2009
[29]

In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium

Yamazaki I, Anzt H, Tomov S, Hoemmen M and Dongarra J (2014 a ) Improving the performance of CA - GMRES on multicores with multiple GPUs . In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium. pp. 382--391. doi:10.1109/IPDPS.2014.48

work page doi:10.1109/ipdps.2014.48 2014
[30]

In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Yamazaki I, Rajamanickam S, Boman EG, Hoemmen M, Heroux MA and Tomov S (2014 b ) Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU / GPU cluster. In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 933--944. doi:10.1109/SC.2014.81

work page doi:10.1109/sc.2014.81 2014
[31]

Zambetaki I, Li Q, Economou EN and Soukoulis CM (1997) Localization in weakly coupled planes and weakly coupled wires. Phys. Rev. B 56: 12221--12231. doi:10.1103/PhysRevB.56.12221

work page doi:10.1103/physrevb.56.12221 1997

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

work page

[3] [3]

Available: https://doi.org/10.1145/3399732

Alappat C, Basermann A, Bishop AR, Fehske H, Hager G, Schenk O, Thies J and Wellein G (2020 a ) A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication. ACM Trans. Parallel Comput. 7(3). doi:10.1145/3399732

work page doi:10.1145/3399732 2020

[4] [4]

IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18

Alappat C, Hager G, Schenk O and Wellein G (2022) Level-based blocking for sparse matrices: Sparse matrix-power-vector multiplication. IEEE Transactions on Parallel and Distributed Systems 34(2): 1--18. doi:10.1109/TPDS.2022.3223512

work page doi:10.1109/tpds.2022.3223512 2022

[5] [5]

Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs

Alappat C, Thies J, Hager G, Fehske H and Wellein G (2023) Algebraic temporal blocking for sparse iterative solvers on multi-core CPUs . arXiv:2309.02228, Submitted

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing

Alappat CL, Hofmann J, Hager G, Fehske H, Bishop AR and Wellein G (2020 b ) Understanding HPC benchmark performance on I ntel B roadwell and C ascade L ake processors. In: Sadayappan P, Chamberlain BL, Juckeland G and Ltaief H (eds.) High Performance Computing. Cham: Springer International Publishing. ISBN 978-3-030-50743-5, pp. 412--433

work page 2020

[7] [7]

Anderson PW (1958) Absence of diffusion in certain random lattices. Phys. Rev. 109: 1492--1505. doi:10.1103/PhysRev.109.1492

work page doi:10.1103/physrev.109.1492 1958

[8] [8]

10 Almut Demel, Dominik Dürrschnabel, Tamara Mchedlidze, Marcel Radermacher, and Lasse Wulf

Davis TA and Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1). doi:10.1145/2049662.2049663

work page doi:10.1145/2049662.2049663 2011

[9] [9]

In: 2008 IEEE International Symposium on Parallel and Distributed Processing

Demmel J, Hoemmen M, Mohiyuddin M and Yelick K (2008) Avoiding communication in sparse matrix computations. In: 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1--12. doi:10.1109/IPDPS.2008.4536305

work page doi:10.1109/ipdps.2008.4536305 2008

[10] [10]

Physics Letters A 373(25): 2182--2188

Fehske H, Schleede J, Schubert G, Wellein G, Filinov VS and Bishop AR (2009) Numerical approaches to time evolution of complex quantum systems. Physics Letters A 373(25): 2182--2188. doi:10.1016/j.physleta.2009.04.022

work page doi:10.1016/j.physleta.2009.04.022 2009

[11] [11]

Janarek J, Delande D, Cherroret N and Zakrzewski J (2020) Quantum boomerang effect for interacting particles. Phys. Rev. A 102: 013303. doi:10.1103/PhysRevA.102.013303

work page doi:10.1103/physreva.102.013303 2020

[12] [12]

Janarek J, Gr\'emaud B, Zakrzewski J and Delande D (2022) Quantum boomerang effect in systems without time-reversal symmetry. Phys. Rev. B 105: L180202. doi:10.1103/PhysRevB.105.L180202

work page doi:10.1103/physrevb.105.l180202 2022

[13] [13]

Karypis G and Kumar V (1998) METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices

work page 1998

[14] [14]

A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,

Kreutzer M, Hager G, Wellein G, Fehske H and Bishop AR (2014) A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36(5): C401--C423. doi:10.1137/130930352

work page doi:10.1137/130930352 2014

[15] [15]

In: 2019 Computing in Cardiology (CinC)

Langguth J, Arevalo H, Hustad KG and Cai X (2019) Towards detailed real-time simulations of cardiac arrhythmia. In: 2019 Computing in Cardiology (CinC). pp. Page 1--Page 4. doi:10.22489/CinC.2019.301

work page doi:10.22489/cinc.2019.301 2019

[16] [16]

IEEE Micro 35(4): 6--15

Langguth J, Sourouri M, Lines GT, Baden SB and Cai X (2015) Scalable heterogeneous CPU - GPU computations for unstructured tetrahedral meshes. IEEE Micro 35(4): 6--15. doi:10.1109/MM.2015.70

work page doi:10.1109/mm.2015.70 2015

[17] [17]

In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP)

Loe JA, Thornquist HK and Boman EG (2020) Polynomial preconditioned gmres in trilinos: Practical considerations for high-performance computing. In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP). SIAM, pp. 35--45. doi:10.1137/1.9781611976137.4

work page doi:10.1137/1.9781611976137.4 2020

[18] [18]

Statistics and Computing 17: 395--416

Luxburg U (2004) A tutorial on spectral clustering. Statistics and Computing 17: 395--416. doi:10.1007/s11222-007-9033-z

work page doi:10.1007/s11222-007-9033-z 2004

[19] [19]

Journal of Machine Learning Research 17(148): 1--5

McQueen J, Meil a M, VanderPlas J and Zhang Z (2016) Megaman: Scalable manifold learning in P ython. Journal of Machine Learning Research 17(148): 1--5. ://jmlr.org/papers/v17/16-109.html

work page 2016

[20] [20]

In: 32th USENIX Security Symposium (USENIX Security 2023)

Moghimi D (2023) Downfall : Exploiting speculative data gathering. In: 32th USENIX Security Symposium (USENIX Security 2023)

work page 2023

[21] [21]

Minimizing communication in sparse matrix solvers,

Mohiyuddin M, Hoemmen M, Demmel J and Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09. New York, NY, USA: Association for Computing Machinery. ISBN 9781605587448, pp. 1--12. doi:10.1145/1654059.1654096

work page doi:10.1145/1654059.1654096 2009

[22] [22]

Prat T, Delande D and Cherroret N (2019) Quantum boomeranglike effect of wave packets in random media. Phys. Rev. A 99: 023629. doi:10.1103/PhysRevA.99.023629

work page doi:10.1103/physreva.99.023629 2019

[23] [23]

In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18

Simpson T, Pasadakis D, Kourounis D, Fujita K, Yamaguchi T, Ichimura T and Schenk O (2018) Balanced graph partition refinement using the graph p-laplacian. In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC '18. New York, NY, USA: Association for Computing Machinery. ISBN 9781450358910, pp. 1--11. doi:10.1145/3218176.3218232

work page doi:10.1145/3218176.3218232 2018

[24] [24]

The Journal of Chemical Physics 81(9): 3967--3971

Tal‐Ezer H and Kosloff R (1984) An accurate and efficient scheme for propagating the time dependent S chr \"o dinger equation. The Journal of Chemical Physics 81(9): 3967--3971. doi:10.1063/1.448136

work page doi:10.1063/1.448136 1984

[25] [25]

Treibig, G

Treibig J, Hager G and Wellein G (2010) LIKWID : A lightweight performance-oriented tool suite for x86 multicore environments. In: 2010 39th International Conference on Parallel Processing Workshops. pp. 207--216. doi:10.1109/ICPPW.2010.38

work page doi:10.1109/icppw.2010.38 2010

[26] [26]

In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020

Vatai E, Singhal U and Suda R (2020) Diamond matrix powers kernels. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020. New York, NY, USA: Association for Computing Machinery. ISBN 9781450372367, p. 102–113. doi:10.1145/3368474.3368494

work page doi:10.1145/3368474.3368494 2020

[27] [27]

PhD Thesis

Vuduc RW and Demmel JW (2003) Automatic performance tuning of sparse matrix kernels. PhD Thesis. AAI3121741

work page 2003

[28] [28]

Williams S, Waterman A and Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. doi:10.1145/1498765.1498785

work page doi:10.1145/1498765.1498785 2009

[29] [29]

In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium

Yamazaki I, Anzt H, Tomov S, Hoemmen M and Dongarra J (2014 a ) Improving the performance of CA - GMRES on multicores with multiple GPUs . In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium. pp. 382--391. doi:10.1109/IPDPS.2014.48

work page doi:10.1109/ipdps.2014.48 2014

[30] [30]

In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Yamazaki I, Rajamanickam S, Boman EG, Hoemmen M, Heroux MA and Tomov S (2014 b ) Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU / GPU cluster. In: SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 933--944. doi:10.1109/SC.2014.81

work page doi:10.1109/sc.2014.81 2014

[31] [31]

Zambetaki I, Li Q, Economou EN and Soukoulis CM (1997) Localization in weakly coupled planes and weakly coupled wires. Phys. Rev. B 56: 12221--12231. doi:10.1103/PhysRevB.56.12221

work page doi:10.1103/physrevb.56.12221 1997