Parallel Sparse and Data-Sparse Factorization-based Linear Solvers

Xiaoye Sherry Li; Yang Liu

arXiv: 2602.14289 · v2 · pith:LQTMVLV6new · submitted 2026-02-15 · cs.MS · cs.DC · cs.NA · math.NA

Pith reviewed 2026-05-25 07:02 UTC · model grok-4.3

keywords: sparse direct solvers · parallel linear solvers · low-rank compression · communication reduction · hierarchical matrices · factorization · scalable solvers · heterogeneous computing

The pith

Direct solvers remain essential for robustly solving large-scale linear systems, and recent advances in parallel communication reduction and low-rank compression are what keep them scalable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews recent advances in sparse direct solvers for large-scale, ill-conditioned algebraic equations needed in multiphysics simulations, machine learning, and data science. It examines progress along two axes: reducing communication and latency costs in task- and data-parallel settings, and lowering computational complexity through low-rank and hierarchical matrix compression. Direct solvers are highlighted for their robustness and accuracy as key building blocks in scalable solver toolchains. The review covers algorithmic principles along with parallelization challenges and practices for achieving high speed and reliability on modern heterogeneous machines.

Core claim

Because of their robustness and accuracy, direct solvers are crucial components in building a scalable solver toolchain, and the key recent advances worth highlighting are techniques for communication reduction and low-rank compression in parallel sparse direct solvers.

What carries the argument

Sparse direct solvers that combine task- and data-parallel communication reduction with low-rank and hierarchical matrix algebra compression to handle factorization of large ill-conditioned systems.
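On the data-parallel side, the paper's Figure 3 shows a logical 3D process grid, with 18 processes arranged as 3x3x2. A minimal sketch of how a linear process rank might map onto such a grid follows. The rank ordering (fill one 2D layer row by row, then move to the next layer) and the function names `rank_to_coords` and `layer_ranks` are assumptions made for illustration, not the layout or API of any particular solver.

```python
def rank_to_coords(rank, px, py, pz):
    """Map a linear process rank to (row, col, layer) in a px x py x pz grid.

    Assumed ordering (hypothetical): ranks fill one 2D layer row by row,
    then continue in the next layer.
    """
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank outside the process grid")
    layer, within = divmod(rank, px * py)
    row, col = divmod(within, py)
    return row, col, layer


def layer_ranks(layer, px, py):
    """Return the ranks that form the 2D process grid of one layer."""
    start = layer * px * py
    return list(range(start, start + px * py))


if __name__ == "__main__":
    # 18 processes arranged as a 3x3x2 grid, as in the Figure 3 caption.
    for rank in (0, 10, 17):
        print(rank, rank_to_coords(rank, 3, 3, 2))
```

Under this assumed ordering, each layer is an ordinary 2D process grid, so independent subproblems can be assigned to different layers while existing 2D routines run within a layer.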

If this is right

Direct solvers can solve ill-conditioned and indefinite equations more efficiently in parallel environments.
Scalable solver toolchains become feasible for applications in multiphysics, machine learning, and data science.
High speed and reliability are delivered on heterogeneous parallel machines through targeted parallelization practices.
Computational complexity drops via low-rank approximations, with the accuracy of the factorization governed by the chosen compression tolerance.
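The storage side of the low-rank argument can be illustrated with a small count for a HODLR-style matrix, one of the hierarchical formats named in the paper's Figure 6 caption. This is a sketch under stated assumptions: every off-diagonal block is taken to have the same fixed rank, diagonal blocks are split in half until they reach a leaf size, and the function name `hodlr_storage` is hypothetical.

```python
def hodlr_storage(n, rank, leaf):
    """Count stored entries of an n x n HODLR-style matrix.

    Assumptions (illustrative only): each off-diagonal block is kept as
    U @ V.T with a fixed rank, and diagonal blocks are split in half
    recursively until their size is at most `leaf`, where they are dense.
    """
    if n <= leaf:
        return n * n
    half = n // 2
    rest = n - half
    # Two off-diagonal blocks, each storing U (rows x rank) and V (cols x rank).
    off_diagonal = 2 * rank * (half + rest)
    return (hodlr_storage(half, rank, leaf)
            + hodlr_storage(rest, rank, leaf)
            + off_diagonal)


if __name__ == "__main__":
    n, rank, leaf = 1024, 8, 64
    dense = n * n
    compressed = hodlr_storage(n, rank, leaf)
    print(dense, compressed, dense / compressed)
```

For n = 1024, rank 8, and leaf size 64, the count is 131072 entries against 1048576 for the dense matrix. In practice the ranks depend on the problem and on the compression tolerance, so the actual savings vary.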

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These advances may expand the range of problems where direct solvers are preferred over iterative alternatives due to guaranteed robustness.
Implementation details from the review could guide development of hybrid solver libraries that mix direct and other methods.
The techniques suggest potential extensions to time-dependent or nonlinear problems where repeated factorizations occur.

Load-bearing premise

The reviewed techniques in communication reduction and low-rank compression constitute the key recent advances worth highlighting for parallel sparse direct solvers.

What would settle it

Demonstration on large ill-conditioned systems that communication reduction or low-rank compression techniques fail to improve scalability, accuracy, or reliability of direct solvers compared to prior methods.

Figures

Figures reproduced from arXiv: 2602.14289 by Xiaoye Sherry Li, Yang Liu.

**Figure 2.** Illustration of level set in a lower triangular SpTRSV,

**Figure 3.** The view of the logical 3D process grid and an example of 18 processes arranged as a 3x3x2 process grid.

**Figure 4.** Two-level etree partition and the matrix view of the submatrix mapping to four 2D process grids.

**Figure 5.** Asymptotic per process communication volumes given in Tables 1 and 2, with different

**Figure 6.** Illustration of several types of hierarchical matrices (HODLR,

**Figure 7.** [44, figs. 1 and 4] (a) Block partitioning of a hierarchical matrix for a 3D problem of size

**Figure 8.** [44, fig. 5] The time of the CPU and GPU implementations of the construction algorithm [44] for the kernel and

**Figure 9.** [43, figs. 1 and 4-12] Illustration of the factorization algorithm of [43] on the

**Figure 10.** [43, figs. 13 and 16] The factorization time, memory and solve time of the shared-memory parallel

**Figure 11.** For a hierarchical matrix, the most common parallel layout is perhaps the 1D block layout of

**Figure 11.** (Top) Parallel process layouts for distributing a hierarchical matrix on 8 MPI processes: (a) 1D block layout

**Figure 12.** Strong scaling (on distributed-memory systems) of various phases of several hierarchical matrix algorithms:

**Figure 13.** [60, fig. 5] The assembly tree of rank-structured multifrontal solvers using different compression algorithms.

**Figure 14.** [60, figs. 5 and 8] Characteristics of the factorization phase including flops, CPU time and memory for a

**Figure 15.** (a) [60, fig. 6] Strong scaling of the parallel HSS-multifrontal solver of STRUMPACK. (b) [59, fig. 1]
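The level-set idea named in the Figure 2 caption (a sparse lower triangular solve, SpTRSV) can be sketched in a few lines. The sketch assumes a row-wise dictionary storage and a small hand-made matrix; the names `level_sets` and `solve_by_levels` are hypothetical, and the code runs sequentially, only marking where a parallel implementation could process a whole level at once.

```python
def level_sets(rows):
    """Group unknowns of a sparse lower triangular system into level sets.

    rows[i] maps column index j (j <= i) to the nonzero L[i][j].
    Unknown i is placed one level after the deepest unknown it depends on.
    """
    n = len(rows)
    level = [0] * n
    for i in range(n):
        deps = [j for j in rows[i] if j < i]
        if deps:
            level[i] = 1 + max(level[j] for j in deps)
    sets = [[] for _ in range(max(level) + 1)] if n else []
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def solve_by_levels(rows, b):
    """Solve L x = b, visiting unknowns level by level."""
    x = [0.0] * len(b)
    for group in level_sets(rows):
        # Unknowns in one level depend only on earlier levels,
        # so this inner loop is the part that could run in parallel.
        for i in group:
            s = sum(v * x[j] for j, v in rows[i].items() if j < i)
            x[i] = (b[i] - s) / rows[i][i]
    return x


if __name__ == "__main__":
    L = [{0: 2.0}, {1: 1.0}, {0: 1.0, 2: 4.0},
         {1: 2.0, 3: 2.0}, {2: 1.0, 3: 1.0, 4: 1.0}]
    print(level_sets(L))
    print(solve_by_levels(L, [2.0, 3.0, 9.0, 10.0, 7.0]))
```

The number of levels bounds the number of synchronization points in such a scheme, which is one reason triangular solves with long dependency chains are hard to parallelize.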

Original abstract

Efficient solutions of large-scale, ill-conditioned and indefinite algebraic equations are ubiquitously needed in numerous computational fields, including multiphysics simulations, machine learning, and data science. Because of their robustness and accuracy, direct solvers are crucial components in building a scalable solver toolchain. In this chapter, we will review recent advances of sparse direct solvers along two axes: 1) reducing communication and latency costs in both task- and data-parallel settings, and 2) reducing computational complexity via low-rank and other compression techniques such as hierarchical matrix algebra. In addition to algorithmic principles, we also illustrate the key parallelization challenges and best practices to deliver high speed and reliability on modern heterogeneous parallel machines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competent review chapter on parallel sparse direct solvers with no new contributions.

Hi there,

The punchline on this one is that it's a survey chapter, not a paper with new algorithms or proofs. The authors review recent work on making sparse direct solvers more efficient in parallel environments by reducing communication costs and by applying low-rank approximations to lower the overall complexity.

It does well in describing the key ideas in both areas and in pointing out the implementation hurdles for current hardware. The discussion of best practices for high speed and reliability is the most practical part. It also distinguishes between task- and data-parallel settings for communication reduction and gives examples like hierarchical matrix algebra for compression.

The soft spots are limited. Since it's a review, the main risk is incomplete coverage or outdated summaries, but the abstract suggests it sticks to established advances without overreaching. The claim about direct solvers being crucial for robustness is not controversial and aligns with common knowledge in the field. There is no evidence of poor citation patterns or internal inconsistencies in what's described.

As for the audience: people who need to choose or implement linear solvers in large codes, especially those dealing with ill-conditioned systems in simulations or data science. It can save time by collecting the main ideas in one place.

I would send it out for peer review if the venue accepts review chapters, as it appears to be a competent synthesis that organizes useful knowledge.

Cheers,

Referee Report

0 major / 2 minor

Summary. This manuscript is a survey chapter reviewing recent advances in sparse direct solvers for large-scale, ill-conditioned, and indefinite linear systems arising in multiphysics simulations, machine learning, and data science. It organizes the review along two axes—reducing communication and latency costs in task- and data-parallel settings, and reducing computational complexity via low-rank and hierarchical matrix compression techniques—while also covering parallelization challenges and best practices for heterogeneous machines. The central observation is that direct solvers remain crucial for robustness and accuracy in scalable solver toolchains.

Significance. If the coverage is balanced and up-to-date, the chapter could serve as a useful reference for practitioners needing robust direct methods. No new theorems, algorithms, empirical results, machine-checked proofs, or reproducible code are presented; significance therefore rests entirely on the quality and representativeness of the literature synthesis rather than on any novel technical contribution.

minor comments (2)

The abstract states that the two axes constitute 'the key recent advances' but provides no explicit selection criteria or discussion of scope limitations; a short paragraph in the introduction justifying the focus relative to other directions (e.g., hybrid direct-iterative methods) would improve transparency.
Because the work is a survey rather than a research article, the absence of any tables summarizing complexity, communication volume, or software availability for the reviewed packages is a missed opportunity for clarity; adding such a summary table would aid readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation to accept. The assessment accurately captures the manuscript's scope as a literature synthesis on communication reduction and data-sparse techniques in sparse direct solvers.

Circularity Check

0 steps flagged

No significant circularity in survey paper

This is a survey chapter reviewing existing advances in sparse direct solvers along axes of communication reduction and low-rank compression. No new theorems, derivations, predictions, fitted parameters, or empirical results are asserted. The strongest claim (direct solvers' robustness) is a standard observation, and the weakest assumption (editorial scope of reviewed techniques) is not a falsifiable premise internal to any derivation chain. No load-bearing steps reduce to self-definition, fitted inputs, or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review paper; no new free parameters, axioms, or invented entities are introduced by the authors.

pith-pipeline@v0.9.0 · 5644 in / 879 out tokens · 32603 ms · 2026-05-25T07:02:27.318358+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "review recent advances of sparse direct solvers along two axes: 1) reducing communication and latency costs... 2) reducing computational complexity via low-rank and other compression techniques such as hierarchical matrix algebra"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Paper passage: "3D CA algorithm framework... etree... separator tree"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

231 extracted references · 231 canonical work pages · 6 internal anchors

[1] Abdelfattah, A., Beams, N., Carson, R., Ghysels, P., Kolev, T., Stitt, T., Vargas, A., Tomov, S., Dongarra, J.: MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures. The International Journal of High Performance Computing Applications 38(5), 468–490 (2024). doi:10.1177/10943420241261960

[2] Abdelfattah, A., Ghysels, P., Boukaram, W., Tomov, S., Li, X.S., Dongarra, J.: Addressing irregular patterns of matrix computations on GPUs and their impact on applications powered by sparse direct solvers. In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. IEEE (2022). doi:10.1109/SC41404.2022.00031

[3] Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. J. Phys.: Conf. Ser. 180(1) (2009)

[4] AHMED. https://www.wr.uni-bayreuth.de/en/software/ahmed/index.html

[5] Al-Harthi, N., Alomairy, R., Akbudak, K., Chen, R., Ltaief, H., Bagci, H., Keyes, D.: Solving acoustic boundary integral equations using high performance tile low-rank LU factorization. In: International Conference on High Performance Computing, pp. 209–229. Springer (2020)

[6] Alger, N.V.: Data-scalable Hessian preconditioning for distributed parameter PDE-constrained inverse problems. The University of Texas at Austin (2019)

[7] Aliaga, J.I., Carratalá-Sáez, R., Kriemann, R., Quintana-Ortí, E.S.: Task-parallel LU factorization of hierarchical matrices using OmpSs. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1148–1157. IEEE (2017)

[8] Ambartsumyan, I., Boukaram, W., Bui-Thanh, T., Ghattas, O., Keyes, D., Stadler, G., Turkiyyah, G., Zampini, S.: Hierarchical matrix approximations of Hessians arising in inverse problems governed by PDEs. SIAM Journal on Scientific Computing 42(5), A3397–A3426 (2020)

[9] Ambikasaran, S., Darve, E.: An O(N log N) fast direct solver for partial hierarchically semi-separable matrices: with application to radial basis function interpolation. Journal of Scientific Computing 57(3), 477–501 (2013)

[10] Ambikasaran, S., Darve, E.: The inverse fast multipole method. arXiv preprint arXiv:1407.1572 (2014)

[11] Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D.W., O’Neil, M.: Fast direct methods for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2), 252–265 (2015)

[12] Ambikasaran, S., O’Neil, M., Singh, K.R.: Fast symmetric factorization of hierarchical matrices with applications. arXiv preprint arXiv:1405.0223 (2014)

[13] Amestoy, P., Ashcraft, C., Boiteau, O., Buttari, A., L’Excellent, J.Y., Weisbecker, C.: Improving multifrontal methods by means of block low-rank representations. SIAM J. Sci. Comput. 37(3), A1451–A1474 (2015)

[14] Amestoy, P., Buttari, A., L’Excellent, J.Y., Mary, T.: On the complexity of the block low-rank multifrontal factorization. SIAM Journal on Scientific Computing 39(4), A1710–A1740 (2017)

[15] Amestoy, P.R., Buttari, A., L’Excellent, J.Y., Mary, T.: Performance and scalability of the block low-rank multifrontal factorization on multicore architectures. ACM Trans. Math. Softw. 45(1) (2019). doi:10.1145/3242094

[16] Amestoy, P.R., Buttari, A., L’Excellent, J.Y., Mary, T.A.: Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel block low-rank format. SIAM J. Sci. Comput. 41(3), A1414–A1442 (2019). doi:10.1137/18M1182760

[17] Amestoy, P.R., Duff, I.S., L’Excellent, J.Y., Li, X.S.: Analysis and comparison of two general sparse solvers for distributed memory computers. ACM Trans. Math. Softw. 27(4), 388–421 (2001). doi:10.1145/504210.504212. URL https://doi.org/10.1145/504210.504212

[18] Amestoy, P.R., Duff, I.S., L’Excellent, J.Y., Koster, J.: MUMPS: a general purpose distributed memory sparse solver. In: International Workshop on Applied Parallel Computing, pp. 121–130. Springer (2000)

[19] Amestoy, P.R., Duff, I.S., L’Excellent, J.Y., Koster, J.: A fully asynchronous multi-frontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Anal. Appl. 23, 15–41 (2001). doi:10.1137/S0895479899358194

[20] Aminfar, A., Ambikasaran, S., Darve, E.: A fast block low-rank dense solver with applications to finite-element matrices. J. Comput. Phys. 304, 170–188 (2016)

[21] Anderson, E., Saad, Y.: Solving sparse triangular linear systems on parallel computers. Int. J. High Speed Comput. 1(1), 73–95 (1989). doi:10.1142/S0129053389000056. URL https://doi.org/10.1142/S0129053389000056

[22] Angleitner, N., Faustmann, M., Melenk, J.M.: H-inverses for RBF interpolation. Advances in Computational Mathematics 49, 1–46 (2021). URL https://api.semanticscholar.org/CorpusID:237540970

[23] Anzt, H., Cojean, T., Flegar, G., Göbel, F., Grützmacher, T., Nayak, P., Ribizel, T., Tsai, Y.M., Quintana-Ortí, E.S.: Ginkgo: a modern linear operator algebra framework for high performance computing. ACM Transactions on Mathematical Software 48(1), 2:1–2:33 (2022). doi:10.1145/3480935. URL https://doi.org/10.1145/3480935

[24] Ashcraft, C., Buttari, A., Mary, T.: Block low-rank matrices with shared bases: potential and limitations of the BLR² format. SIAM Journal on Matrix Analysis and Applications 42(2), 990–1010 (2021)

[25] ASKIT. https://padas.oden.utexas.edu/libaskit

[26] Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23(2), 187–198 (2011). doi:10.1002/cpe.1631. URL https://inria.hal.science/inria-00550877

[27] Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23, 1–155 (2014). doi:10.1017/S0962492914000038

[28] Ballard, G., Kolda, T.G.: Tensor decompositions for data science. Cambridge University Press (2025)

[29] Barnett, A., Wu, B., Veerapaneni, S.: Spectrally accurate quadratures for evaluation of layer potentials close to the boundary for the 2D Stokes and Laplace equations. SIAM Journal on Scientific Computing 37(4), B519–B542 (2015)

[30] Bebendorf, M.: Approximation of boundary element matrices. Numerische Mathematik 86(4), 565–589 (2000). doi:10.1007/PL00005410

[31] Bebendorf, M., Grzhibovskis, R.: Accelerating Galerkin BEM for linear elasticity using adaptive cross approximation. Mathematical Methods in the Applied Sciences 29(14), 1721–1747 (2006). doi:10.1002/mma.759

[32] Bebendorf, M., Hackbusch, W.: Existence of H-matrix approximants to the inverse FE-matrix of elliptic operators with L∞-coefficients. Numerische Mathematik 95(1), 1–28 (2003)

[33] Bebendorf, M., Venn, R.: Constructing nested bases approximations from the entries of non-local operators. Numerische Mathematik 121(4), 609–635 (2012)

[34] Belli, R., Hoefler, T.: Notified access: Extending remote memory access programming models for producer-consumer synchronization. In: IEEE International Parallel and Distributed Processing Symposium, pp. 871–881. IEEE (2015)

[35] Bendoraityte, J., Börm, S.: Distributed H²-matrices for non-local operators. Comput. Vis. Sci 11, 237–249 (2008)

[36] Bhatia, N., Vandana: Survey of nearest neighbor techniques (2010). URL https://arxiv.org/abs/1007.0085

[37] Birdsall, C.K., Langdon, A.B.: Plasma physics via computer simulation. CRC Press (2018)

[38] Blackford, L.S., Choi, J., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK users’ guide. SIAM, Philadelphia (1997). http://www.netlib.org/scalapack

[39] Börm, S., Grasedyck, L., Hackbusch, W.: Introduction to hierarchical matrices with application. Eng. Anal. Bound. Elem. 27, 405–422 (2003)

[40] Börm, S., Reimer, K.: Efficient arithmetic operations for rank-structured matrices based on hierarchical low-rank updates. Computing and Visualization in Science 16, 247–258 (2013). doi:10.1007/s00791-014-0236-7

[41] Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J.: DAGuE: A generic distributed DAG engine for High Performance Computing. Parallel Comput. 38(1–2), 37–51 (2012). doi:10.1016/j.parco.2011.10.003. URL https://icl.utk.edu/parsec/

[42] Boukaram, W., Hong, Y., Liu, Y., Shi, T., Li, X.S.: Batched sparse direct solver design and evaluation in SuperLU_DIST. The International Journal of High Performance Computing Applications 38(6), 585–598 (2024). doi:10.1177/10943420241268200

[43] Boukaram, W., Keyes, D., Li, S., Liu, Y., Turkiyyah, G.: Linear complexity H² direct solver for fine-grained parallel architectures. arXiv preprint arXiv:2509.11152 (2025)

[44] Boukaram, W., Liu, Y., Ghysels, P., Li, X.S.: Adaptive Sketching Based Construction of H2 Matrices on GPUs. In: The 26th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2025). IEEE (2025). Best Paper Award

[45] Boukaram, W.H., Turkiyyah, G., Keyes, D.: Hierarchical matrix operations on GPUs: Matrix-vector multiplication and compression. ACM Transactions on Mathematical Software 45(1), 1–28 (2019). doi:10.1145/3232850

[46] Bradley, A.M.: A Hybrid Multithreaded Direct Sparse Triangular Solver, pp. 13–22. doi:10.1137/1.9781611974690.ch2

[47] Börm, S.: Directional H²-matrix compression for high-frequency problems. Numer. Linear Algebra Appl. 24(6), e2112 (2017). doi:10.1002/nla.2112

[48] Cambier, L., Chen, C., Boman, E.G., Rajamanickam, S., Tuminaro, R.S., Darve, E.: An algebraic sparsified nested dissection algorithm using low-rank approximations. SIAM Journal on Matrix Analysis and Applications 41(2), 715–746 (2020)

[49] Candès, E., Demanet, L., Ying, L.: A fast butterfly algorithm for the computation of Fourier integral operators. SIAM Multiscale Model. Simul. 7(4), 1727–1750 (2009)

[50] Cao, Q., Pei, Y., Herault, T., Akbudak, K., Mikhalev, A., Bosilca, G., Ltaief, H., Keyes, D., Dongarra, J.: Performance analysis of tile low-rank Cholesky factorization using PARSEC instrumentation tools. In: 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), pp. 25–32. IEEE (2019)

[51] Chadwick, J.N., Bindel, D.S.: An efficient solver for sparse linear systems based on rank-structured Cholesky factorization. arXiv preprint arXiv:1507.05593 (2015)

[52] Chandrasekaran, S., Dewilde, P., Gu, M., Lyons, W., Pals, T.: A fast solver for HSS representations via sparse matrices. SIAM Journal on Matrix Analysis and Applications 29(1), 67–81 (2007)

[53] Chandrasekaran, S., Dewilde, P., Gu, M., Somasunderam, N.: On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs. SIAM Journal on Matrix Anal. Appl. 31, 2261–2290 (2010). doi:10.1137/090775932

[54] Chandrasekaran, S., Gu, M., Pals, T.: A fast ULV decomposition solver for hierarchically semiseparable representations. SIAM Journal on Matrix Analysis and Applications 28(3), 603–622 (2006)

[55] Chávez, G., Liu, Y., Ghysels, P., Li, X.S., Rebrova, E.: Scalable and memory-efficient kernel ridge regression. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 956–965. IEEE (2020)

[56] Chen, C., Martinsson, P.G.: Solving linear systems on a GPU with hierarchically off-diagonal low-rank approximations. In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE (2022)

[57] Chenhan, D.Y., March, W.B., Biros, G.: An n log n parallel fast direct solver for kernel matrices. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 886–896. IEEE (2017)

[58] Chenhan, D.Y., March, W.B., Xiao, B., Biros, G.: INV-ASKIT: a parallel fast direct solver for kernel matrices. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 161–171. IEEE (2016)

[59] Claus, L., Ghysels, P., Boukaram, W.H., Li, X.S.: A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

[60] Claus, L., Ghysels, P., Liu, Y., Nhan, T.A., Thirumalaisamy, R., Bhalla, A.P.S., Li, S.: Sparse approximate multifrontal factorization with composite compression methods. ACM Transactions on Mathematical Software 49(3), 1–28 (2023)

[61]

Applied and Computational Harmonic Analysis38(2), 284–317 (2015)

Corona, E., Martinsson, P.G., Zorin, D.: An𝑂(𝑁)direct solver for integral equations on the plane. Applied and Computational Harmonic Analysis38(2), 284–317 (2015)

work page 2015
[62]

Corona, E., Rahimian, A., Zorin, D.: A tensor-train accelerated solver for integral equations in complex geometries. J. Comput. Phys. 334, 145–169 (2017)

work page 2017
[63]

SIAM Journal on Scientific Computing39(3), A761–A796 (2017)

Coulier, P., Pouransari, H., Darve, E.: The inverse fast multipole method: using a fast approximate direct solver as a preconditioner for dense linear systems. SIAM Journal on Scientific Computing39(3), A761–A796 (2017)

work page 2017
[64]

3—an unsymmetric-pattern multifrontal method

Davis, T.A.: Algorithm 832: UMFPACK V4. 3—an unsymmetric-pattern multifrontal method. ACM Trans. Math. Softw.30(2), 196–199 (2004)

work page 2004
[65]

In: 2016 ieee international parallel and distributed processing symposium (ipdps), pp

Di, S., Cappello, F.: Fast error-bounded lossy HPC data compression with SZ. In: 2016 ieee international parallel and distributed processing symposium (ipdps), pp. 730–739. IEEE (2016) 36 Contents

work page 2016
[66]

DINFMM.https://github.com/Tianyu-Liang/DINFMM/tree/main

[67]

Ding, N., Liu, Y., Williams, S., Li, X.S.: A message-driven, multi-GPU parallel sparse triangular solver. In: Proceedings of the 2021 SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), pp. 147–159 (2021). doi:10.1137/1.9781611976830.14

[68]

Ding, N., Williams, S., Liu, Y., Li, X.S.: Leveraging one-sided communication for sparse triangular solvers. In: Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP), pp. 93–105 (2020). doi:10.1137/1.9781611976137.9

[69]

Duff, I.S.: MA57—a code for the solution of sparse symmetric definite and indefinite systems. ACM Trans. Math. Softw. 30(2), 118–144 (2004). doi:10.1145/992200.992202. URL https://doi.org/10.1145/992200.992202

[70]

Engquist, B., Ying, L.: Fast directional multilevel algorithms for oscillatory kernels. SIAM J. Sci. Comput. 29(4), 1710–1737 (2007)

[71]

Engquist, B., Zhao, H.: Approximate separability of the Green’s function of the Helmholtz equation in the high frequency limit. Communications on Pure and Applied Mathematics 71(11), 2220–2274 (2018)

[72]

Faustmann, M., Melenk, J., Praetorius, D.: Existence of H-matrix approximants to the inverses of BEM matrices: The simple-layer operator. Mathematics of Computation 85(297), 119–152 (2016)

[73]

Faustmann, M., Melenk, J.M., Praetorius, D.: A new proof for existence of H-matrix approximants to the inverse of FEM matrices: the Dirichlet problem for the Laplacian. In: Spectral and High Order Methods for Partial Differential Equations-ICOSAHOM 2012: Selected papers from the ICOSAHOM conference, June 25-29, 2012, Gammarth, Tunisia, pp. 249–259. Spring...

[74]

Faustmann, M., Melenk, J.M., Praetorius, D.: Existence of H-matrix approximants to the inverse of BEM matrices: the hyper-singular integral operator. IMA Journal of Numerical Analysis 37, 1211–1244 (2015). URL https://api.semanticscholar.org/CorpusID:116945940

[75]

Faustmann, M., Melenk, J.M., Praetorius, D.: H-matrix approximability of the inverses of FEM matrices. Numer. Math. 131(4), 615–642 (2015). doi:10.1007/s00211-015-0706-9. URL https://doi.org/10.1007/s00211-015-0706-9

[76]

Faverge, M., Pichon, G., Ramet, P., Roman, J.: On the use of H-matrix arithmetic in PaStiX: a preliminary study. In: Workshop on Fast Direct Solvers. Toulouse, France (2015). URL https://inria.hal.science/hal-01187882

[77]

Feliu-Fabà, J., Ho, K.L., Ying, L.: Recursively preconditioned hierarchical interpolative factorization for elliptic partial differential equations. Communications in Mathematical Sciences 18(1), 91–108 (2020)

[78]

Feliu-Fabà, J., Ying, L.: Approximate inversion of discrete Fourier integral operators. J. Comput. Phys. 446, 110654 (2021)

[79]

Feng, Y., Xiao, J., Gu, M.: Flip-flop spectrum-revealing QR factorization and its applications to singular value decomposition. Electron. Trans. Numer. Anal. 51, 469–494 (2019). doi:10.1553/etna_vol51s469

[80]

Fu, X., Zhang, B., Wang, T., Li, W., Lu, Y., Yi, E., Zhao, J., Geng, X., Li, F., Zhang, J., Jin, Z., Liu, W.: PanguLU: a scalable regular two-dimensional block-cyclic sparse direct solver on distributed heterogeneous systems. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23. Associati...

doi:10.1145/3581784.3607050 (2023)

