Performant Tridiagonal Factorization of Skew-Symmetric Matrices

Chao Yin; Devin A. Matthews; Ishna Satyarth; Maggie Myers; Robert van de Geijn; RuQing G. Xu

arxiv: 2411.09859 · v3 · submitted 2024-11-15 · 💻 cs.MS

Performant Tridiagonal Factorization of Skew-Symmetric Matrices

Ishna Satyarth , Chao Yin , Devin A. Matthews , Maggie Myers , Robert van de Geijn , RuQing G. Xu This is my paper

Pith reviewed 2026-05-23 17:53 UTC · model grok-4.3

classification 💻 cs.MS

keywords skew-symmetric matricestridiagonal factorizationLTL^T decompositionBLIS frameworkPfaffianFLAME methodologylinear algebra performancepivoting

0 comments

The pith

Skew-symmetric matrices admit an LTL^T factorization that runs substantially faster when level-2 and level-3 operations are fused with BLIS casting and blocking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops algorithms for the LTL^T factorization of skew-symmetric matrices, where L is unit lower triangular and T is tridiagonal. This operation supports efficient determinant computation via the Pfaffian of T and the solution of linear systems in domains such as quantum electronic structure and machine learning. Algorithms are derived systematically via the FLAME methodology and realized in parallel CPU code by fusing operations to cut memory traffic, using newly added BLIS capabilities for casting and cache blocking. A concise C++ API translates the derived algorithms into code. Experiments show the resulting implementations greatly exceed the performance of earlier methods while supporting pivoting for stability.

Core claim

The LTL^T decomposition of a skew-symmetric matrix X, with L unit lower triangular and T tridiagonal, can be computed with lower memory traffic by fusing level-2 and level-3 operations through BLIS casting, hierarchical parallelism, and cache blocking, producing parallel implementations that outperform prior work.

What carries the argument

FLAME-derived triangular tridiagonalization algorithms fused at multiple levels via BLIS gemm kernels and new casting support for level-2 and level-3 operations.

If this is right

Determinants of skew-symmetric matrices become the square of the Pfaffian of the resulting tridiagonal matrix T and can be obtained at lower cost.
Linear systems with skew-symmetric coefficient matrices can be solved more rapidly after the factorization.
Applications in quantum electronic structure and machine learning obtain faster runtimes for the same accuracy.
The prototype C++ API enables direct translation of additional correct-by-construction algorithms into executable code.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion strategy could be applied to factorizations of other structured matrices if analogous BLIS casting support is added.
Performance gains may compound when the resulting tridiagonal T is used inside larger iterative solvers or eigenvalue routines.
Adoption inside existing dense linear-algebra libraries would make the method available without requiring users to call a separate API.
The approach may extend to distributed-memory settings if the cache-blocking and casting layers are replicated across nodes.

Load-bearing premise

Fusing operations at multiple levels via new BLIS casting and cache-blocking capabilities will deliver the claimed speedups while preserving numerical stability when pivoting is required.

What would settle it

Benchmark the new implementation against prior codes on a suite of skew-symmetric matrices that require pivoting; if runtimes show no substantial improvement or residuals exceed expected bounds, the performance and stability claims are falsified.

read the original abstract

The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra, particularly in comparison to that of general and symmetric matrices. While some algorithms can be adapted from the symmetric case, the cost of algorithms can be reduced by exploiting skew-symmetry. This work examines the factorization of a skew-symmetric matrix $X$ into its $LTL^T$ decomposition, where $L$ is unit lower triangular and $T$ is tridiagonal. This is also known as a triangular tridiagonalization. This operation is a means for computing the determinant of $X$ as the square of the (cheaply-computed) Pfaffian of the skew-symmetric tridiagonal matrix $T$ as well as for solving systems of equations, across fields such as quantum electronic structure and machine learning. Its application also often requires pivoting in order to improve numerical stability. We compare and contrast previously-published algorithms with those systematically derived using the FLAME methodology. Performant parallel CPU implementations are achieved by fusing operations at multiple levels in order to reduce memory traffic overhead. A key factor is the employment of new capabilities of the BLAS-like Library Instantion Software (BLIS) framework, which now supports casting level-2 and level-3 BLAS-like operations by leveraging its gemm and other kernels, hierarchical parallelism, and cache blocking. A prototype, concise C++ API facilitates the translation of correct-by-construction algorithms into correct code. Experiments verify that the resulting implementations greatly exceed the performance of previous work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a practical FLAME-derived implementation for skew-symmetric tridiagonal factorization with BLIS speedups, but the pivoting performance needs explicit verification.

read the letter

The one thing to know is that this paper extends the FLAME approach to skew-symmetric tridiagonal factorization and uses new BLIS casting to get fused implementations that run faster than earlier codes. They systematically derive the algorithms for factoring a skew-symmetric matrix into LTL^T form. This lets you compute the Pfaffian cheaply for determinants and solve systems. The key engineering move is fusing operations at multiple levels with BLIS's updated support for level-2 and level-3 casting, hierarchical parallelism, and cache blocking. They also give a short C++ API to turn the correct-by-construction algorithms into code. This is solid work on the implementation side. The performance experiments back up the claim of exceeding previous algorithms, and the focus on reducing memory traffic makes sense for these dense operations. The soft spot is the handling of pivoting. The abstract says applications often need pivoting for stability, but it is not clear from the given information whether the reported timings include pivoted versions or if the fusion works the same way when pivoting is added. That matters because the FLAME correctness is noted for the base case. No detailed error analysis or benchmark tables are visible in the abstract, though the full paper might have them. This paper is for people who write or use high-performance linear algebra libraries, particularly those dealing with skew-symmetric matrices in quantum electronic structure or machine learning. A reader interested in BLIS extensions or FLAME derivations would get practical value from it. It is worth a serious referee because it delivers new code and performance data on a narrow but useful operation. I would send it to review with a note to clarify the pivoted case.

Referee Report

2 major / 2 minor

Summary. The paper claims that FLAME-derived algorithms for the LTL^T factorization of skew-symmetric matrices, when implemented with fused level-2/3 operations via new BLIS casting, cache-blocking, and hierarchical parallelism, yield CPU implementations that greatly exceed the performance of prior published algorithms while supporting pivoting for numerical stability; a concise C++ API is provided to translate the algorithms into code, with applications to determinant computation via the Pfaffian and linear systems in quantum electronic structure and machine learning.

Significance. If the performance claims hold when pivoting is included, the work would advance dense linear algebra for skew-symmetric matrices by reducing memory traffic through BLIS-enabled fusion and providing a systematic derivation route; the FLAME methodology supplies a strength in producing correct-by-construction algorithms, and the prototype API aids reproducibility.

major comments (2)

[Abstract / Implementation] Abstract and implementation description: the central performance claim rests on fused BLIS kernels, yet the abstract notes that applications 'often requires pivoting in order to improve numerical stability'; it is unclear whether the reported experiments integrate pivoting into the cache-blocked fused kernels or measure only the non-pivoted case, which directly affects whether the headline result supports the highlighted practical use-case.
[Experiments] Experiments: no visible error analysis, benchmark tables with pivoting enabled, or quantitative stability metrics are referenced, leaving the claim that the new BLIS capabilities 'deliver the claimed speedups while preserving numerical stability' without direct supporting data in the reported results.

minor comments (2)

[Abstract] The abstract's phrasing 'greatly exceed the performance of previous work' would be strengthened by citing specific speed-up factors or tables from the experiments section.
Notation for the LTL^T decomposition and Pfaffian computation could include a brief equation reference for readers unfamiliar with skew-symmetric factorizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments, which help clarify the scope of our performance claims and the need for supporting stability data. We address each major comment below and will revise the manuscript to strengthen the presentation of pivoting and numerical results.

read point-by-point responses

Referee: [Abstract / Implementation] Abstract and implementation description: the central performance claim rests on fused BLIS kernels, yet the abstract notes that applications 'often requires pivoting in order to improve numerical stability'; it is unclear whether the reported experiments integrate pivoting into the cache-blocked fused kernels or measure only the non-pivoted case, which directly affects whether the headline result supports the highlighted practical use-case.

Authors: The experiments isolate the performance of the fused level-2/3 BLIS kernels and cache-blocking on the core LTL^T factorization without pivoting, allowing direct comparison to prior non-pivoted implementations. The FLAME-derived algorithms and C++ API are structured to support pivoting (as stated in the abstract and introduction), with the additional permutation and scaling steps incurring only modest overhead that can be fused similarly. To ensure the headline results address the practical use-case, we will add performance tables and timing breakdowns with pivoting enabled in the revised version. revision: yes
Referee: [Experiments] Experiments: no visible error analysis, benchmark tables with pivoting enabled, or quantitative stability metrics are referenced, leaving the claim that the new BLIS capabilities 'deliver the claimed speedups while preserving numerical stability' without direct supporting data in the reported results.

Authors: The current experiments emphasize performance gains; stability is assumed to follow from standard partial pivoting strategies already validated in the literature for skew-symmetric tridiagonalization. We agree that explicit verification is needed. In revision we will add (i) residual-norm and backward-error tables comparing pivoted and non-pivoted runs against reference LAPACK-style implementations, (ii) forward-error metrics on the Pfaffian/determinant, and (iii) benchmark tables that include pivoting, thereby directly supporting the stability claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on external benchmarks and independent implementation.

full rationale

The paper presents FLAME-derived algorithms for LTL^T factorization of skew-symmetric matrices, implemented via BLIS fusion and cache blocking, then validates via timing experiments against prior published algorithms. No load-bearing equations, parameters, or uniqueness claims reduce by construction to fitted inputs or self-citations. The central result (performance improvement) is measured against external baselines rather than derived from the paper's own assumptions. Minor FLAME self-reference is not load-bearing for the empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on established linear-algebra theory for skew-symmetric matrices and the FLAME/BLIS software stack; no new free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)

domain assumption A skew-symmetric matrix admits an LTL^T factorization with L unit lower triangular and T tridiagonal (possibly after pivoting).
This is the core operation whose performant realization is the paper's goal; it is taken from standard dense linear algebra.

pith-pipeline@v0.9.0 · 5823 in / 1148 out tokens · 25251 ms · 2026-05-23T17:53:06.255729+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Proposed Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science (FAMLIES)
cs.MS 2026-04 unverdicted novelty 4.0

Proposes FAMLIES as a new flexible framework that vertically integrates linear and multi-linear algebra software stacks for high-performance computing on diverse architectures.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper

[1]

J. O. Aasen , On the reduction of a symmetric matrix to tridiagonal form , BIT, 11 (1971), pp. 223–242

work page 1971
[2]

J. A. Acosta, T. M. Low, and D. N. Parikh , Families of butterfly counting algorithms for bipartite graphs, in 2022 IEEE International Parallel and Distributed Processing Sympo- sium Workshops (IPDPSW), 2022, pp. 304–313, https://doi.org/10.1109/IPDPSW55747. 2022.00060

work page doi:10.1109/ipdpsw55747 2022
[3]

K. R. Apt and T. Hoare , Edsger Wybe Dijkstra: His Life,Work, and Legacy , vol. 45, Asso- ciation for Computing Machinery, New York, NY, USA, 1 ed., 2022

work page 2022
[4]

Bajdich and L

M. Bajdich and L. Mitas , Electronic structure quantum monte carlo , acta physica slovaca, 59 (2009), pp. 81–168

work page 2009
[5]

Ballard, D

G. Ballard, D. Becker, J. Demmel, J. Dongarra, A. Druinsky, I. Peled, O. Schwartz, S. Toledo, and I. Yamazaki, Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures, in 2013 IEEE 27th International Symposium on Par- allel and Distributed Processing, May 2013, pp. 895–907, https://doi.org/10.1109/IPDPS. 2013.98. ISSN:...

work page doi:10.1109/ipdps 2013
[6]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, Communication-optimal parallel and sequential cholesky decomposition: extended abstract , in Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, New York, NY, USA, 2009, Association for Computing Machinery, p. 245–252, https://doi.org/10. 1145/1583991.1584054

work page arXiv 2009
[7]

Bientinesi, E

P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ort ´ı, and R. A. van de Geijn, The science of deriving dense linear algebra algorithms , ACM Trans. Math. Soft., 31 (2005), pp. 1–26, http://doi.acm.org/10.1145/1055531.1055532

work page doi:10.1145/1055531.1055532 2005
[9]

Bientinesi, B

P. Bientinesi, B. Gunter, and R. A. van de Geijn , Families of algorithms related to the inversion of a symmetric positive definite matrix , ACM Trans. Math. Softw., 35 (2008), pp. 3:1–3:22, http://doi.acm.org/10.1145/1377603.1377606

work page doi:10.1145/1377603.1377606 2008
[10]

Bientinesi, E

P. Bientinesi, E. S. Quintana-Ort ´ı, and R. A. van de Geijn , Representing linear algebra algorithms in code: The FLAME application programming interfaces , ACM Trans. Math. Soft., 31 (2005), pp. 27–59, http://doi.acm.org/10.1145/1055531.1055533

work page doi:10.1145/1055531.1055533 2005
[11]

Dagum and R

L. Dagum and R. Menon , OpenMP: an industry standard api for shared-memory program- ming, IEEE computational science and engineering, 5 (1998), pp. 46–55

work page 1998
[12]

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff , A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16 (1990), pp. 1–17

work page 1990
[13]

Eijkhout, P

V. Eijkhout, P. Bientinesi, and R. van de Geijn , Towards mechanical derivation of krylov solver libraries, Procedia Computer Science, 1 (2010), pp. 1805 – 1813, https://doi.org/10. 1016/j.procs.2010.04.202. ICCS 2010

work page 2010
[14]

Guennebaud, B

G. Guennebaud, B. Jacob, et al. , Eigen v3. http://eigen.tuxfamily.org, 2010

work page 2010
[15]

J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn , FLAME: Formal Linear Algebra Methods Environment , ACM Trans. Math. Soft., 27 (2001), pp. 422–455, http://doi.acm.org/10.1145/504210.504213

work page doi:10.1145/504210.504213 2001
[16]

J. A. Gunnels, G. M. Henry, and R. A. van de Geijn , Formal Linear Algebra Methods Environment (FLAME): Overview, FLAME Working Note #1 CS-TR-00-28, Department of Computer Sciences, The University of Texas at Austin, Nov. 2000

work page 2000
[17]

J. A. Gunnels and R. A. van de Geijn , Formal methods for high-performance linear algebra libraries, in The Architecture of Scientific Software, R. F. Boisvert and P. T. P. Tang, eds., Kluwer Academic Press, 2001, pp. 193–210. 26 ISHNA SATYARTH, CHAO YIN, ET AL

work page 2001
[18]

Lee and T

M. Lee and T. M. Low, A family of provably correct algorithms for exact triangle counting , in Proceedings of the First International Workshop on Software Correctness for HPC Appli- cations, Correctness’17, New York, NY, USA, 2017, Association for Computing Machinery, p. 14–20, https://doi.org/10.1145/3145344.3145484

work page doi:10.1145/3145344.3145484 2017
[19]

https://github.com/flame/libflame, 2023

libflame. https://github.com/flame/libflame, 2023

work page 2023
[20]

D. A. Matthews, MArray. http://github.com/devinamatthews/marray, 2024

work page 2024
[21]

D. A. Matthews, R. van de Geijn, M. Myers, and D. Parikh , BLIS: Extending BLAS functionality, SIAM News, 54 (2024), p. 8

work page 2024
[22]

B. N. Parlett and W. T. Reid, On the solution of a system of linear equations whose matrix is symmetric but not definite , BIT, 10 (1970), pp. 386–397

work page 1970
[23]

E. S. Quintana, G. Quintana, X. Sun, and R. van de Geijn , A note on parallel matrix inversion, SIAM J. Sci. Comput., 22 (2001), pp. 1762–1771

work page 2001
[24]

E. S. Quintana-Ort ´ı and R. A. van de Geijn , Formal derivation of algorithms: The triangular Sylvester equation , ACM Trans. Math. Soft., 29 (2003), pp. 218–243, http: //doi.acm.org/10.1145/779359.779365

work page doi:10.1145/779359.779365 2003
[25]

Rozloˇzn´ık, Miroslav and Shklarski, Gil and Toledo, Sivan , Partitioned triangular tridi- agonalization, ACM Trans. Math. Softw., 37 (2011), https://doi.org/10.1145/1916461. 1916462

work page doi:10.1145/1916461 2011
[26]

and Curtin, R

C. Sanderson and R. Curtin , Armadillo: a template-based c++ library for linear algebra , Journal of Open Source Software, 1 (2016), p. 26, https://doi.org/10.21105/joss.00026

work page doi:10.21105/joss.00026 2016
[27]

C. K. Thomas and A. A. Middleton, Exact algorithm for sampling the two-dimensional ising spin glass, Physical Review E, 80 (2009), p. 046708

work page 2009
[28]

van de Geijn and M

R. van de Geijn and M. Myers , Applying Dijkstra’s Vision to Numerical Software , As- sociation for Computing Machinery, 2022, p. 215–230, https://doi.org/10.1145/3544585. 3544597

work page doi:10.1145/3544585 2022
[29]

van de Geijn, M

R. van de Geijn, M. Myers, R. G. Xu, and D. Matthews, Deriving algorithms for triangular tridiagonalization a (skew-)symmetric matrix , 2023, https://arxiv.org/abs/2311.10700

work page arXiv 2023
[30]

R. A. van de Geijn and E. S. Quintana-Ort´ı, The Science of Programming Matrix Compu- tations, http://www.lulu.com/content/1911788, 2008

work page arXiv 2008
[31]

Van Zee, libflame, The Complete Reference , lulu.com, 2009

F. Van Zee, libflame, The Complete Reference , lulu.com, 2009

work page 2009
[32]

Van Zee et al., Implementing high-performance complex matrix multiplication via the 3M and 4M methods , ACM Trans

F. Van Zee et al., Implementing high-performance complex matrix multiplication via the 3M and 4M methods , ACM Trans. Math. Softw., (2017)

work page 2017
[33]

F. G. Van Zee, E. Chan, R. van de Geijn, E. S. Quintana-Ort ´ı, and G. Quintana- Ort´ı, The libflame library for dense matrix computations , IEEE Computation in Science & Engineering, 11 (2009), pp. 56–62

work page 2009
[34]

F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. A. Gunnels, and L. Killough , The BLIS framework: Experiments in portability , ACM Trans. Math. Softw., (2016)

work page 2016
[35]

F. G. Van Zee and R. A. van de Geijn , BLIS: A framework for rapidly instantiating BLAS functionality, ACM Trans. Math. Softw., (2015)

work page 2015
[36]

T. L. Veldhuizen , Arrays in Blitz++ , in Computing in Object-Oriented Parallel Envi- ronments, D. Caromel, R. R. Oldehoeft, and M. Tholburn, eds., no. 1505 in Lecture Notes in Computer Science, Springer Berlin Heidelberg, December 1998, pp. 223–230, https://doi.org/10.1007/3-540-49372-7 24

work page doi:10.1007/3-540-49372-7 1998
[37]

T. L. Veldhuizen, Blitz++: The Library that Thinks it is a Compiler , in Advances in Software Tools for Scientific Computing, H. P. Langtangen, A. M. Bruaset, and E. Quak, eds., Berlin, Heidelberg, 2000, Springer Berlin Heidelberg, pp. 57–87

work page 2000
[38]

Wimmer, Algorithm 923: Efficient numerical computation of the Pfaffian for dense and banded skew-symmetric matrices , ACM Trans

M. Wimmer, Algorithm 923: Efficient numerical computation of the Pfaffian for dense and banded skew-symmetric matrices , ACM Trans. Math. Softw., 38 (2012), https://doi.org/ 10.1145/2331130.2331138

work page doi:10.1145/2331130.2331138 2012
[39]

Wimmer, PFAPACK

M. Wimmer, PFAPACK. https://michaelwimmer.org/downloads.html, 2012

work page 2012
[40]

R. G. Xu, Pfaffine. https://github.com/xrq-phys/Pfaffine, 2022

work page 2022
[41]

R. G. Xu, T. Okubo, S. Todo, and M. Imada, Optimized implementation for calculation and fast-update of Pfaffians installed to the open-source fermionic variational solver mVMC , Computer Physics Communications, 277 (2022), p. 108375, https://doi.org/https://doi. org/10.1016/j.cpc.2022.108375

work page doi:10.1016/j.cpc.2022.108375 2022

[1] [1]

J. O. Aasen , On the reduction of a symmetric matrix to tridiagonal form , BIT, 11 (1971), pp. 223–242

work page 1971

[2] [2]

J. A. Acosta, T. M. Low, and D. N. Parikh , Families of butterfly counting algorithms for bipartite graphs, in 2022 IEEE International Parallel and Distributed Processing Sympo- sium Workshops (IPDPSW), 2022, pp. 304–313, https://doi.org/10.1109/IPDPSW55747. 2022.00060

work page doi:10.1109/ipdpsw55747 2022

[3] [3]

K. R. Apt and T. Hoare , Edsger Wybe Dijkstra: His Life,Work, and Legacy , vol. 45, Asso- ciation for Computing Machinery, New York, NY, USA, 1 ed., 2022

work page 2022

[4] [4]

Bajdich and L

M. Bajdich and L. Mitas , Electronic structure quantum monte carlo , acta physica slovaca, 59 (2009), pp. 81–168

work page 2009

[5] [5]

Ballard, D

G. Ballard, D. Becker, J. Demmel, J. Dongarra, A. Druinsky, I. Peled, O. Schwartz, S. Toledo, and I. Yamazaki, Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures, in 2013 IEEE 27th International Symposium on Par- allel and Distributed Processing, May 2013, pp. 895–907, https://doi.org/10.1109/IPDPS. 2013.98. ISSN:...

work page doi:10.1109/ipdps 2013

[6] [6]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, Communication-optimal parallel and sequential cholesky decomposition: extended abstract , in Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, New York, NY, USA, 2009, Association for Computing Machinery, p. 245–252, https://doi.org/10. 1145/1583991.1584054

work page arXiv 2009

[7] [7]

Bientinesi, E

P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ort ´ı, and R. A. van de Geijn, The science of deriving dense linear algebra algorithms , ACM Trans. Math. Soft., 31 (2005), pp. 1–26, http://doi.acm.org/10.1145/1055531.1055532

work page doi:10.1145/1055531.1055532 2005

[8] [9]

Bientinesi, B

P. Bientinesi, B. Gunter, and R. A. van de Geijn , Families of algorithms related to the inversion of a symmetric positive definite matrix , ACM Trans. Math. Softw., 35 (2008), pp. 3:1–3:22, http://doi.acm.org/10.1145/1377603.1377606

work page doi:10.1145/1377603.1377606 2008

[9] [10]

Bientinesi, E

P. Bientinesi, E. S. Quintana-Ort ´ı, and R. A. van de Geijn , Representing linear algebra algorithms in code: The FLAME application programming interfaces , ACM Trans. Math. Soft., 31 (2005), pp. 27–59, http://doi.acm.org/10.1145/1055531.1055533

work page doi:10.1145/1055531.1055533 2005

[10] [11]

Dagum and R

L. Dagum and R. Menon , OpenMP: an industry standard api for shared-memory program- ming, IEEE computational science and engineering, 5 (1998), pp. 46–55

work page 1998

[11] [12]

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff , A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16 (1990), pp. 1–17

work page 1990

[12] [13]

Eijkhout, P

V. Eijkhout, P. Bientinesi, and R. van de Geijn , Towards mechanical derivation of krylov solver libraries, Procedia Computer Science, 1 (2010), pp. 1805 – 1813, https://doi.org/10. 1016/j.procs.2010.04.202. ICCS 2010

work page 2010

[13] [14]

Guennebaud, B

G. Guennebaud, B. Jacob, et al. , Eigen v3. http://eigen.tuxfamily.org, 2010

work page 2010

[14] [15]

J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn , FLAME: Formal Linear Algebra Methods Environment , ACM Trans. Math. Soft., 27 (2001), pp. 422–455, http://doi.acm.org/10.1145/504210.504213

work page doi:10.1145/504210.504213 2001

[15] [16]

J. A. Gunnels, G. M. Henry, and R. A. van de Geijn , Formal Linear Algebra Methods Environment (FLAME): Overview, FLAME Working Note #1 CS-TR-00-28, Department of Computer Sciences, The University of Texas at Austin, Nov. 2000

work page 2000

[16] [17]

J. A. Gunnels and R. A. van de Geijn , Formal methods for high-performance linear algebra libraries, in The Architecture of Scientific Software, R. F. Boisvert and P. T. P. Tang, eds., Kluwer Academic Press, 2001, pp. 193–210. 26 ISHNA SATYARTH, CHAO YIN, ET AL

work page 2001

[17] [18]

Lee and T

M. Lee and T. M. Low, A family of provably correct algorithms for exact triangle counting , in Proceedings of the First International Workshop on Software Correctness for HPC Appli- cations, Correctness’17, New York, NY, USA, 2017, Association for Computing Machinery, p. 14–20, https://doi.org/10.1145/3145344.3145484

work page doi:10.1145/3145344.3145484 2017

[18] [19]

https://github.com/flame/libflame, 2023

libflame. https://github.com/flame/libflame, 2023

work page 2023

[19] [20]

D. A. Matthews, MArray. http://github.com/devinamatthews/marray, 2024

work page 2024

[20] [21]

D. A. Matthews, R. van de Geijn, M. Myers, and D. Parikh , BLIS: Extending BLAS functionality, SIAM News, 54 (2024), p. 8

work page 2024

[21] [22]

B. N. Parlett and W. T. Reid, On the solution of a system of linear equations whose matrix is symmetric but not definite , BIT, 10 (1970), pp. 386–397

work page 1970

[22] [23]

E. S. Quintana, G. Quintana, X. Sun, and R. van de Geijn , A note on parallel matrix inversion, SIAM J. Sci. Comput., 22 (2001), pp. 1762–1771

work page 2001

[23] [24]

E. S. Quintana-Ort ´ı and R. A. van de Geijn , Formal derivation of algorithms: The triangular Sylvester equation , ACM Trans. Math. Soft., 29 (2003), pp. 218–243, http: //doi.acm.org/10.1145/779359.779365

work page doi:10.1145/779359.779365 2003

[24] [25]

Rozloˇzn´ık, Miroslav and Shklarski, Gil and Toledo, Sivan , Partitioned triangular tridi- agonalization, ACM Trans. Math. Softw., 37 (2011), https://doi.org/10.1145/1916461. 1916462

work page doi:10.1145/1916461 2011

[25] [26]

and Curtin, R

C. Sanderson and R. Curtin , Armadillo: a template-based c++ library for linear algebra , Journal of Open Source Software, 1 (2016), p. 26, https://doi.org/10.21105/joss.00026

work page doi:10.21105/joss.00026 2016

[26] [27]

C. K. Thomas and A. A. Middleton, Exact algorithm for sampling the two-dimensional ising spin glass, Physical Review E, 80 (2009), p. 046708

work page 2009

[27] [28]

van de Geijn and M

R. van de Geijn and M. Myers , Applying Dijkstra’s Vision to Numerical Software , As- sociation for Computing Machinery, 2022, p. 215–230, https://doi.org/10.1145/3544585. 3544597

work page doi:10.1145/3544585 2022

[28] [29]

van de Geijn, M

R. van de Geijn, M. Myers, R. G. Xu, and D. Matthews, Deriving algorithms for triangular tridiagonalization a (skew-)symmetric matrix , 2023, https://arxiv.org/abs/2311.10700

work page arXiv 2023

[29] [30]

R. A. van de Geijn and E. S. Quintana-Ort´ı, The Science of Programming Matrix Compu- tations, http://www.lulu.com/content/1911788, 2008

work page arXiv 2008

[30] [31]

Van Zee, libflame, The Complete Reference , lulu.com, 2009

F. Van Zee, libflame, The Complete Reference , lulu.com, 2009

work page 2009

[31] [32]

Van Zee et al., Implementing high-performance complex matrix multiplication via the 3M and 4M methods , ACM Trans

F. Van Zee et al., Implementing high-performance complex matrix multiplication via the 3M and 4M methods , ACM Trans. Math. Softw., (2017)

work page 2017

[32] [33]

F. G. Van Zee, E. Chan, R. van de Geijn, E. S. Quintana-Ort ´ı, and G. Quintana- Ort´ı, The libflame library for dense matrix computations , IEEE Computation in Science & Engineering, 11 (2009), pp. 56–62

work page 2009

[33] [34]

F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. A. Gunnels, and L. Killough , The BLIS framework: Experiments in portability , ACM Trans. Math. Softw., (2016)

work page 2016

[34] [35]

F. G. Van Zee and R. A. van de Geijn , BLIS: A framework for rapidly instantiating BLAS functionality, ACM Trans. Math. Softw., (2015)

work page 2015

[35] [36]

T. L. Veldhuizen , Arrays in Blitz++ , in Computing in Object-Oriented Parallel Envi- ronments, D. Caromel, R. R. Oldehoeft, and M. Tholburn, eds., no. 1505 in Lecture Notes in Computer Science, Springer Berlin Heidelberg, December 1998, pp. 223–230, https://doi.org/10.1007/3-540-49372-7 24

work page doi:10.1007/3-540-49372-7 1998

[36] [37]

T. L. Veldhuizen, Blitz++: The Library that Thinks it is a Compiler , in Advances in Software Tools for Scientific Computing, H. P. Langtangen, A. M. Bruaset, and E. Quak, eds., Berlin, Heidelberg, 2000, Springer Berlin Heidelberg, pp. 57–87

work page 2000

[37] [38]

Wimmer, Algorithm 923: Efficient numerical computation of the Pfaffian for dense and banded skew-symmetric matrices , ACM Trans

M. Wimmer, Algorithm 923: Efficient numerical computation of the Pfaffian for dense and banded skew-symmetric matrices , ACM Trans. Math. Softw., 38 (2012), https://doi.org/ 10.1145/2331130.2331138

work page doi:10.1145/2331130.2331138 2012

[38] [39]

Wimmer, PFAPACK

M. Wimmer, PFAPACK. https://michaelwimmer.org/downloads.html, 2012

work page 2012

[39] [40]

R. G. Xu, Pfaffine. https://github.com/xrq-phys/Pfaffine, 2022

work page 2022

[40] [41]

R. G. Xu, T. Okubo, S. Todo, and M. Imada, Optimized implementation for calculation and fast-update of Pfaffians installed to the open-source fermionic variational solver mVMC , Computer Physics Communications, 277 (2022), p. 108375, https://doi.org/https://doi. org/10.1016/j.cpc.2022.108375

work page doi:10.1016/j.cpc.2022.108375 2022