Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors
Pith reviewed 2026-05-25 11:30 UTC · model grok-4.3
The pith
An a-priori support choice for sparse multi-vectors enables matrix-free finite element operators to reach 157 GFlop/s on Intel Cascade Lake with MPI scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fixing an a-priori support for each vector in a sparse multi-vector, the authors construct data structures that support three matrix-free operations: sparse matrix-multivector multiplication, projection of an operator onto the sparse subspace, and post-multiplication of the sparse multi-vector by a square matrix. All three operations are realized via cell-wise quadrature of the Hamiltonian without assembling global matrices. On Intel Cascade Lake this implementation delivers 157 GFlop/s; strong and weak scaling results are shown for typical benchmark problems discretized with quadratic and quartic finite-element bases.
What carries the argument
Matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature, driven by an a-priori chosen support for each vector inside an MPI-parallel sparse multi-vector.
If this is right
- Matrix-free SpMM, inner products, and post-multiplication become practical without assembling global sparse matrices.
- Linear-scaling solution methods for localized problems become feasible inside existing finite-element libraries.
- Node-level performance is limited by the roofline of the target CPU rather than by matrix assembly overhead.
- The same support-based data structures apply to both quadratic and quartic bases on the reported benchmark.
Where Pith is reading between the lines
- The same support-selection idea could be tested on other matrix-free operators that currently rely on dense vectors.
- Adaptive rather than fixed support might further reduce communication if the localization pattern changes during a simulation.
- The approach may transfer to other high-order discretization libraries that already use cell-wise quadrature.
Load-bearing premise
An a-priori chosen support for each vector remains small enough to keep communication volume and fill-in low during the three matrix-free operations in the MPI setting.
What would settle it
A scaling run on a larger number of MPI ranks that shows communication time or extra fill-in growing faster than the arithmetic work would falsify the efficiency claim.
Figures
read the original abstract
Traditional solution approaches for problems in quantum mechanics scale as $\mathcal O(M^3)$, where $M$ is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling $\mathcal O(M)$. One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a roofline model. Our matrix-free implementation of finite element operators with sparse multivectors achieves the performance of 157 GFlop/s on Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic finite element bases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents algorithms and data structures in the deal.II library for MPI-parallel sparse multi-vectors to enable matrix-free finite element operators. The core approach uses a-priori chosen per-vector support to implement (i) sparse matrix multivector products (SpMM), (ii) operator projection via inner products, and (iii) post-multiplication with a square matrix. Node-level performance is analyzed via roofline model, with a reported peak of 157 GFlop/s on Intel Cascade Lake; strong and weak scaling results are shown for quadratic and quartic bases on a benchmark problem.
Significance. If the performance and scaling claims hold under the stated assumptions, the work provides a practical foundation for linear-scaling O(M) methods in quantum mechanics within distributed finite-element frameworks by exploiting orbital localization. The matrix-free quadrature-based operator evaluation and explicit handling of sparse multivectors are concrete contributions that could be adopted in other libraries.
major comments (1)
- [Abstract] Abstract: The central performance claim of 157 GFlop/s and the reported strong/weak scaling rest on the matrix-free SpMM, inner-product, and post-multiplication kernels remaining communication-efficient under an a-priori chosen support. No communication-volume bound, halo-exchange analysis, or fill-in estimate is supplied as a function of process count or polynomial degree; without this, the roofline numbers and scaling curves cannot be verified to demonstrate the claimed efficiency rather than being limited by MPI overhead.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for the positive assessment of its significance for linear-scaling methods in quantum mechanics. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claim of 157 GFlop/s and the reported strong/weak scaling rest on the matrix-free SpMM, inner-product, and post-multiplication kernels remaining communication-efficient under an a-priori chosen support. No communication-volume bound, halo-exchange analysis, or fill-in estimate is supplied as a function of process count or polynomial degree; without this, the roofline numbers and scaling curves cannot be verified to demonstrate the claimed efficiency rather than being limited by MPI overhead.
Authors: We agree that the manuscript would benefit from an explicit communication analysis. The presented scaling results are empirical and rely on the a-priori localized support to ensure that halo exchanges remain limited to neighboring processes. In the revised manuscript we will add a dedicated subsection that derives communication-volume bounds and halo sizes as functions of process count and polynomial degree, using the chosen support pattern. These estimates will be compared against the observed strong- and weak-scaling curves to confirm that MPI overhead does not dominate the reported performance. revision: yes
Circularity Check
No circularity: implementation and benchmarking study
full rationale
The paper describes algorithms, data structures, and MPI-parallel implementations for sparse multi-vectors and matrix-free operators in deal.II. Performance numbers (157 GFlop/s) and scaling results are empirical measurements on Cascade Lake hardware for quadratic/quartic bases. No equations, predictions, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The a-priori support choice is an explicit algorithmic assumption whose validity is tested via reported benchmarks rather than derived from itself. This is a standard self-contained engineering paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Finite-element basis functions have local support that can be exploited by an a-priori chosen vector support pattern
Reference graph
Works this paper leans on
-
[1]
Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Transactions on Parallel Computing (TOPC) 4, 3 (2018), 13
work page 2018
-
[2]
Giovanni Alze/t_ta, Daniel Arndt, Wolfgang Bangerth, Vishal Boddu, Benjamin Brands, Denis Davydov, Rene Gassm¨oller, Timo Heister, Luca Heltai, Katharina Kormann, Martin Kronbichler, Ma/t_thias Maier, Jean-Paul Pelteret, Bruno Turcksin, and David Wells. 2018. /T_hedeal.II Library, Version 9.0. Journal of Numerical Mathematics 26, 4 (2018), 173–183. h/t_tp...
-
[3]
Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. 2015. Accelerating the LOBPCG Method on GPUs Using a Blocked Sparse Matrix Vector Product. In Proceedings of the Symposium on High Performance Computing (HPC ’15) . Society for Computer Simulation International, 75–82. h/t_tp://dl.acm.org/citation.cfm?id=2872599.2872609
-
[4]
Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D
Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Ma/t_thew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, Stefano Zampini, and Hong Zhang. 2015.PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.6. Argonne National Laboratory
work page 2015
-
[5]
Gang Bao, Guanghui Hu, and Di Liu. 2012. Numerical Solution of the Kohn-Sham Equation by Finite Element Methods with an Adaptive Mesh Redistribution Technique. Journal of Scienti/f_ic Computing55, 2 (#sep# 2012), 372–391. h/t_tps://doi.org/10.1007/s10915-012-9636-1
-
[6]
/T_homas L. Beck. 2009.Real-Space and Multigrid Methods in Computational Chemistry . John Wiley & Sons, Inc., Hoboken, New Jersey, Chapter 5, 223–285. h/t_tps://doi.org/10.1002/9780470399545.ch5
-
[7]
Nicolas Bock and Ma/t_t Challacombe. 2013. An optimized sparse approximate matrix multiply for matrices with decay.SIAM Journal on Scienti/f_ic Computing 35, 1 (2013), C72–C98
work page 2013
-
[8]
Nicolas Bock, Ma/t_t Challacombe, and Laxmikant V Kal´e. 2016. Solvers for O(N) Electronic Structure in the Strong Scaling Limit. SIAM Journal on Scienti/f_ic Computing38, 1 (2016), C1–C21
work page 2016
-
[9]
Urban Borˇstnik, Joost VandeVondele, Val´ery Weber, and J¨urg Hu/t_ter. 2014. Sparse matrix multiplication: /T_he distributed block-compressed sparse row library. Parallel Comput. 40, 5-6 (2014), 47–58
work page 2014
-
[10]
David R. Bowler, Ian J. Bush, and Michael J. Gillan. 2000. Practical methods for ab initio calculations on thousands of atoms. International Journal of /Q_uantum Chemistry77, 5 (2000), 831–842. Manuscript submi/t_ted to ACM 28 Davydov, D. and Kronbichler, M
work page 2000
-
[11]
Bowler, Tsuyoshi Miyazaki, and Michael J
David R. Bowler, Tsuyoshi Miyazaki, and Michael J. Gillan. 2001. Parallel sparse matrix multiplication for linear scaling electronic structure calculations. Computer physics communications 137, 2 (2001), 255–273
work page 2001
-
[12]
Aydin Buluc and John R Gilbert. 2008. Challenges and advances in parallel sparse matrix-matrix multiplication. In 37th International Conference on Parallel Processing. IEEE, 503–510
work page 2008
-
[13]
Steven K Burger and Weitao Yang. 2008. Linear-scaling quantum calculations using non-orthogonal localized molecular orbitals. Journal of Physics: Condensed Ma/t_ter20, 29 (2008), 294209
work page 2008
-
[14]
Eric J Bylaska, Michael Holst, and John H Weare. 2009. Adaptive Finite Element Method for Solving the Exact Kohn-Sham Equation of Density Functional /T_heory.Journal of Chemical /T_heory and Computation5, 4 (#apr# 2009), 937–948. h/t_tps://doi.org/10.1021/ct800350j
-
[15]
Ma/t_t Challacombe. 2000. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory.Computer physics communications 128, 1-2 (2000), 93–107
work page 2000
-
[16]
Denis Davydov, Tymo/f_iy Gerasimov, Jean-Paul Pelteret, and Paul Steinmann. 2017. Convergence study of the h-adaptive PUM and the hp-adaptive FEM applied to eigenvalue problems in quantum mechanics. Advanced Modeling and Simulation in Engineering Sciences 4, 1 (12 Dec 2017), 7. h/t_tps://doi.org/10.1186/s40323-017-0093-0
-
[17]
Denis Davydov, Timo Heister, Martin Kronbichler, and Paul Steinmann. 2018. Matrix-free locally adaptive /f_inite element solution of density- functional theory with nonorthogonal orbitals and multigrid preconditioning. Physica Status Solidi B: Basic Solid State Physics 255, 9 (2018). h/t_tps://doi.org/10.1002/pssb.201800069
-
[18]
Jun Fang, Xingyu Gao, and Aihui Zhou. 2012. A Kohn-Sham equation solver based on hexahedral /f_inite elements.J. Comput. Phys. 231, 8 (2012), 3166–3180
work page 2012
-
[19]
Jean-Luc Fa/t_tebert and Jerry Bernholc. 2000. Towards grid-basedO(N) density-functional theory methods: Optimized nonorthogonal orbitals and multigrid acceleration. Phys. Rev. B 62 (Jul 2000), 1713–1722. Issue 3. h/t_tps://doi.org/10.1103/PhysRevB.62.1713
-
[20]
Jean-Luc Fa/t_tebert and Francois Gygi. 2004. Linear scaling /f_irst-principles molecular dynamics with controlled accuracy. Computer Physics Communications 162, 1 (2004), 24–36
work page 2004
-
[21]
Jean-Luc Fa/t_tebert and Francois Gygi. 2006. Linear-scaling /f_irst-principles molecular dynamics with plane-waves accuracy.Phys. Rev. B 73 (Mar 2006), 115124. Issue 11. h/t_tps://doi.org/10.1103/PhysRevB.73.115124
-
[22]
Jean-Luc Fa/t_tebert, Richard D. Hornung, and Andrew M. Wissink. 2007. Finite element approach for density functional theory calculations on locally-re/f_ined meshes.J. Comput. Phys. 223, 2 (#may# 2007), 759–773. h/t_tps://doi.org/10.1016/j.jcp.2006.10.013
-
[23]
J.T. Frey and D.J. Doren. 2011. TubeGen 3.4 (web-interface, h/t_tp://turin.nss.udel.edu/research/tubegenonline.html), University of Delaware, Newark, DE. (2011)
work page 2011
-
[24]
Giulia Galli and Michele Parrinello. 1992. Large scale electronic structure calculations. Physical review le/t_ters69, 24 (1992), 3547
work page 1992
-
[25]
Goringe, Eduardo Hern ´andez, Michael J
Chris M. Goringe, Eduardo Hern ´andez, Michael J. Gillan, and Ian J. Bush. 1997. Linear-scaling DFT-pseudopotential calculations on parallel computers. Computer physics communications 102, 1-3 (1997), 1–16
work page 1997
-
[26]
Fred G Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical So/f_tware (TOMS)4, 3 (1978), 250–269
work page 1978
-
[27]
Michael A. Heroux, Roscoe A. Bartle/t_t, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. /T_hornquist, Ray S. Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. 2005. An overview of the Trilinos project. ACM Trans. M...
-
[28]
Kikuji Hirose and Tomoya Ono. 2001. Direct minimization to generate electronic states with proper occupation numbers. Physical Review B 64, 8 (2001), 085105
work page 2001
-
[29]
Kikuji Hirose, Tomoya Ono, Yoshitaka Fujimoto, and Shigeru Tsukamoto. 2005. First-Principles Claculations in Real-Space Formalism . Imperial College Press, 57 Shelton Street, Covent Garden, London WC2H 9HE
work page 2005
-
[30]
Pierre C. Hohenberg and Walter Kohn. 1964. Inhomogeneous electron gas. Physical Review 136, 3B (1964), B864–B871
work page 1964
-
[31]
C ¸ataly¨urek, Srinivasan Parthasarathy, and P
Changwan Hong, Aravind Sukumaran-Rajam, Bortik Bandyopadhyay, Jinsung Kim, S ¨ureyya Emre Kurt, Israt Nisa, Shivani Sabhlok, ¨Umit V. C ¸ataly¨urek, Srinivasan Parthasarathy, and P. Sadayappan. 2018. Efficient Sparse-matrix Multi-vector Product on GPUs. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing...
-
[32]
Sohrab Ismail-Beigi and Tomas A. Arias. 1999. Locality of the Density Matrix in Metals, Semiconductors, and Insulators. Physical Review Le/t_ters82 (Mar 1999), 2127–2130. Issue 10. h/t_tps://doi.org/10.1103/PhysRevLe/t_t.82.2127
-
[33]
Jeremy Kepner and John Gilbert. 2011. Graph algorithms in the language of linear algebra . SIAM
work page 2011
-
[34]
King-Smith and David Vanderbilt
R.D. King-Smith and David Vanderbilt. 1994. First-principles investigation of ferroelectricity in perovskite compounds. Physical Review B 49, 9 (1994), 5828
work page 1994
-
[35]
Walter Kohn. 1959. Analytic Properties of Bloch Waves and Wannier Functions. Physical Review 115 (Aug 1959), 809–821. Issue 4. h/t_tps: //doi.org/10.1103/PhysRev.115.809
-
[36]
Walter Kohn and Lu Jeu Sham. 1965. Self-consistent equations including exchange and correlation effects. Physical Review 140, 4A (1965), A1133–A1138. Manuscript submi/t_ted to ACM Algorithms and data structures for matrix-free /f_inite element operators with MPI-parallel sparse multi-vectors 29
work page 1965
-
[37]
Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R Bishop. 2014. A uni/f_ied sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scienti/f_ic Computing36, 5 (2014), C401–C423
work page 2014
-
[38]
Moritz Kreutzer, Jonas /T_hies, Melven R¨ohrig-Z¨ollner, Andreas Pieper, Faisal Shahzad, Martin Galgon, Achim Basermann, Holger Fehske, Georg Hager, and Gerhard Wellein. 2017. GHOST: building blocks for high performance sparse linear algebra on heterogeneous systems. International Journal of Parallel Programming 45, 5 (2017), 1046–1072
work page 2017
-
[39]
Martin Kronbichler and Katharina Kormann. 2012. A generic interface for parallel cell-based /f_inite element operator application.Computers & Fluids 63 (jun 2012), 135–147. h/t_tps://doi.org/10.1016/j.comp/f_luid.2012.04.012
-
[40]
Martin Kronbichler and Katharina Kormann. 2019. Fast matrix-free evaluation of discontinuous Galerkin /f_inite element operators.ACM Trans. Math. So/f_tware(2019). In press
work page 2019
-
[41]
Jianfeng Lu and Kyle /T_hicke. 2017. Orbital minimization method withl 1 regularization. J. Comput. Phys. 336 (2017), 87–103
work page 2017
-
[42]
Francesco Mauri and Giulia Galli. 1994. Electronic-structure calculations and molecular-dynamics simulations with linear system-size scaling. Physical Review B 50, 7 (1994), 4316
work page 1994
-
[43]
Michael McCourt, Barry Smith, and Hong Zhang. 2013. Efficient sparse Matrix-Matrix products using colorings. SIAM J. Matrix Anal. Appl. (2013)
work page 2013
-
[44]
Phani Motamarri, Sambit Das, Shiva Rudraraju, Krishnendu Ghosh, Denis Davydov, and Vikram Gavini. 2019. DFT-FE - A massively parallel adaptive /f_inite-element code for large-scale density functional theory calculations.Computer Physics Communications (2019)
work page 2019
-
[45]
Phani Motamarri, Michael R Nowak, Kenneth Leiter, Jaroslaw Knap, and Vikram Gavini. 2012. Higher-order adaptive /f_inite-element methods for Kohn-Sham density functional theory. J. Comput. Phys. 253, 15 (#jun# 2012), 308–343
work page 2012
-
[46]
Kurtis L. Nusbaum. 2011. Optimizing Tpetra’s sparse matrix-matrix multiplication routine . Technical Report. Technical report, SAND2011–6036, Sandia National Laboratories, Albuquerque, NM
work page 2011
-
[47]
Pablo Ordej´on, David A Drabold, Richard M Martin, and Ma/t_thew P Grumbach. 1995. Linear system-size scaling methods for electronic-structure calculations. Physical Review B 51, 3 (1995), 1456
work page 1995
-
[48]
John E. Pask, Barry M. Klein, Philip A. Sterne, and Chingyao Y. Fong. 2001. Finite-element methods in electronic-structure theory. Computer Physics Communications 135, 1 (2001), 1–34
work page 2001
-
[49]
John E. Pask and Natarajan Sukumar. 2017. Partition of unity /f_inite element method for quantum mechanical materials calculations. Extreme Mechanics Le/t_ters11 (2017), 8–17. h/t_tps://doi.org/10.1016/j.eml.2016.11.003
-
[50]
Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jongsoo Park, Michael J Anderson, Satya Gautam Vadlamudi, Dipankar Das, Sergey G Pudov, Vadim O Pirogov, and Pradeep Dubey. 2015. Parallel efficient sparse matrix-matrix multiplication on multicore platforms. In International Conference on High Performance Computing . Springer, 48–57
work page 2015
-
[51]
Liang Peng, Feng Long Gu, and Weitao Yang. 2013. Effective preconditioning for ab initio ground state energy minimization with non-orthogonal localized molecular orbitals. Physical Chemistry Chemical Physics 15, 37 (2013), 15518–15527
work page 2013
-
[52]
L Ramdas Ram-Mohan. 2002. Finite Element and Boundary Element Applications in /Q_uantum Mechanics. Oxford University Press
work page 2002
-
[53]
Emanuel H Rubensson, Elias Rudberg, and Pawe l Sa lek. 2006. Sparse matrix algebra for quantum modeling of large systems. In International Workshop on Applied Parallel Computing. Springer, 90–99
work page 2006
-
[54]
Emanuel H Rubensson, Elias Rudberg, and Pawe l Sa lek. 2007. A hierarchic sparse matrix data structure for large-scale Hartree-Fock/Kohn-Sham calculations. Journal of computational chemistry 28, 16 (2007), 2531–2537
work page 2007
-
[55]
Elias Rudberg, Emanuel H Rubensson, Pawe l Sa lek, and Anastasia Kruchinina. 2018. Ergo: An open-source program for linear-scaling electronic structure calculations. So/f_twareX7 (2018), 107–111
work page 2018
-
[56]
Chandra Saravanan, Yihan Shao, Roi Baer, Philip N Ross, and Martin Head-Gordon. 2003. Sparse matrix multiplications for linear scaling electronic structure calculations in an atom-centered basis set using multiatom blocks. Journal of computational chemistry 24, 5 (2003), 618–622
work page 2003
-
[57]
Volker Schauer and Christian Linder. 2015. /T_he reduced basis method in all-electron calculations with /f_inite elements.Advances in Computational Mathematics 41, 5 (2015), 1035–1047. h/t_tps://doi.org/10.1007/s10444-014-9374-z
-
[58]
Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of PSTI2010, the First International Workshop on Parallel So/f_tware Tools and Tool Infrastructures. San Diego CA
work page 2010
-
[59]
Eiji Tsuchida and Masaru Tsukada. 1996. Adaptive /f_inite-element method for electronic-structure calculations. Phys. Rev. B 54, 11 (sep 1996), 7602–7605. h/t_tps://doi.org/10.1103/physrevb.54.7602
-
[60]
Eiji Tsuchida and Masaru Tsukada. 1998. Large-Scale Electronic-Structure Calculations Based on the Adaptive Finite-Element Method. Journal of the Physical Society of Japan 67, 11 (1998), 3844–3858. h/t_tps://doi.org/10.1143/JPSJ.67.3844
-
[61]
Young, Eloy Romero, and Jose E
Toby D. Young, Eloy Romero, and Jose E. Roman. 2013. Parallel /f_inite element density functional computations exploiting grid re/f_inement and subspace recycling. In Computer Physics Communications. IEEE, 66–72. h/t_tps://doi.org/10.1109/DFT.2011.68
-
[62]
Dier Zhang, Lihua Shen, Aihui Zhou, and Xin-Gao Gong. 2008. Finite element method for solving Kohn–Sham equations based on self-adaptive tetrahedral mesh. Physics Le/t_ters A372, 30 (#jul# 2008), 5071–5076. h/t_tps://doi.org/10.1016/j.physleta.2008.05.075 Manuscript submi/t_ted to ACM
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.