pith. sign in

arxiv: 1907.01005 · v1 · pith:C7ONL5XAnew · submitted 2019-07-01 · 💻 cs.MS · cs.DS· cs.NA· math.NA· physics.comp-ph

Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors

Pith reviewed 2026-05-25 11:30 UTC · model grok-4.3

classification 💻 cs.MS cs.DScs.NAmath.NAphysics.comp-ph
keywords matrix-free methodsfinite element operatorssparse multi-vectorsMPI parallelHamiltonian evaluationquantum mechanicsdeal.II
0
0 comments X

The pith

An a-priori support choice for sparse multi-vectors enables matrix-free finite element operators to reach 157 GFlop/s on Intel Cascade Lake with MPI scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents algorithms and data structures for sparse parallel vectors inside a finite element library. These allow matrix-free evaluation of operators through cell-wise quadrature together with sparse matrix-multivector products, inner products onto a sparse subspace, and post-multiplication by a square matrix. The approach exploits local support of the vectors to keep communication and fill-in low in an MPI setting. Performance measurements show 157 GFlop/s on Cascade Lake together with strong and weak scaling on benchmark problems that use quadratic and quartic elements.

Core claim

By fixing an a-priori support for each vector in a sparse multi-vector, the authors construct data structures that support three matrix-free operations: sparse matrix-multivector multiplication, projection of an operator onto the sparse subspace, and post-multiplication of the sparse multi-vector by a square matrix. All three operations are realized via cell-wise quadrature of the Hamiltonian without assembling global matrices. On Intel Cascade Lake this implementation delivers 157 GFlop/s; strong and weak scaling results are shown for typical benchmark problems discretized with quadratic and quartic finite-element bases.

What carries the argument

Matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature, driven by an a-priori chosen support for each vector inside an MPI-parallel sparse multi-vector.

If this is right

  • Matrix-free SpMM, inner products, and post-multiplication become practical without assembling global sparse matrices.
  • Linear-scaling solution methods for localized problems become feasible inside existing finite-element libraries.
  • Node-level performance is limited by the roofline of the target CPU rather than by matrix assembly overhead.
  • The same support-based data structures apply to both quadratic and quartic bases on the reported benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same support-selection idea could be tested on other matrix-free operators that currently rely on dense vectors.
  • Adaptive rather than fixed support might further reduce communication if the localization pattern changes during a simulation.
  • The approach may transfer to other high-order discretization libraries that already use cell-wise quadrature.

Load-bearing premise

An a-priori chosen support for each vector remains small enough to keep communication volume and fill-in low during the three matrix-free operations in the MPI setting.

What would settle it

A scaling run on a larger number of MPI ranks that shows communication time or extra fill-in growing faster than the arithmetic work would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 1907.01005 by Denis Davydov, Martin Kronbichler.

Figure 1
Figure 1. Figure 1: FE mesh with quadratic Lagrange basis partitioned into 2 processes. Locally owned DoFs are depicted with crosses whereas [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Arithmetic complexity of matrix-free and matrix-based CSR approaches for a single matrix-vector multiplication with the [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Three support domains for sparse FE vectors distributed over two MPI processes. Non-zero locally owned DoFs are depicted [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Renumbering of DoFs using Hilbert space filling curves via Algorithm [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: BCSR sparsity for the six sparse vectors with support depicted in Figure [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The cut of the mesh for M = 320 (carbon nanotube with 160 atoms) with quadratic FE basis. Color indicates MPI partitioning into ten processes. As a benchmark problem, we look at carbon nanotubes of di‚erent lengths. Note that within this contribution we do not solve the DFT but only benchmark and study the required operations for the OM approach. Carbon nanotubes are chosen as a benchmark since these syste… view at source ↗
Figure 7
Figure 7. Figure 7: Roofline performance model. The dashed roof line ”w/o r/w” represents the matrix-free Algorithm [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Breakdown of computing times of various steps of Algorithm [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Roofline performance model on Intel Xeon Cascade Lake using the GCC compiler. The dashed roof line ”w/o r/w” represents [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Scaling results for carbon nanotubes of di€erent lengths for a fixed number of processors. Timings are the average of each [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Strong scaling results for carbon nanotube with [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Weak scaling results for carbon nanotubes of di€erent length. Timings are the average of each operation called thirty times. [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
read the original abstract

Traditional solution approaches for problems in quantum mechanics scale as $\mathcal O(M^3)$, where $M$ is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling $\mathcal O(M)$. One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a roofline model. Our matrix-free implementation of finite element operators with sparse multivectors achieves the performance of 157 GFlop/s on Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic finite element bases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents algorithms and data structures in the deal.II library for MPI-parallel sparse multi-vectors to enable matrix-free finite element operators. The core approach uses a-priori chosen per-vector support to implement (i) sparse matrix multivector products (SpMM), (ii) operator projection via inner products, and (iii) post-multiplication with a square matrix. Node-level performance is analyzed via roofline model, with a reported peak of 157 GFlop/s on Intel Cascade Lake; strong and weak scaling results are shown for quadratic and quartic bases on a benchmark problem.

Significance. If the performance and scaling claims hold under the stated assumptions, the work provides a practical foundation for linear-scaling O(M) methods in quantum mechanics within distributed finite-element frameworks by exploiting orbital localization. The matrix-free quadrature-based operator evaluation and explicit handling of sparse multivectors are concrete contributions that could be adopted in other libraries.

major comments (1)
  1. [Abstract] Abstract: The central performance claim of 157 GFlop/s and the reported strong/weak scaling rest on the matrix-free SpMM, inner-product, and post-multiplication kernels remaining communication-efficient under an a-priori chosen support. No communication-volume bound, halo-exchange analysis, or fill-in estimate is supplied as a function of process count or polynomial degree; without this, the roofline numbers and scaling curves cannot be verified to demonstrate the claimed efficiency rather than being limited by MPI overhead.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive assessment of its significance for linear-scaling methods in quantum mechanics. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim of 157 GFlop/s and the reported strong/weak scaling rest on the matrix-free SpMM, inner-product, and post-multiplication kernels remaining communication-efficient under an a-priori chosen support. No communication-volume bound, halo-exchange analysis, or fill-in estimate is supplied as a function of process count or polynomial degree; without this, the roofline numbers and scaling curves cannot be verified to demonstrate the claimed efficiency rather than being limited by MPI overhead.

    Authors: We agree that the manuscript would benefit from an explicit communication analysis. The presented scaling results are empirical and rely on the a-priori localized support to ensure that halo exchanges remain limited to neighboring processes. In the revised manuscript we will add a dedicated subsection that derives communication-volume bounds and halo sizes as functions of process count and polynomial degree, using the chosen support pattern. These estimates will be compared against the observed strong- and weak-scaling curves to confirm that MPI overhead does not dominate the reported performance. revision: yes

Circularity Check

0 steps flagged

No circularity: implementation and benchmarking study

full rationale

The paper describes algorithms, data structures, and MPI-parallel implementations for sparse multi-vectors and matrix-free operators in deal.II. Performance numbers (157 GFlop/s) and scaling results are empirical measurements on Cascade Lake hardware for quadratic/quartic bases. No equations, predictions, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The a-priori support choice is an explicit algorithmic assumption whose validity is tested via reported benchmarks rather than derived from itself. This is a standard self-contained engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an algorithmic implementation study and introduces no new physical constants, fitted parameters, or postulated entities. It relies on standard assumptions of finite-element methods and parallel computing.

axioms (1)
  • domain assumption Finite-element basis functions have local support that can be exploited by an a-priori chosen vector support pattern
    Invoked to justify the sparse multi-vector representation and matrix-free quadrature approach.

pith-pipeline@v0.9.0 · 5774 in / 1356 out tokens · 54590 ms · 2026-05-25T11:30:04.548401+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

  1. [1]

    Kadir Akbudak, Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Transactions on Parallel Computing (TOPC) 4, 3 (2018), 13

  2. [2]

    Giovanni Alze/t_ta, Daniel Arndt, Wolfgang Bangerth, Vishal Boddu, Benjamin Brands, Denis Davydov, Rene Gassm¨oller, Timo Heister, Luca Heltai, Katharina Kormann, Martin Kronbichler, Ma/t_thias Maier, Jean-Paul Pelteret, Bruno Turcksin, and David Wells. 2018. /T_hedeal.II Library, Version 9.0. Journal of Numerical Mathematics 26, 4 (2018), 173–183. h/t_tp...

  3. [3]

    Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. 2015. Accelerating the LOBPCG Method on GPUs Using a Blocked Sparse Matrix Vector Product. In Proceedings of the Symposium on High Performance Computing (HPC ’15) . Society for Computer Simulation International, 75–82. h/t_tp://dl.acm.org/citation.cfm?id=2872599.2872609

  4. [4]

    Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D

    Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Ma/t_thew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, Stefano Zampini, and Hong Zhang. 2015.PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.6. Argonne National Laboratory

  5. [5]

    Gang Bao, Guanghui Hu, and Di Liu. 2012. Numerical Solution of the Kohn-Sham Equation by Finite Element Methods with an Adaptive Mesh Redistribution Technique. Journal of Scienti/f_ic Computing55, 2 (#sep# 2012), 372–391. h/t_tps://doi.org/10.1007/s10915-012-9636-1

  6. [6]

    /T_homas L. Beck. 2009.Real-Space and Multigrid Methods in Computational Chemistry . John Wiley & Sons, Inc., Hoboken, New Jersey, Chapter 5, 223–285. h/t_tps://doi.org/10.1002/9780470399545.ch5

  7. [7]

    Nicolas Bock and Ma/t_t Challacombe. 2013. An optimized sparse approximate matrix multiply for matrices with decay.SIAM Journal on Scienti/f_ic Computing 35, 1 (2013), C72–C98

  8. [8]

    Nicolas Bock, Ma/t_t Challacombe, and Laxmikant V Kal´e. 2016. Solvers for O(N) Electronic Structure in the Strong Scaling Limit. SIAM Journal on Scienti/f_ic Computing38, 1 (2016), C1–C21

  9. [9]

    Urban Borˇstnik, Joost VandeVondele, Val´ery Weber, and J¨urg Hu/t_ter. 2014. Sparse matrix multiplication: /T_he distributed block-compressed sparse row library. Parallel Comput. 40, 5-6 (2014), 47–58

  10. [10]

    Bowler, Ian J

    David R. Bowler, Ian J. Bush, and Michael J. Gillan. 2000. Practical methods for ab initio calculations on thousands of atoms. International Journal of /Q_uantum Chemistry77, 5 (2000), 831–842. Manuscript submi/t_ted to ACM 28 Davydov, D. and Kronbichler, M

  11. [11]

    Bowler, Tsuyoshi Miyazaki, and Michael J

    David R. Bowler, Tsuyoshi Miyazaki, and Michael J. Gillan. 2001. Parallel sparse matrix multiplication for linear scaling electronic structure calculations. Computer physics communications 137, 2 (2001), 255–273

  12. [12]

    Aydin Buluc and John R Gilbert. 2008. Challenges and advances in parallel sparse matrix-matrix multiplication. In 37th International Conference on Parallel Processing. IEEE, 503–510

  13. [13]

    Steven K Burger and Weitao Yang. 2008. Linear-scaling quantum calculations using non-orthogonal localized molecular orbitals. Journal of Physics: Condensed Ma/t_ter20, 29 (2008), 294209

  14. [14]

    Eric J Bylaska, Michael Holst, and John H Weare. 2009. Adaptive Finite Element Method for Solving the Exact Kohn-Sham Equation of Density Functional /T_heory.Journal of Chemical /T_heory and Computation5, 4 (#apr# 2009), 937–948. h/t_tps://doi.org/10.1021/ct800350j

  15. [15]

    Ma/t_t Challacombe. 2000. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory.Computer physics communications 128, 1-2 (2000), 93–107

  16. [16]

    Denis Davydov, Tymo/f_iy Gerasimov, Jean-Paul Pelteret, and Paul Steinmann. 2017. Convergence study of the h-adaptive PUM and the hp-adaptive FEM applied to eigenvalue problems in quantum mechanics. Advanced Modeling and Simulation in Engineering Sciences 4, 1 (12 Dec 2017), 7. h/t_tps://doi.org/10.1186/s40323-017-0093-0

  17. [17]

    Denis Davydov, Timo Heister, Martin Kronbichler, and Paul Steinmann. 2018. Matrix-free locally adaptive /f_inite element solution of density- functional theory with nonorthogonal orbitals and multigrid preconditioning. Physica Status Solidi B: Basic Solid State Physics 255, 9 (2018). h/t_tps://doi.org/10.1002/pssb.201800069

  18. [18]

    Jun Fang, Xingyu Gao, and Aihui Zhou. 2012. A Kohn-Sham equation solver based on hexahedral /f_inite elements.J. Comput. Phys. 231, 8 (2012), 3166–3180

  19. [19]

    Jean-Luc Fa/t_tebert and Jerry Bernholc. 2000. Towards grid-basedO(N) density-functional theory methods: Optimized nonorthogonal orbitals and multigrid acceleration. Phys. Rev. B 62 (Jul 2000), 1713–1722. Issue 3. h/t_tps://doi.org/10.1103/PhysRevB.62.1713

  20. [20]

    Jean-Luc Fa/t_tebert and Francois Gygi. 2004. Linear scaling /f_irst-principles molecular dynamics with controlled accuracy. Computer Physics Communications 162, 1 (2004), 24–36

  21. [21]

    Jean-Luc Fa/t_tebert and Francois Gygi. 2006. Linear-scaling /f_irst-principles molecular dynamics with plane-waves accuracy.Phys. Rev. B 73 (Mar 2006), 115124. Issue 11. h/t_tps://doi.org/10.1103/PhysRevB.73.115124

  22. [22]

    Hornung, and Andrew M

    Jean-Luc Fa/t_tebert, Richard D. Hornung, and Andrew M. Wissink. 2007. Finite element approach for density functional theory calculations on locally-re/f_ined meshes.J. Comput. Phys. 223, 2 (#may# 2007), 759–773. h/t_tps://doi.org/10.1016/j.jcp.2006.10.013

  23. [23]

    Frey and D.J

    J.T. Frey and D.J. Doren. 2011. TubeGen 3.4 (web-interface, h/t_tp://turin.nss.udel.edu/research/tubegenonline.html), University of Delaware, Newark, DE. (2011)

  24. [24]

    Giulia Galli and Michele Parrinello. 1992. Large scale electronic structure calculations. Physical review le/t_ters69, 24 (1992), 3547

  25. [25]

    Goringe, Eduardo Hern ´andez, Michael J

    Chris M. Goringe, Eduardo Hern ´andez, Michael J. Gillan, and Ian J. Bush. 1997. Linear-scaling DFT-pseudopotential calculations on parallel computers. Computer physics communications 102, 1-3 (1997), 1–16

  26. [26]

    Fred G Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical So/f_tware (TOMS)4, 3 (1978), 250–269

  27. [27]

    Heroux, Roscoe A

    Michael A. Heroux, Roscoe A. Bartle/t_t, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. /T_hornquist, Ray S. Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. 2005. An overview of the Trilinos project. ACM Trans. M...

  28. [28]

    Kikuji Hirose and Tomoya Ono. 2001. Direct minimization to generate electronic states with proper occupation numbers. Physical Review B 64, 8 (2001), 085105

  29. [29]

    Kikuji Hirose, Tomoya Ono, Yoshitaka Fujimoto, and Shigeru Tsukamoto. 2005. First-Principles Claculations in Real-Space Formalism . Imperial College Press, 57 Shelton Street, Covent Garden, London WC2H 9HE

  30. [30]

    Hohenberg and Walter Kohn

    Pierre C. Hohenberg and Walter Kohn. 1964. Inhomogeneous electron gas. Physical Review 136, 3B (1964), B864–B871

  31. [31]

    C ¸ataly¨urek, Srinivasan Parthasarathy, and P

    Changwan Hong, Aravind Sukumaran-Rajam, Bortik Bandyopadhyay, Jinsung Kim, S ¨ureyya Emre Kurt, Israt Nisa, Shivani Sabhlok, ¨Umit V. C ¸ataly¨urek, Srinivasan Parthasarathy, and P. Sadayappan. 2018. Efficient Sparse-matrix Multi-vector Product on GPUs. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing...

  32. [32]

    Sohrab Ismail-Beigi and Tomas A. Arias. 1999. Locality of the Density Matrix in Metals, Semiconductors, and Insulators. Physical Review Le/t_ters82 (Mar 1999), 2127–2130. Issue 10. h/t_tps://doi.org/10.1103/PhysRevLe/t_t.82.2127

  33. [33]

    Jeremy Kepner and John Gilbert. 2011. Graph algorithms in the language of linear algebra . SIAM

  34. [34]

    King-Smith and David Vanderbilt

    R.D. King-Smith and David Vanderbilt. 1994. First-principles investigation of ferroelectricity in perovskite compounds. Physical Review B 49, 9 (1994), 5828

  35. [35]

    Walter Kohn. 1959. Analytic Properties of Bloch Waves and Wannier Functions. Physical Review 115 (Aug 1959), 809–821. Issue 4. h/t_tps: //doi.org/10.1103/PhysRev.115.809

  36. [36]

    Walter Kohn and Lu Jeu Sham. 1965. Self-consistent equations including exchange and correlation effects. Physical Review 140, 4A (1965), A1133–A1138. Manuscript submi/t_ted to ACM Algorithms and data structures for matrix-free /f_inite element operators with MPI-parallel sparse multi-vectors 29

  37. [37]

    Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R Bishop. 2014. A uni/f_ied sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scienti/f_ic Computing36, 5 (2014), C401–C423

  38. [38]

    Moritz Kreutzer, Jonas /T_hies, Melven R¨ohrig-Z¨ollner, Andreas Pieper, Faisal Shahzad, Martin Galgon, Achim Basermann, Holger Fehske, Georg Hager, and Gerhard Wellein. 2017. GHOST: building blocks for high performance sparse linear algebra on heterogeneous systems. International Journal of Parallel Programming 45, 5 (2017), 1046–1072

  39. [39]

    Martin Kronbichler and Katharina Kormann. 2012. A generic interface for parallel cell-based /f_inite element operator application.Computers & Fluids 63 (jun 2012), 135–147. h/t_tps://doi.org/10.1016/j.comp/f_luid.2012.04.012

  40. [40]

    Martin Kronbichler and Katharina Kormann. 2019. Fast matrix-free evaluation of discontinuous Galerkin /f_inite element operators.ACM Trans. Math. So/f_tware(2019). In press

  41. [41]

    Jianfeng Lu and Kyle /T_hicke. 2017. Orbital minimization method withl 1 regularization. J. Comput. Phys. 336 (2017), 87–103

  42. [42]

    Francesco Mauri and Giulia Galli. 1994. Electronic-structure calculations and molecular-dynamics simulations with linear system-size scaling. Physical Review B 50, 7 (1994), 4316

  43. [43]

    Michael McCourt, Barry Smith, and Hong Zhang. 2013. Efficient sparse Matrix-Matrix products using colorings. SIAM J. Matrix Anal. Appl. (2013)

  44. [44]

    Phani Motamarri, Sambit Das, Shiva Rudraraju, Krishnendu Ghosh, Denis Davydov, and Vikram Gavini. 2019. DFT-FE - A massively parallel adaptive /f_inite-element code for large-scale density functional theory calculations.Computer Physics Communications (2019)

  45. [45]

    Phani Motamarri, Michael R Nowak, Kenneth Leiter, Jaroslaw Knap, and Vikram Gavini. 2012. Higher-order adaptive /f_inite-element methods for Kohn-Sham density functional theory. J. Comput. Phys. 253, 15 (#jun# 2012), 308–343

  46. [46]

    Kurtis L. Nusbaum. 2011. Optimizing Tpetra’s sparse matrix-matrix multiplication routine . Technical Report. Technical report, SAND2011–6036, Sandia National Laboratories, Albuquerque, NM

  47. [47]

    Pablo Ordej´on, David A Drabold, Richard M Martin, and Ma/t_thew P Grumbach. 1995. Linear system-size scaling methods for electronic-structure calculations. Physical Review B 51, 3 (1995), 1456

  48. [48]

    Pask, Barry M

    John E. Pask, Barry M. Klein, Philip A. Sterne, and Chingyao Y. Fong. 2001. Finite-element methods in electronic-structure theory. Computer Physics Communications 135, 1 (2001), 1–34

  49. [49]

    Pask and Natarajan Sukumar

    John E. Pask and Natarajan Sukumar. 2017. Partition of unity /f_inite element method for quantum mechanical materials calculations. Extreme Mechanics Le/t_ters11 (2017), 8–17. h/t_tps://doi.org/10.1016/j.eml.2016.11.003

  50. [50]

    Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jongsoo Park, Michael J Anderson, Satya Gautam Vadlamudi, Dipankar Das, Sergey G Pudov, Vadim O Pirogov, and Pradeep Dubey. 2015. Parallel efficient sparse matrix-matrix multiplication on multicore platforms. In International Conference on High Performance Computing . Springer, 48–57

  51. [51]

    Liang Peng, Feng Long Gu, and Weitao Yang. 2013. Effective preconditioning for ab initio ground state energy minimization with non-orthogonal localized molecular orbitals. Physical Chemistry Chemical Physics 15, 37 (2013), 15518–15527

  52. [52]

    L Ramdas Ram-Mohan. 2002. Finite Element and Boundary Element Applications in /Q_uantum Mechanics. Oxford University Press

  53. [53]

    Emanuel H Rubensson, Elias Rudberg, and Pawe l Sa lek. 2006. Sparse matrix algebra for quantum modeling of large systems. In International Workshop on Applied Parallel Computing. Springer, 90–99

  54. [54]

    Emanuel H Rubensson, Elias Rudberg, and Pawe l Sa lek. 2007. A hierarchic sparse matrix data structure for large-scale Hartree-Fock/Kohn-Sham calculations. Journal of computational chemistry 28, 16 (2007), 2531–2537

  55. [55]

    Elias Rudberg, Emanuel H Rubensson, Pawe l Sa lek, and Anastasia Kruchinina. 2018. Ergo: An open-source program for linear-scaling electronic structure calculations. So/f_twareX7 (2018), 107–111

  56. [56]

    Chandra Saravanan, Yihan Shao, Roi Baer, Philip N Ross, and Martin Head-Gordon. 2003. Sparse matrix multiplications for linear scaling electronic structure calculations in an atom-centered basis set using multiatom blocks. Journal of computational chemistry 24, 5 (2003), 618–622

  57. [57]

    Volker Schauer and Christian Linder. 2015. /T_he reduced basis method in all-electron calculations with /f_inite elements.Advances in Computational Mathematics 41, 5 (2015), 1035–1047. h/t_tps://doi.org/10.1007/s10444-014-9374-z

  58. [58]

    Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of PSTI2010, the First International Workshop on Parallel So/f_tware Tools and Tool Infrastructures. San Diego CA

  59. [59]

    Eiji Tsuchida and Masaru Tsukada. 1996. Adaptive /f_inite-element method for electronic-structure calculations. Phys. Rev. B 54, 11 (sep 1996), 7602–7605. h/t_tps://doi.org/10.1103/physrevb.54.7602

  60. [60]

    Eiji Tsuchida and Masaru Tsukada. 1998. Large-Scale Electronic-Structure Calculations Based on the Adaptive Finite-Element Method. Journal of the Physical Society of Japan 67, 11 (1998), 3844–3858. h/t_tps://doi.org/10.1143/JPSJ.67.3844

  61. [61]

    Young, Eloy Romero, and Jose E

    Toby D. Young, Eloy Romero, and Jose E. Roman. 2013. Parallel /f_inite element density functional computations exploiting grid re/f_inement and subspace recycling. In Computer Physics Communications. IEEE, 66–72. h/t_tps://doi.org/10.1109/DFT.2011.68

  62. [62]

    Dier Zhang, Lihua Shen, Aihui Zhou, and Xin-Gao Gong. 2008. Finite element method for solving Kohn–Sham equations based on self-adaptive tetrahedral mesh. Physics Le/t_ters A372, 30 (#jul# 2008), 5071–5076. h/t_tps://doi.org/10.1016/j.physleta.2008.05.075 Manuscript submi/t_ted to ACM