A Hermite-like basis for faster matrix-free evaluation of interior penalty discontinuous Galerkin operators
Pith reviewed 2026-05-24 19:20 UTC · model grok-4.3
The pith
A Hermite-like basis limits neighbor access in matrix-free DG evaluations to one function value and one derivative per neighbor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Hermite-like basis allows matrix-free evaluation of symmetric interior penalty DG discretizations on hexahedra with neighbor access restricted to one function value and one derivative, achieved by nodal points from Jacobi polynomial roots and tensor-product extension that preserves sum-factorization structure; the resulting data-access reduction improves throughput while a sum-factorization basis change restores effective multigrid performance at negligible extra cost.
What carries the argument
The Hermite-like basis, which enforces a minimal stencil of one function value plus one derivative from each neighbor while supporting sum factorization.
If this is right
- Matrix-free operator application requires less memory traffic per degree of freedom than standard nodal bases.
- Sum factorization remains applicable in three dimensions despite the non-standard basis.
- A plain point-Jacobi multigrid smoother becomes less efficient than with best nodal polynomials, but a sum-factorization basis change restores usable multigrid constituents.
- The basis-change cost can be hidden behind the dominant data-access cost on modern hardware.
Where Pith is reading between the lines
- The same minimal-stencil idea could be tested on other DG flux formulations or on non-hexahedral elements to see whether the access reduction generalizes.
- In large-scale parallel runs the lower per-element data volume might improve strong scaling once communication volume is measured.
- Combining the basis with cache-blocking or vectorization strategies not discussed in the paper could yield further speed-ups.
Load-bearing premise
The nodal points taken from roots of Jacobi polynomials continue to deliver the one-value-plus-one-derivative stencil property after extension to higher orders and multiple dimensions by tensor products.
What would settle it
Run the operator evaluation for polynomial degree four in three dimensions and count the distinct neighbor data items actually loaded; if the count exceeds one value and one derivative per face neighbor, the minimal-stencil claim does not hold.
Figures
read the original abstract
This work proposes a basis for improved throughput of matrix-free evaluation of discontinuous Galerkin symmetric interior penalty discretizations on hexahedral elements. The basis relies on ideas of Hermite polynomials. It is used in a fully discontinuous setting not for higher order continuity but to minimize the effective stencil width, namely to limit the neighbor access of an element to one data point for the function value and one for the derivative. The basis is extended to higher orders with nodal contributions derived from roots of Jacobi polynomials and extended to multiple dimensions with tensor products, which enable the use of sum factorization. The beneficial effect of the reduced data access on modern processors is shown. Furthermore, the viability of the basis in the context of multigrid solvers is analyzed. While a plain point-Jacobi approach is less efficient than with the best nodal polynomials, a basis change via sum-factorization techniques enables the combination of the fast matrix-vector products with effective multigrid constituents. The basis change is essentially for free on modern hardware because these computations can be hidden behind the cost of the data access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Hermite-like nodal basis for matrix-free evaluation of symmetric interior penalty discontinuous Galerkin (SIPG) operators on hexahedral elements. The basis is intended to minimize the effective stencil in a fully discontinuous setting by restricting neighbor data access to exactly one function value and one normal derivative per adjacent element. The construction is extended from low-order Hermite functions to higher polynomial degrees via nodal points at roots of Jacobi polynomials and to multiple dimensions via tensor products, enabling sum factorization. Performance benefits for throughput on modern processors are claimed, and viability within multigrid solvers is analyzed via an essentially free basis change.
Significance. If the minimal-stencil property is preserved under the stated extensions, the approach could reduce memory-bandwidth pressure in high-order matrix-free DG kernels, which is often the dominant cost on current hardware. The integration with multigrid through sum-factorization basis changes is a constructive feature that addresses a practical limitation of non-standard bases.
major comments (2)
- [§3] §3 (basis construction and tensor-product extension): the central claim that Jacobi-root nodal contributions plus tensor products preserve the one-value-plus-one-derivative neighbor stencil is not accompanied by an explicit verification, proof, or numerical check. SIPG face integrals depend on traces and normal derivatives; without showing that all other couplings remain identically zero for p>1, the claimed data-access reduction is unsupported.
- [§5] §5 (numerical results): the text asserts that beneficial throughput effects are shown and that the basis change can be hidden behind data access, yet no quantitative timing tables, flop counts, or direct comparisons to standard nodal bases are referenced that would confirm the stencil reduction actually materializes in the implemented kernels for the tested orders and dimensions.
minor comments (2)
- Notation for the one-dimensional basis functions and their extension to faces could be made more explicit with a small worked example (e.g., p=2) to aid reproducibility.
- The multigrid section would benefit from a brief statement of the smoother and coarse-grid operator used, even if the focus is on the basis change cost.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the two major points below and commit to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [§3] §3 (basis construction and tensor-product extension): the central claim that Jacobi-root nodal contributions plus tensor products preserve the one-value-plus-one-derivative neighbor stencil is not accompanied by an explicit verification, proof, or numerical check. SIPG face integrals depend on traces and normal derivatives; without showing that all other couplings remain identically zero for p>1, the claimed data-access reduction is unsupported.
Authors: The 1D Hermite-like construction at Jacobi roots is chosen precisely so that all interior nodal contributions to a face integral vanish except the value and normal derivative at the face point; the tensor-product extension inherits this because SIPG face terms act only in the normal direction while the other two directions remain decoupled. We agree an explicit verification is missing from the current text. We will insert a short subsection in §3 containing (i) a direct algebraic argument that off-face couplings are identically zero for any p and (ii) a numerical check confirming the stencil for p=2,3,4 on a single face. revision: yes
-
Referee: [§5] §5 (numerical results): the text asserts that beneficial throughput effects are shown and that the basis change can be hidden behind data access, yet no quantitative timing tables, flop counts, or direct comparisons to standard nodal bases are referenced that would confirm the stencil reduction actually materializes in the implemented kernels for the tested orders and dimensions.
Authors: The manuscript reports throughput gains, yet we accept that the presentation lacks the explicit tables and flop-count breakdowns the referee requests. We will expand §5 with (i) wall-clock timing tables for the matrix-free operator on hexahedral meshes up to p=5 in 3D, (ii) arithmetic intensity and flop counts contrasting the new basis against standard nodal bases, and (iii) a direct side-by-side comparison demonstrating the measured reduction in neighbor data movement. revision: yes
Circularity Check
No circularity: construction uses standard polynomial roots and tensor products without reduction to fitted inputs or self-citations
full rationale
The paper presents an explicit basis construction starting from Hermite-like ideas, extended via roots of Jacobi polynomials for higher orders and tensor products for multiple dimensions to enable sum factorization. No equations or claims reduce a performance prediction or stencil property to a fitted parameter or prior self-citation by definition; the minimal neighbor access is a direct consequence of the chosen nodal points and the discontinuous setting, with viability shown through implementation rather than assumed uniqueness theorems. The derivation chain is self-contained against external polynomial properties and does not invoke load-bearing self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard algebraic and orthogonality properties of Hermite and Jacobi polynomials hold and can be used to define nodal points and basis functions.
Reference graph
Works this paper leans on
-
[1]
M. Adams, M. Brezina, J. Hu, and R. Tuminaro , Parallel multigrid smoothing: polynomial versus Gauss–Seidel, J. Comput. Phys., 188 (2003), pp. 593–610, doi:10.1016/S0 021-9991(03)00194-3
work page doi:10.1016/s0 2003
-
[2]
M. Ainsworth, G. Andriamaro, and O. Davydov , Bernstein–B´ ezier finite elements of arbi- trary order and optimal assembly procedures , SIAM J. Sci. Comput., 33 (2011), pp. 3087–3109, doi:10.1137/11082539X
-
[3]
G. Alzetta, D. Arndt, W. Bangerth, V. Boddu, B. Brands, D. Dav ydov, R. Gassmoeller, T. Heister, L. Heltai, K. Kormann, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Tur- cksin, and D. Wells , The deal.II library, version 9.0 , J. Numer. Math., 26 (2018), pp. 173–184, doi:10.1515/jnma-2018-0054, www.dealii.org
-
[4]
A Geometric Heuristic for Rectilinear Crossing Minimization
D. Appel ¨o, T. Hagstrom, and A. V argas , Hermite methods for the scalar wave equation , SIAM J. Sci. Comput., 40 (2018), pp. A3902–A3927, doi:10.1137/1 8M1171072
work page doi:10.1137/1 2018
-
[5]
D. Arnold, F. Brezzi, B. Cockburn, and L. D. Marini , Unified analysis of discontinu- ous Galerkin methods for elliptic problems , SIAM J. Numer. Anal., 39 (2002), pp. 1749–1779, doi:10.1137/S0036142901384162
-
[6]
W. Bangerth, C. Burstedde, T. Heister, and M. Kronbichler , Algorithms and data structures for massively parallel generic finite element codes , ACM Trans. Math. Softw., 38 (2011), pp. 14:1– 14:28, doi:10.1145/2049673.2049678
-
[7]
S. Beuchler, V. Pill wein, and S. Zaglmayr , Sparsity optimized high order finite element functions for H(div) on simplices , Numer. Math., 122 (2012), pp. 197–225, doi:10.1007/s0021 1-012-0461-0
-
[8]
C. Burstedde, L. C. Wilcox, and O. Ghattas , p4est: Scalable algorithms for parallel adap- tive mesh refinement on forests of octrees , SIAM J. Sci. Comput., 33 (2011), pp. 1103–1133, doi:10.1137/10079163, http://p4est.org
-
[9]
B. Cockburn, J. Gopalakrishnan, and R. Lazarov , Unified hybridization of discontinuous Galerkin, mixed, and continuous Galerkin methods for secon d order elliptic equations , SIAM J. Numer. Anal., 47 (2009), pp. 1139–1365, doi:10.1137/070706616
-
[11]
M. O. Deville, P. F. Fischer, and E. H. Mund , High-order methods for incompressible fluid flow , vol. 9, Cambridge University Press, 2002
work page 2002
-
[12]
Y. Epshteyn and B. Rivi `ere, Estimation of penalty parameters for symmetric interior pe nalty Galerkin methods, J. Comput. Appl. Math., 206 (2007), pp. 843–872, doi:10.10 16/j.cam.2006.08.029
work page 2007
-
[13]
N. Fehn, W. A. W all, and M. Kronbichler , Efficiency of high-performance discontinuous Galerkin spectral element methods for under-resolved turb ulent incompressible flows , Int. J. Numer. Meth. Fluids, 88 (2018), pp. 32–54, doi:10.1002/fld.4511
work page 2018
-
[14]
N. Fehn, W. A. W all, and M. Kronbichler , Robust and efficient discontinuous Galerkin meth- ods for under-resolved turbulent incompressible flows , J. Comput. Phys., 372 (2018), pp. 667–693, doi:10.1016/j.jcp.2018.06.037
-
[15]
P. Fischer, T. Rathnayake, S. Dutta, V. Dobrev, T. Kolev, J.- S. Camier, M. Kronbichler, K. ´Swirydowicz, T. W arburton, J. Brown, and M. Min , Running faster in HPC applications , in preparation, (2019). http://ceed.exascaleproject.org
work page 2019
-
[16]
A. Gholami, D. Malhotra, H. Sundar, and G. Biros , FFT, FMM, or multigrid? A comparative study of state-of-the-art Poisson solvers for uniform and n onuniform grids in the unit cube , SIAM J. Sci. Comput., 38 (2016), pp. C280–C306, doi:10.1137/15M10 10798
-
[17]
W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith , Performance modeling and tuning of an unstructured mesh CFD application , in SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, 2000, pp. 34–34, doi:10.1109/SC.2000. 10059
-
[18]
G. Hager and G. Wellein , Introduction to High Performance Computing for Scientists and Engi- neers, CRC Press, Boca Raton, 2011
work page 2011
-
[19]
P. W. Hemker, W. Hoffmann, and M. H. V an Raalte , Two-level Fourier analysis of a multigrid approach for discontinuous Galerkin discretization , SIAM J. Sci. Comput., 25 (2003), pp. 1018–1041, doi:10.1137/S1064827502405100
-
[20]
F. Hindenlang, G. Gassner, C. Altmann, A. Beck, M. Staudenma ier, and C.-D. Munz , Explicit discontinuous Galerkin methods for unsteady prob lems, Comput. Fluids, 61 (2012), pp. 86– 93, doi:10.1016/j.compfluid.2012.03.006. 19
-
[21]
I. Huismann, J. Stiller, and J. Fr ¨ohlich, Factorizing the factorization – a spectral-element solver for elliptic equations with linear operation count , J. Comput. Phys., 346 (2017), pp. 437–448, doi:10.1016/j.jcp.2017.06.012
-
[22]
R. C. Kirby , Fast inversion of the simplicial Bernstein mass matrix , Numer. Math., 135 (2017), pp. 73–95, doi:10.1007/s00211-016-0795-0
-
[23]
M. G. Knepley, J. Brown, K. Rupp, and B. F. Smith , Achieving high performance with unified residual evaluation , arXiv preprint cs.MS, 1309.1204 (2013), pp. 1–4
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[24]
Kopriva, Implementing spectral methods for partial differential equa tions, Springer, Berlin, 2009
D. Kopriva, Implementing spectral methods for partial differential equa tions, Springer, Berlin, 2009
work page 2009
-
[25]
M. Kronbichler and K. Kormann , A generic interface for parallel cell-based finite element o perator application, Comput. Fluids, 63 (2012), pp. 135–147, doi:10.1016/j.co mpfluid.2012.04.012
-
[26]
M. Kronbichler and K. Kormann , Fast matrix-free evaluation of discontinuous Galerkin fini te element operators, ACM Trans. Math. Softw., in press (2019), pp. 1–37, doi:10. 1145/3325864
work page 2019
-
[27]
M. Kronbichler and K. Ljungkvist , Multigrid for matrix-free high-order finite element computations on graphics processors , ACM Trans. Parallel Comput., 6 (2019), pp. 2:1–2:32, doi:10.1145/3322813
-
[28]
M. Kronbichler, S. Schoeder, C. M ¨uller, and W. A. W all , Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acous tic wave equation , Int. J. Numer. Meth. Eng., 106 (2016), pp. 712–739, doi:10.1002/nme.5137
-
[29]
M. Kronbichler and W. A. W all , A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers , SIAM J. Sci. Comput., 40 (2018), pp. A3423–A3448, doi:10.1137/16M110455X
-
[30]
C. Lehrenfeld and J. Sch ¨oberl, High order exactly divergence-free hybrid discontinuous G alerkin methods for unsteady incompressible flows , Comput. Meth. Appl. Mech. Eng., 307 (2016), pp. 339–361, doi:10.1016/j.cma.2016.04.025
-
[31]
J. W. Lottes and P. F. Fischer , Hybrid multigrid/Schwarz algorithms for the spectral elem ent method, J. Sci. Comput., 24 (2005), pp. 45–78, doi:10.1007/s10915 -004-4787
-
[32]
R. E. Lynch, J. R. Rice, and D. H. Thomas , Direct solution of partial difference equations by tensor product methods , Numer. Math., 6 (1964), pp. 185–199, doi:10.1007/BF01386 067
-
[33]
S. A. Orszag , Spectral methods for problems in complex geometries , J. Comput. Phys., 37 (1980), pp. 70–92, doi:10.1016/0021-9991(80)90005-4
-
[34]
J. Peraire and P.-O. Persson , The compact discontinuous Galerkin (CDG) method for ellipt ic problems, SIAM J. Sci. Comput., 30 (2008), pp. 1806–1824, doi:10.113 7/070685518
work page 2008
-
[35]
P. Persson , A sparse and high-order accurate line-based discontinuous Galerkin method for unstruc- tured meshes, J. Comput. Phys., (2013), pp. 414–429, doi:10.1016/j.jcp .2012.09.008
-
[36]
M. Piatkowski, S. M ¨uthing, and P. Bastian , A stable and high-order accurate discontinuous Galerkin based splitting method for the incompressible Nav ier–Stokes equations , J. Comput. Phys., 356 (2018), pp. 220–239, doi:10.1016/j.jcp.2017.11.035
-
[37]
S. Schoeder, K. Kormann, W. A. W all, and M. Kronbichler , Efficient explicit time stepping of high order discontinuous Galerkin schemes for waves , SIAM J. Sci. Comput., 40 (2018), pp. C803– C826, doi:10.1137/18M1185399
-
[38]
Solomonoff , A fast algorithm for spectral differentiation , J
A. Solomonoff , A fast algorithm for spectral differentiation , J. Comput. Phys., 98 (1992), pp. 174– 177, doi:10.1016/0021-9991(92)90182-X
-
[39]
H. Sundar, G. Stadler, and G. Biros , Comparison of multigrid algorithms for high-order continuous finite element discretizations , Numer. Linear Algebra Appl., 22 (2015), pp. 664–680, doi:10.1002/nla.1979
-
[40]
J. Treibig, G. Hager, and G. Wellein , LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments , in Proceedings of PSTI2010, the First International Works hop on Parallel Software Tools and Tool Infrastructures, San Dieg o CA, 2010, doi:10.1109/ICPPW.2010.38. https://github.com/RRZE-HPC/likwid, retrieved on June 19, 2019
-
[41]
U. Trottenberg, C. Oosterlee, and A. Sch ¨uller, Multigrid, Elsevier Academic Press, London, 2001
work page 2001
-
[42]
R. S. V arga, Matrix iterative analysis , Springer, Berlin, 2nd ed., 2009
work page 2009
-
[43]
S. Williams, A. W aterman, and D. P atterson , Roofline: An insightful visual performance model for multicore architectures, Commun. ACM, 52 (2009), pp. 65–76, doi:10.1145/1498765.1 498785
- [44]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.