pith. sign in

arxiv: 1907.08492 · v1 · pith:2XZNWCHLnew · submitted 2019-07-19 · 🧮 math.NA · cs.MS· cs.NA· cs.PF

A Hermite-like basis for faster matrix-free evaluation of interior penalty discontinuous Galerkin operators

Pith reviewed 2026-05-24 19:20 UTC · model grok-4.3

classification 🧮 math.NA cs.MScs.NAcs.PF
keywords discontinuous Galerkinmatrix-free methodsinterior penaltyHermite-like basissum factorizationmultigridhexahedral elementsJacobi polynomials
0
0 comments X

The pith

A Hermite-like basis limits neighbor access in matrix-free DG evaluations to one function value and one derivative per neighbor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a basis modeled on Hermite polynomials to speed up matrix-free evaluation of symmetric interior penalty discontinuous Galerkin operators on hexahedral elements. It applies the basis in a fully discontinuous setting specifically to shrink the effective stencil width, so each element needs only one neighboring data point for the function value and one for the derivative. The construction extends to higher polynomial orders by placing nodes at roots of Jacobi polynomials and to three dimensions by tensor products that support sum factorization. The reduced memory traffic raises throughput on current processors, and a cheap basis change via sum factorization keeps the method compatible with multigrid solvers even though a plain point-Jacobi smoother performs worse than with standard nodal bases.

Core claim

The Hermite-like basis allows matrix-free evaluation of symmetric interior penalty DG discretizations on hexahedra with neighbor access restricted to one function value and one derivative, achieved by nodal points from Jacobi polynomial roots and tensor-product extension that preserves sum-factorization structure; the resulting data-access reduction improves throughput while a sum-factorization basis change restores effective multigrid performance at negligible extra cost.

What carries the argument

The Hermite-like basis, which enforces a minimal stencil of one function value plus one derivative from each neighbor while supporting sum factorization.

If this is right

  • Matrix-free operator application requires less memory traffic per degree of freedom than standard nodal bases.
  • Sum factorization remains applicable in three dimensions despite the non-standard basis.
  • A plain point-Jacobi multigrid smoother becomes less efficient than with best nodal polynomials, but a sum-factorization basis change restores usable multigrid constituents.
  • The basis-change cost can be hidden behind the dominant data-access cost on modern hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal-stencil idea could be tested on other DG flux formulations or on non-hexahedral elements to see whether the access reduction generalizes.
  • In large-scale parallel runs the lower per-element data volume might improve strong scaling once communication volume is measured.
  • Combining the basis with cache-blocking or vectorization strategies not discussed in the paper could yield further speed-ups.

Load-bearing premise

The nodal points taken from roots of Jacobi polynomials continue to deliver the one-value-plus-one-derivative stencil property after extension to higher orders and multiple dimensions by tensor products.

What would settle it

Run the operator evaluation for polynomial degree four in three dimensions and count the distinct neighbor data items actually loaded; if the count exceeds one value and one derivative per face neighbor, the minimal-stencil claim does not hold.

Figures

Figures reproduced from arXiv: 1907.08492 by Julius Witte, Katharina Kormann, Martin Kronbichler, Niklas Fehn, Peter Munch.

Figure 1
Figure 1. Figure 1: Classical Hermite basis functions (7) (left) and proposed [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Hermite-like basis for p = 1 (left), p = 2 (middle), and p = 7 (right). derivatives in reference space as well, for which the values on the face are needed. For evaluating only the values on a face, e.g. for a hyperbolic term, a single value is touched via Sf . 2. Construction for p ≤ 2. We require Df = [−α1, α1, 0] at ξ = 0 and Df = [0, −α1, α1] at ξ = 1 with α1 = 2 for p = 2, see [PITH_FULL_IMAGE:figure… view at source ↗
Figure 3
Figure 3. Figure 3: Data access pattern for p = 5 in terms of values read (black circles) for the element shaded in blue and values read and written (black disks). Note that the Hermite-like basis is not nodal and the two layers of nodes closest to the surface are highlighted for illustration purposes. information must also be loaded. Computing a tri-linear mapping adds only little memory transfer for p ≥ 3 and is of high ari… view at source ↗
Figure 4
Figure 4. Figure 4: Throughput of double-precision matrix-vector product [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Analysis of data transfer over various memory hierarchie [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Throughput of double-precision matrix-vector product [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Analysis of evaluation of operator P −1 with basis transformation (12) versus a diagonal representation as function of the degree with 30–56 million DoF (left panel), as function of the size (middle panel), and as function of the evaluation time (right panel). The middle and right panels use the same data. All operations are done in single precision and with OpenMP on Xeon Gold. 6.3 Evaluation of P −1 A mo… view at source ↗
Figure 8
Figure 8. Figure 8: Throughput of one step within single-precision Chebyshev [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
read the original abstract

This work proposes a basis for improved throughput of matrix-free evaluation of discontinuous Galerkin symmetric interior penalty discretizations on hexahedral elements. The basis relies on ideas of Hermite polynomials. It is used in a fully discontinuous setting not for higher order continuity but to minimize the effective stencil width, namely to limit the neighbor access of an element to one data point for the function value and one for the derivative. The basis is extended to higher orders with nodal contributions derived from roots of Jacobi polynomials and extended to multiple dimensions with tensor products, which enable the use of sum factorization. The beneficial effect of the reduced data access on modern processors is shown. Furthermore, the viability of the basis in the context of multigrid solvers is analyzed. While a plain point-Jacobi approach is less efficient than with the best nodal polynomials, a basis change via sum-factorization techniques enables the combination of the fast matrix-vector products with effective multigrid constituents. The basis change is essentially for free on modern hardware because these computations can be hidden behind the cost of the data access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Hermite-like nodal basis for matrix-free evaluation of symmetric interior penalty discontinuous Galerkin (SIPG) operators on hexahedral elements. The basis is intended to minimize the effective stencil in a fully discontinuous setting by restricting neighbor data access to exactly one function value and one normal derivative per adjacent element. The construction is extended from low-order Hermite functions to higher polynomial degrees via nodal points at roots of Jacobi polynomials and to multiple dimensions via tensor products, enabling sum factorization. Performance benefits for throughput on modern processors are claimed, and viability within multigrid solvers is analyzed via an essentially free basis change.

Significance. If the minimal-stencil property is preserved under the stated extensions, the approach could reduce memory-bandwidth pressure in high-order matrix-free DG kernels, which is often the dominant cost on current hardware. The integration with multigrid through sum-factorization basis changes is a constructive feature that addresses a practical limitation of non-standard bases.

major comments (2)
  1. [§3] §3 (basis construction and tensor-product extension): the central claim that Jacobi-root nodal contributions plus tensor products preserve the one-value-plus-one-derivative neighbor stencil is not accompanied by an explicit verification, proof, or numerical check. SIPG face integrals depend on traces and normal derivatives; without showing that all other couplings remain identically zero for p>1, the claimed data-access reduction is unsupported.
  2. [§5] §5 (numerical results): the text asserts that beneficial throughput effects are shown and that the basis change can be hidden behind data access, yet no quantitative timing tables, flop counts, or direct comparisons to standard nodal bases are referenced that would confirm the stencil reduction actually materializes in the implemented kernels for the tested orders and dimensions.
minor comments (2)
  1. Notation for the one-dimensional basis functions and their extension to faces could be made more explicit with a small worked example (e.g., p=2) to aid reproducibility.
  2. The multigrid section would benefit from a brief statement of the smoother and coarse-grid operator used, even if the focus is on the basis change cost.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§3] §3 (basis construction and tensor-product extension): the central claim that Jacobi-root nodal contributions plus tensor products preserve the one-value-plus-one-derivative neighbor stencil is not accompanied by an explicit verification, proof, or numerical check. SIPG face integrals depend on traces and normal derivatives; without showing that all other couplings remain identically zero for p>1, the claimed data-access reduction is unsupported.

    Authors: The 1D Hermite-like construction at Jacobi roots is chosen precisely so that all interior nodal contributions to a face integral vanish except the value and normal derivative at the face point; the tensor-product extension inherits this because SIPG face terms act only in the normal direction while the other two directions remain decoupled. We agree an explicit verification is missing from the current text. We will insert a short subsection in §3 containing (i) a direct algebraic argument that off-face couplings are identically zero for any p and (ii) a numerical check confirming the stencil for p=2,3,4 on a single face. revision: yes

  2. Referee: [§5] §5 (numerical results): the text asserts that beneficial throughput effects are shown and that the basis change can be hidden behind data access, yet no quantitative timing tables, flop counts, or direct comparisons to standard nodal bases are referenced that would confirm the stencil reduction actually materializes in the implemented kernels for the tested orders and dimensions.

    Authors: The manuscript reports throughput gains, yet we accept that the presentation lacks the explicit tables and flop-count breakdowns the referee requests. We will expand §5 with (i) wall-clock timing tables for the matrix-free operator on hexahedral meshes up to p=5 in 3D, (ii) arithmetic intensity and flop counts contrasting the new basis against standard nodal bases, and (iii) a direct side-by-side comparison demonstrating the measured reduction in neighbor data movement. revision: yes

Circularity Check

0 steps flagged

No circularity: construction uses standard polynomial roots and tensor products without reduction to fitted inputs or self-citations

full rationale

The paper presents an explicit basis construction starting from Hermite-like ideas, extended via roots of Jacobi polynomials for higher orders and tensor products for multiple dimensions to enable sum factorization. No equations or claims reduce a performance prediction or stencil property to a fitted parameter or prior self-citation by definition; the minimal neighbor access is a direct consequence of the chosen nodal points and the discontinuous setting, with viability shown through implementation rather than assumed uniqueness theorems. The derivation chain is self-contained against external polynomial properties and does not invoke load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on standard properties of Hermite and Jacobi polynomials without introducing new fitted parameters or postulated entities.

axioms (1)
  • standard math Standard algebraic and orthogonality properties of Hermite and Jacobi polynomials hold and can be used to define nodal points and basis functions.
    Invoked to construct the basis and its nodal contributions.

pith-pipeline@v0.9.0 · 5736 in / 1156 out tokens · 20242 ms · 2026-05-24T19:20:01.228483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    Adams, M

    M. Adams, M. Brezina, J. Hu, and R. Tuminaro , Parallel multigrid smoothing: polynomial versus Gauss–Seidel, J. Comput. Phys., 188 (2003), pp. 593–610, doi:10.1016/S0 021-9991(03)00194-3

  2. [2]

    Ainsworth, G

    M. Ainsworth, G. Andriamaro, and O. Davydov , Bernstein–B´ ezier finite elements of arbi- trary order and optimal assembly procedures , SIAM J. Sci. Comput., 33 (2011), pp. 3087–3109, doi:10.1137/11082539X

  3. [3]

    Alzetta, D

    G. Alzetta, D. Arndt, W. Bangerth, V. Boddu, B. Brands, D. Dav ydov, R. Gassmoeller, T. Heister, L. Heltai, K. Kormann, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Tur- cksin, and D. Wells , The deal.II library, version 9.0 , J. Numer. Math., 26 (2018), pp. 173–184, doi:10.1515/jnma-2018-0054, www.dealii.org

  4. [4]

    A Geometric Heuristic for Rectilinear Crossing Minimization

    D. Appel ¨o, T. Hagstrom, and A. V argas , Hermite methods for the scalar wave equation , SIAM J. Sci. Comput., 40 (2018), pp. A3902–A3927, doi:10.1137/1 8M1171072

  5. [5]

    Arnold, F

    D. Arnold, F. Brezzi, B. Cockburn, and L. D. Marini , Unified analysis of discontinu- ous Galerkin methods for elliptic problems , SIAM J. Numer. Anal., 39 (2002), pp. 1749–1779, doi:10.1137/S0036142901384162

  6. [6]

    Bangerth, C

    W. Bangerth, C. Burstedde, T. Heister, and M. Kronbichler , Algorithms and data structures for massively parallel generic finite element codes , ACM Trans. Math. Softw., 38 (2011), pp. 14:1– 14:28, doi:10.1145/2049673.2049678

  7. [7]

    Beuchler, V

    S. Beuchler, V. Pill wein, and S. Zaglmayr , Sparsity optimized high order finite element functions for H(div) on simplices , Numer. Math., 122 (2012), pp. 197–225, doi:10.1007/s0021 1-012-0461-0

  8. [8]

    Burstedde, L

    C. Burstedde, L. C. Wilcox, and O. Ghattas , p4est: Scalable algorithms for parallel adap- tive mesh refinement on forests of octrees , SIAM J. Sci. Comput., 33 (2011), pp. 1103–1133, doi:10.1137/10079163, http://p4est.org

  9. [9]

    Cockburn, J

    B. Cockburn, J. Gopalakrishnan, and R. Lazarov , Unified hybridization of discontinuous Galerkin, mixed, and continuous Galerkin methods for secon d order elliptic equations , SIAM J. Numer. Anal., 47 (2009), pp. 1139–1365, doi:10.1137/070706616

  10. [11]

    M. O. Deville, P. F. Fischer, and E. H. Mund , High-order methods for incompressible fluid flow , vol. 9, Cambridge University Press, 2002

  11. [12]

    Epshteyn and B

    Y. Epshteyn and B. Rivi `ere, Estimation of penalty parameters for symmetric interior pe nalty Galerkin methods, J. Comput. Appl. Math., 206 (2007), pp. 843–872, doi:10.10 16/j.cam.2006.08.029

  12. [13]

    N. Fehn, W. A. W all, and M. Kronbichler , Efficiency of high-performance discontinuous Galerkin spectral element methods for under-resolved turb ulent incompressible flows , Int. J. Numer. Meth. Fluids, 88 (2018), pp. 32–54, doi:10.1002/fld.4511

  13. [14]

    N. Fehn, W. A. W all, and M. Kronbichler , Robust and efficient discontinuous Galerkin meth- ods for under-resolved turbulent incompressible flows , J. Comput. Phys., 372 (2018), pp. 667–693, doi:10.1016/j.jcp.2018.06.037

  14. [15]

    Fischer, T

    P. Fischer, T. Rathnayake, S. Dutta, V. Dobrev, T. Kolev, J.- S. Camier, M. Kronbichler, K. ´Swirydowicz, T. W arburton, J. Brown, and M. Min , Running faster in HPC applications , in preparation, (2019). http://ceed.exascaleproject.org

  15. [16]

    Gholami, D

    A. Gholami, D. Malhotra, H. Sundar, and G. Biros , FFT, FMM, or multigrid? A comparative study of state-of-the-art Poisson solvers for uniform and n onuniform grids in the unit cube , SIAM J. Sci. Comput., 38 (2016), pp. C280–C306, doi:10.1137/15M10 10798

  16. [17]

    W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith , Performance modeling and tuning of an unstructured mesh CFD application , in SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, 2000, pp. 34–34, doi:10.1109/SC.2000. 10059

  17. [18]

    Hager and G

    G. Hager and G. Wellein , Introduction to High Performance Computing for Scientists and Engi- neers, CRC Press, Boca Raton, 2011

  18. [19]

    P. W. Hemker, W. Hoffmann, and M. H. V an Raalte , Two-level Fourier analysis of a multigrid approach for discontinuous Galerkin discretization , SIAM J. Sci. Comput., 25 (2003), pp. 1018–1041, doi:10.1137/S1064827502405100

  19. [20]

    Coclite, S

    F. Hindenlang, G. Gassner, C. Altmann, A. Beck, M. Staudenma ier, and C.-D. Munz , Explicit discontinuous Galerkin methods for unsteady prob lems, Comput. Fluids, 61 (2012), pp. 86– 93, doi:10.1016/j.compfluid.2012.03.006. 19

  20. [21]

    Huismann, J

    I. Huismann, J. Stiller, and J. Fr ¨ohlich, Factorizing the factorization – a spectral-element solver for elliptic equations with linear operation count , J. Comput. Phys., 346 (2017), pp. 437–448, doi:10.1016/j.jcp.2017.06.012

  21. [22]

    R. C. Kirby , Fast inversion of the simplicial Bernstein mass matrix , Numer. Math., 135 (2017), pp. 73–95, doi:10.1007/s00211-016-0795-0

  22. [23]

    M. G. Knepley, J. Brown, K. Rupp, and B. F. Smith , Achieving high performance with unified residual evaluation , arXiv preprint cs.MS, 1309.1204 (2013), pp. 1–4

  23. [24]

    Kopriva, Implementing spectral methods for partial differential equa tions, Springer, Berlin, 2009

    D. Kopriva, Implementing spectral methods for partial differential equa tions, Springer, Berlin, 2009

  24. [25]

    Kronbichler and K

    M. Kronbichler and K. Kormann , A generic interface for parallel cell-based finite element o perator application, Comput. Fluids, 63 (2012), pp. 135–147, doi:10.1016/j.co mpfluid.2012.04.012

  25. [26]

    Kronbichler and K

    M. Kronbichler and K. Kormann , Fast matrix-free evaluation of discontinuous Galerkin fini te element operators, ACM Trans. Math. Softw., in press (2019), pp. 1–37, doi:10. 1145/3325864

  26. [27]

    Kronbichler and K

    M. Kronbichler and K. Ljungkvist , Multigrid for matrix-free high-order finite element computations on graphics processors , ACM Trans. Parallel Comput., 6 (2019), pp. 2:1–2:32, doi:10.1145/3322813

  27. [28]

    Kronbichler, S

    M. Kronbichler, S. Schoeder, C. M ¨uller, and W. A. W all , Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acous tic wave equation , Int. J. Numer. Meth. Eng., 106 (2016), pp. 712–739, doi:10.1002/nme.5137

  28. [29]

    Kronbichler and W

    M. Kronbichler and W. A. W all , A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers , SIAM J. Sci. Comput., 40 (2018), pp. A3423–A3448, doi:10.1137/16M110455X

  29. [30]

    Lehrenfeld and J

    C. Lehrenfeld and J. Sch ¨oberl, High order exactly divergence-free hybrid discontinuous G alerkin methods for unsteady incompressible flows , Comput. Meth. Appl. Mech. Eng., 307 (2016), pp. 339–361, doi:10.1016/j.cma.2016.04.025

  30. [31]

    J. W. Lottes and P. F. Fischer , Hybrid multigrid/Schwarz algorithms for the spectral elem ent method, J. Sci. Comput., 24 (2005), pp. 45–78, doi:10.1007/s10915 -004-4787

  31. [32]

    R. E. Lynch, J. R. Rice, and D. H. Thomas , Direct solution of partial difference equations by tensor product methods , Numer. Math., 6 (1964), pp. 185–199, doi:10.1007/BF01386 067

  32. [33]

    S. A. Orszag , Spectral methods for problems in complex geometries , J. Comput. Phys., 37 (1980), pp. 70–92, doi:10.1016/0021-9991(80)90005-4

  33. [34]

    Peraire and P.-O

    J. Peraire and P.-O. Persson , The compact discontinuous Galerkin (CDG) method for ellipt ic problems, SIAM J. Sci. Comput., 30 (2008), pp. 1806–1824, doi:10.113 7/070685518

  34. [35]

    Raissi, P

    P. Persson , A sparse and high-order accurate line-based discontinuous Galerkin method for unstruc- tured meshes, J. Comput. Phys., (2013), pp. 414–429, doi:10.1016/j.jcp .2012.09.008

  35. [36]

    Piatkowski, S

    M. Piatkowski, S. M ¨uthing, and P. Bastian , A stable and high-order accurate discontinuous Galerkin based splitting method for the incompressible Nav ier–Stokes equations , J. Comput. Phys., 356 (2018), pp. 220–239, doi:10.1016/j.jcp.2017.11.035

  36. [37]

    Schoeder, K

    S. Schoeder, K. Kormann, W. A. W all, and M. Kronbichler , Efficient explicit time stepping of high order discontinuous Galerkin schemes for waves , SIAM J. Sci. Comput., 40 (2018), pp. C803– C826, doi:10.1137/18M1185399

  37. [38]

    Solomonoff , A fast algorithm for spectral differentiation , J

    A. Solomonoff , A fast algorithm for spectral differentiation , J. Comput. Phys., 98 (1992), pp. 174– 177, doi:10.1016/0021-9991(92)90182-X

  38. [39]

    Sundar, G

    H. Sundar, G. Stadler, and G. Biros , Comparison of multigrid algorithms for high-order continuous finite element discretizations , Numer. Linear Algebra Appl., 22 (2015), pp. 664–680, doi:10.1002/nla.1979

  39. [40]

    Treibig, G

    J. Treibig, G. Hager, and G. Wellein , LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments , in Proceedings of PSTI2010, the First International Works hop on Parallel Software Tools and Tool Infrastructures, San Dieg o CA, 2010, doi:10.1109/ICPPW.2010.38. https://github.com/RRZE-HPC/likwid, retrieved on June 19, 2019

  40. [41]

    Trottenberg, C

    U. Trottenberg, C. Oosterlee, and A. Sch ¨uller, Multigrid, Elsevier Academic Press, London, 2001

  41. [42]

    R. S. V arga, Matrix iterative analysis , Springer, Berlin, 2nd ed., 2009

  42. [43]

    Williams, A

    S. Williams, A. W aterman, and D. P atterson , Roofline: An insightful visual performance model for multicore architectures, Commun. ACM, 52 (2009), pp. 65–76, doi:10.1145/1498765.1 498785

  43. [44]

    Witte, D

    J. Witte, D. Arndt, and G. Kanschat , Fast tensor product Schwarz smoothers , 2019. in prepa- ration. 20