A Hermite-like basis for faster matrix-free evaluation of interior penalty discontinuous Galerkin operators

Julius Witte; Katharina Kormann; Martin Kronbichler; Niklas Fehn; Peter Munch

arxiv: 1907.08492 · v1 · pith:2XZNWCHLnew · submitted 2019-07-19 · 🧮 math.NA · cs.MS· cs.NA· cs.PF

A Hermite-like basis for faster matrix-free evaluation of interior penalty discontinuous Galerkin operators

Martin Kronbichler , Katharina Kormann , Niklas Fehn , Peter Munch , Julius Witte This is my paper

Pith reviewed 2026-05-24 19:20 UTC · model grok-4.3

classification 🧮 math.NA cs.MScs.NAcs.PF

keywords discontinuous Galerkinmatrix-free methodsinterior penaltyHermite-like basissum factorizationmultigridhexahedral elementsJacobi polynomials

0 comments

The pith

A Hermite-like basis limits neighbor access in matrix-free DG evaluations to one function value and one derivative per neighbor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a basis modeled on Hermite polynomials to speed up matrix-free evaluation of symmetric interior penalty discontinuous Galerkin operators on hexahedral elements. It applies the basis in a fully discontinuous setting specifically to shrink the effective stencil width, so each element needs only one neighboring data point for the function value and one for the derivative. The construction extends to higher polynomial orders by placing nodes at roots of Jacobi polynomials and to three dimensions by tensor products that support sum factorization. The reduced memory traffic raises throughput on current processors, and a cheap basis change via sum factorization keeps the method compatible with multigrid solvers even though a plain point-Jacobi smoother performs worse than with standard nodal bases.

Core claim

The Hermite-like basis allows matrix-free evaluation of symmetric interior penalty DG discretizations on hexahedra with neighbor access restricted to one function value and one derivative, achieved by nodal points from Jacobi polynomial roots and tensor-product extension that preserves sum-factorization structure; the resulting data-access reduction improves throughput while a sum-factorization basis change restores effective multigrid performance at negligible extra cost.

What carries the argument

The Hermite-like basis, which enforces a minimal stencil of one function value plus one derivative from each neighbor while supporting sum factorization.

If this is right

Matrix-free operator application requires less memory traffic per degree of freedom than standard nodal bases.
Sum factorization remains applicable in three dimensions despite the non-standard basis.
A plain point-Jacobi multigrid smoother becomes less efficient than with best nodal polynomials, but a sum-factorization basis change restores usable multigrid constituents.
The basis-change cost can be hidden behind the dominant data-access cost on modern hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same minimal-stencil idea could be tested on other DG flux formulations or on non-hexahedral elements to see whether the access reduction generalizes.
In large-scale parallel runs the lower per-element data volume might improve strong scaling once communication volume is measured.
Combining the basis with cache-blocking or vectorization strategies not discussed in the paper could yield further speed-ups.

Load-bearing premise

The nodal points taken from roots of Jacobi polynomials continue to deliver the one-value-plus-one-derivative stencil property after extension to higher orders and multiple dimensions by tensor products.

What would settle it

Run the operator evaluation for polynomial degree four in three dimensions and count the distinct neighbor data items actually loaded; if the count exceeds one value and one derivative per face neighbor, the minimal-stencil claim does not hold.

Figures

Figures reproduced from arXiv: 1907.08492 by Julius Witte, Katharina Kormann, Martin Kronbichler, Niklas Fehn, Peter Munch.

**Figure 2.** Figure 2: Hermite-like basis for p = 1 (left), p = 2 (middle), and p = 7 (right). derivatives in reference space as well, for which the values on the face are needed. For evaluating only the values on a face, e.g. for a hyperbolic term, a single value is touched via Sf . 2. Construction for p ≤ 2. We require Df = [−α1, α1, 0] at ξ = 0 and Df = [0, −α1, α1] at ξ = 1 with α1 = 2 for p = 2, see [PITH_FULL_IMAGE:figure… view at source ↗

**Figure 3.** Figure 3: Data access pattern for p = 5 in terms of values read (black circles) for the element shaded in blue and values read and written (black disks). Note that the Hermite-like basis is not nodal and the two layers of nodes closest to the surface are highlighted for illustration purposes. information must also be loaded. Computing a tri-linear mapping adds only little memory transfer for p ≥ 3 and is of high ari… view at source ↗

**Figure 4.** Figure 4: Throughput of double-precision matrix-vector product [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Analysis of data transfer over various memory hierarchie [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Throughput of double-precision matrix-vector product [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Analysis of evaluation of operator P −1 with basis transformation (12) versus a diagonal representation as function of the degree with 30–56 million DoF (left panel), as function of the size (middle panel), and as function of the evaluation time (right panel). The middle and right panels use the same data. All operations are done in single precision and with OpenMP on Xeon Gold. 6.3 Evaluation of P −1 A mo… view at source ↗

**Figure 8.** Figure 8: Throughput of one step within single-precision Chebyshev [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

read the original abstract

This work proposes a basis for improved throughput of matrix-free evaluation of discontinuous Galerkin symmetric interior penalty discretizations on hexahedral elements. The basis relies on ideas of Hermite polynomials. It is used in a fully discontinuous setting not for higher order continuity but to minimize the effective stencil width, namely to limit the neighbor access of an element to one data point for the function value and one for the derivative. The basis is extended to higher orders with nodal contributions derived from roots of Jacobi polynomials and extended to multiple dimensions with tensor products, which enable the use of sum factorization. The beneficial effect of the reduced data access on modern processors is shown. Furthermore, the viability of the basis in the context of multigrid solvers is analyzed. While a plain point-Jacobi approach is less efficient than with the best nodal polynomials, a basis change via sum-factorization techniques enables the combination of the fast matrix-vector products with effective multigrid constituents. The basis change is essentially for free on modern hardware because these computations can be hidden behind the cost of the data access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a Hermite-style nodal basis that limits DG neighbor access to one value and one derivative per face, then extends it with Jacobi roots and tensor products to keep sum factorization intact.

read the letter

The core contribution is a purpose-built basis that shrinks the effective stencil in matrix-free symmetric interior penalty DG on hexes. The authors take Hermite ideas but apply them inside a fully discontinuous space purely to cut data movement: each adjacent element contributes only a single value and a single normal derivative instead of the full polynomial trace. They then lift this to higher order by placing extra nodes at Jacobi roots and use tensor products so the sum-factorization machinery still works without extra neighbor fetches. That combination looks new for the DG setting. The performance section shows the expected reduction in memory traffic on current hardware, and the multigrid part demonstrates that a cheap basis change (hidden behind the data loads) lets them keep fast matrix-vector products while retaining reasonable smoother behavior. Point-Jacobi is slower than standard nodal bases, but the sum-factorized route recovers most of the advantage. The construction itself rests on standard polynomial properties rather than fitted parameters, so there is no obvious circularity. The main soft spot is that the abstract and stress-test note both flag the need to verify the stencil property survives the Jacobi-root and tensor-product steps; if any extra face coupling appears at p>1 or in 3D, the claimed data-access win shrinks. The paper presumably checks this numerically, but without explicit counts or error tables in the summary it is hard to judge how tight the bound stays. This is aimed at people already running high-order matrix-free codes for engineering problems. It is a practical, incremental improvement rather than a theoretical breakthrough, but the implementation details and hardware measurements give it enough substance to justify referee time.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Hermite-like nodal basis for matrix-free evaluation of symmetric interior penalty discontinuous Galerkin (SIPG) operators on hexahedral elements. The basis is intended to minimize the effective stencil in a fully discontinuous setting by restricting neighbor data access to exactly one function value and one normal derivative per adjacent element. The construction is extended from low-order Hermite functions to higher polynomial degrees via nodal points at roots of Jacobi polynomials and to multiple dimensions via tensor products, enabling sum factorization. Performance benefits for throughput on modern processors are claimed, and viability within multigrid solvers is analyzed via an essentially free basis change.

Significance. If the minimal-stencil property is preserved under the stated extensions, the approach could reduce memory-bandwidth pressure in high-order matrix-free DG kernels, which is often the dominant cost on current hardware. The integration with multigrid through sum-factorization basis changes is a constructive feature that addresses a practical limitation of non-standard bases.

major comments (2)

[§3] §3 (basis construction and tensor-product extension): the central claim that Jacobi-root nodal contributions plus tensor products preserve the one-value-plus-one-derivative neighbor stencil is not accompanied by an explicit verification, proof, or numerical check. SIPG face integrals depend on traces and normal derivatives; without showing that all other couplings remain identically zero for p>1, the claimed data-access reduction is unsupported.
[§5] §5 (numerical results): the text asserts that beneficial throughput effects are shown and that the basis change can be hidden behind data access, yet no quantitative timing tables, flop counts, or direct comparisons to standard nodal bases are referenced that would confirm the stencil reduction actually materializes in the implemented kernels for the tested orders and dimensions.

minor comments (2)

Notation for the one-dimensional basis functions and their extension to faces could be made more explicit with a small worked example (e.g., p=2) to aid reproducibility.
The multigrid section would benefit from a brief statement of the smoother and coarse-grid operator used, even if the focus is on the basis change cost.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [§3] §3 (basis construction and tensor-product extension): the central claim that Jacobi-root nodal contributions plus tensor products preserve the one-value-plus-one-derivative neighbor stencil is not accompanied by an explicit verification, proof, or numerical check. SIPG face integrals depend on traces and normal derivatives; without showing that all other couplings remain identically zero for p>1, the claimed data-access reduction is unsupported.

Authors: The 1D Hermite-like construction at Jacobi roots is chosen precisely so that all interior nodal contributions to a face integral vanish except the value and normal derivative at the face point; the tensor-product extension inherits this because SIPG face terms act only in the normal direction while the other two directions remain decoupled. We agree an explicit verification is missing from the current text. We will insert a short subsection in §3 containing (i) a direct algebraic argument that off-face couplings are identically zero for any p and (ii) a numerical check confirming the stencil for p=2,3,4 on a single face. revision: yes
Referee: [§5] §5 (numerical results): the text asserts that beneficial throughput effects are shown and that the basis change can be hidden behind data access, yet no quantitative timing tables, flop counts, or direct comparisons to standard nodal bases are referenced that would confirm the stencil reduction actually materializes in the implemented kernels for the tested orders and dimensions.

Authors: The manuscript reports throughput gains, yet we accept that the presentation lacks the explicit tables and flop-count breakdowns the referee requests. We will expand §5 with (i) wall-clock timing tables for the matrix-free operator on hexahedral meshes up to p=5 in 3D, (ii) arithmetic intensity and flop counts contrasting the new basis against standard nodal bases, and (iii) a direct side-by-side comparison demonstrating the measured reduction in neighbor data movement. revision: yes

Circularity Check

0 steps flagged

No circularity: construction uses standard polynomial roots and tensor products without reduction to fitted inputs or self-citations

full rationale

The paper presents an explicit basis construction starting from Hermite-like ideas, extended via roots of Jacobi polynomials for higher orders and tensor products for multiple dimensions to enable sum factorization. No equations or claims reduce a performance prediction or stencil property to a fitted parameter or prior self-citation by definition; the minimal neighbor access is a direct consequence of the chosen nodal points and the discontinuous setting, with viability shown through implementation rather than assumed uniqueness theorems. The derivation chain is self-contained against external polynomial properties and does not invoke load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on standard properties of Hermite and Jacobi polynomials without introducing new fitted parameters or postulated entities.

axioms (1)

standard math Standard algebraic and orthogonality properties of Hermite and Jacobi polynomials hold and can be used to define nodal points and basis functions.
Invoked to construct the basis and its nodal contributions.

pith-pipeline@v0.9.0 · 5736 in / 1156 out tokens · 20242 ms · 2026-05-24T19:20:01.228483+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

[1]

Adams, M

M. Adams, M. Brezina, J. Hu, and R. Tuminaro , Parallel multigrid smoothing: polynomial versus Gauss–Seidel, J. Comput. Phys., 188 (2003), pp. 593–610, doi:10.1016/S0 021-9991(03)00194-3

work page doi:10.1016/s0 2003
[2]

Ainsworth, G

M. Ainsworth, G. Andriamaro, and O. Davydov , Bernstein–B´ ezier ﬁnite elements of arbi- trary order and optimal assembly procedures , SIAM J. Sci. Comput., 33 (2011), pp. 3087–3109, doi:10.1137/11082539X

work page doi:10.1137/11082539x 2011
[3]

Alzetta, D

G. Alzetta, D. Arndt, W. Bangerth, V. Boddu, B. Brands, D. Dav ydov, R. Gassmoeller, T. Heister, L. Heltai, K. Kormann, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Tur- cksin, and D. Wells , The deal.II library, version 9.0 , J. Numer. Math., 26 (2018), pp. 173–184, doi:10.1515/jnma-2018-0054, www.dealii.org

work page doi:10.1515/jnma-2018-0054 2018
[4]

A Geometric Heuristic for Rectilinear Crossing Minimization

D. Appel ¨o, T. Hagstrom, and A. V argas , Hermite methods for the scalar wave equation , SIAM J. Sci. Comput., 40 (2018), pp. A3902–A3927, doi:10.1137/1 8M1171072

work page doi:10.1137/1 2018
[5]

Arnold, F

D. Arnold, F. Brezzi, B. Cockburn, and L. D. Marini , Uniﬁed analysis of discontinu- ous Galerkin methods for elliptic problems , SIAM J. Numer. Anal., 39 (2002), pp. 1749–1779, doi:10.1137/S0036142901384162

work page doi:10.1137/s0036142901384162 2002
[6]

Bangerth, C

W. Bangerth, C. Burstedde, T. Heister, and M. Kronbichler , Algorithms and data structures for massively parallel generic ﬁnite element codes , ACM Trans. Math. Softw., 38 (2011), pp. 14:1– 14:28, doi:10.1145/2049673.2049678

work page doi:10.1145/2049673.2049678 2011
[7]

Beuchler, V

S. Beuchler, V. Pill wein, and S. Zaglmayr , Sparsity optimized high order ﬁnite element functions for H(div) on simplices , Numer. Math., 122 (2012), pp. 197–225, doi:10.1007/s0021 1-012-0461-0

work page doi:10.1007/s0021 2012
[8]

Burstedde, L

C. Burstedde, L. C. Wilcox, and O. Ghattas , p4est: Scalable algorithms for parallel adap- tive mesh reﬁnement on forests of octrees , SIAM J. Sci. Comput., 33 (2011), pp. 1103–1133, doi:10.1137/10079163, http://p4est.org

work page doi:10.1137/10079163 2011
[9]

Cockburn, J

B. Cockburn, J. Gopalakrishnan, and R. Lazarov , Uniﬁed hybridization of discontinuous Galerkin, mixed, and continuous Galerkin methods for secon d order elliptic equations , SIAM J. Numer. Anal., 47 (2009), pp. 1139–1365, doi:10.1137/070706616

work page doi:10.1137/070706616 2009
[11]

M. O. Deville, P. F. Fischer, and E. H. Mund , High-order methods for incompressible ﬂuid ﬂow , vol. 9, Cambridge University Press, 2002

work page 2002
[12]

Epshteyn and B

Y. Epshteyn and B. Rivi `ere, Estimation of penalty parameters for symmetric interior pe nalty Galerkin methods, J. Comput. Appl. Math., 206 (2007), pp. 843–872, doi:10.10 16/j.cam.2006.08.029

work page 2007
[13]

N. Fehn, W. A. W all, and M. Kronbichler , Eﬃciency of high-performance discontinuous Galerkin spectral element methods for under-resolved turb ulent incompressible ﬂows , Int. J. Numer. Meth. Fluids, 88 (2018), pp. 32–54, doi:10.1002/ﬂd.4511

work page 2018
[14]

N. Fehn, W. A. W all, and M. Kronbichler , Robust and eﬃcient discontinuous Galerkin meth- ods for under-resolved turbulent incompressible ﬂows , J. Comput. Phys., 372 (2018), pp. 667–693, doi:10.1016/j.jcp.2018.06.037

work page doi:10.1016/j.jcp.2018.06.037 2018
[15]

Fischer, T

P. Fischer, T. Rathnayake, S. Dutta, V. Dobrev, T. Kolev, J.- S. Camier, M. Kronbichler, K. ´Swirydowicz, T. W arburton, J. Brown, and M. Min , Running faster in HPC applications , in preparation, (2019). http://ceed.exascaleproject.org

work page 2019
[16]

Gholami, D

A. Gholami, D. Malhotra, H. Sundar, and G. Biros , FFT, FMM, or multigrid? A comparative study of state-of-the-art Poisson solvers for uniform and n onuniform grids in the unit cube , SIAM J. Sci. Comput., 38 (2016), pp. C280–C306, doi:10.1137/15M10 10798

work page doi:10.1137/15m10 2016
[17]

W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith , Performance modeling and tuning of an unstructured mesh CFD application , in SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, 2000, pp. 34–34, doi:10.1109/SC.2000. 10059

work page doi:10.1109/sc.2000 2000
[18]

Hager and G

G. Hager and G. Wellein , Introduction to High Performance Computing for Scientists and Engi- neers, CRC Press, Boca Raton, 2011

work page 2011
[19]

P. W. Hemker, W. Hoffmann, and M. H. V an Raalte , Two-level Fourier analysis of a multigrid approach for discontinuous Galerkin discretization , SIAM J. Sci. Comput., 25 (2003), pp. 1018–1041, doi:10.1137/S1064827502405100

work page doi:10.1137/s1064827502405100 2003
[20]

Coclite, S

F. Hindenlang, G. Gassner, C. Altmann, A. Beck, M. Staudenma ier, and C.-D. Munz , Explicit discontinuous Galerkin methods for unsteady prob lems, Comput. Fluids, 61 (2012), pp. 86– 93, doi:10.1016/j.compﬂuid.2012.03.006. 19

work page doi:10.1016/j.comp 2012
[21]

Huismann, J

I. Huismann, J. Stiller, and J. Fr ¨ohlich, Factorizing the factorization – a spectral-element solver for elliptic equations with linear operation count , J. Comput. Phys., 346 (2017), pp. 437–448, doi:10.1016/j.jcp.2017.06.012

work page doi:10.1016/j.jcp.2017.06.012 2017
[22]

R. C. Kirby , Fast inversion of the simplicial Bernstein mass matrix , Numer. Math., 135 (2017), pp. 73–95, doi:10.1007/s00211-016-0795-0

work page doi:10.1007/s00211-016-0795-0 2017
[23]

M. G. Knepley, J. Brown, K. Rupp, and B. F. Smith , Achieving high performance with uniﬁed residual evaluation , arXiv preprint cs.MS, 1309.1204 (2013), pp. 1–4

work page internal anchor Pith review Pith/arXiv arXiv 2013
[24]

Kopriva, Implementing spectral methods for partial diﬀerential equa tions, Springer, Berlin, 2009

D. Kopriva, Implementing spectral methods for partial diﬀerential equa tions, Springer, Berlin, 2009

work page 2009
[25]

Kronbichler and K

M. Kronbichler and K. Kormann , A generic interface for parallel cell-based ﬁnite element o perator application, Comput. Fluids, 63 (2012), pp. 135–147, doi:10.1016/j.co mpﬂuid.2012.04.012

work page doi:10.1016/j.co 2012
[26]

Kronbichler and K

M. Kronbichler and K. Kormann , Fast matrix-free evaluation of discontinuous Galerkin ﬁni te element operators, ACM Trans. Math. Softw., in press (2019), pp. 1–37, doi:10. 1145/3325864

work page 2019
[27]

Kronbichler and K

M. Kronbichler and K. Ljungkvist , Multigrid for matrix-free high-order ﬁnite element computations on graphics processors , ACM Trans. Parallel Comput., 6 (2019), pp. 2:1–2:32, doi:10.1145/3322813

work page doi:10.1145/3322813 2019
[28]

Kronbichler, S

M. Kronbichler, S. Schoeder, C. M ¨uller, and W. A. W all , Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acous tic wave equation , Int. J. Numer. Meth. Eng., 106 (2016), pp. 712–739, doi:10.1002/nme.5137

work page doi:10.1002/nme.5137 2016
[29]

Kronbichler and W

M. Kronbichler and W. A. W all , A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers , SIAM J. Sci. Comput., 40 (2018), pp. A3423–A3448, doi:10.1137/16M110455X

work page doi:10.1137/16m110455x 2018
[30]

Lehrenfeld and J

C. Lehrenfeld and J. Sch ¨oberl, High order exactly divergence-free hybrid discontinuous G alerkin methods for unsteady incompressible ﬂows , Comput. Meth. Appl. Mech. Eng., 307 (2016), pp. 339–361, doi:10.1016/j.cma.2016.04.025

work page doi:10.1016/j.cma.2016.04.025 2016
[31]

J. W. Lottes and P. F. Fischer , Hybrid multigrid/Schwarz algorithms for the spectral elem ent method, J. Sci. Comput., 24 (2005), pp. 45–78, doi:10.1007/s10915 -004-4787

work page doi:10.1007/s10915 2005
[32]

R. E. Lynch, J. R. Rice, and D. H. Thomas , Direct solution of partial diﬀerence equations by tensor product methods , Numer. Math., 6 (1964), pp. 185–199, doi:10.1007/BF01386 067

work page doi:10.1007/bf01386 1964
[33]

S. A. Orszag , Spectral methods for problems in complex geometries , J. Comput. Phys., 37 (1980), pp. 70–92, doi:10.1016/0021-9991(80)90005-4

work page doi:10.1016/0021-9991(80)90005-4 1980
[34]

Peraire and P.-O

J. Peraire and P.-O. Persson , The compact discontinuous Galerkin (CDG) method for ellipt ic problems, SIAM J. Sci. Comput., 30 (2008), pp. 1806–1824, doi:10.113 7/070685518

work page 2008
[35]

Raissi, P

P. Persson , A sparse and high-order accurate line-based discontinuous Galerkin method for unstruc- tured meshes, J. Comput. Phys., (2013), pp. 414–429, doi:10.1016/j.jcp .2012.09.008

work page doi:10.1016/j.jcp 2013
[36]

Piatkowski, S

M. Piatkowski, S. M ¨uthing, and P. Bastian , A stable and high-order accurate discontinuous Galerkin based splitting method for the incompressible Nav ier–Stokes equations , J. Comput. Phys., 356 (2018), pp. 220–239, doi:10.1016/j.jcp.2017.11.035

work page doi:10.1016/j.jcp.2017.11.035 2018
[37]

Schoeder, K

S. Schoeder, K. Kormann, W. A. W all, and M. Kronbichler , Eﬃcient explicit time stepping of high order discontinuous Galerkin schemes for waves , SIAM J. Sci. Comput., 40 (2018), pp. C803– C826, doi:10.1137/18M1185399

work page doi:10.1137/18m1185399 2018
[38]

Solomonoff , A fast algorithm for spectral diﬀerentiation , J

A. Solomonoff , A fast algorithm for spectral diﬀerentiation , J. Comput. Phys., 98 (1992), pp. 174– 177, doi:10.1016/0021-9991(92)90182-X

work page doi:10.1016/0021-9991(92)90182-x 1992
[39]

Sundar, G

H. Sundar, G. Stadler, and G. Biros , Comparison of multigrid algorithms for high-order continuous ﬁnite element discretizations , Numer. Linear Algebra Appl., 22 (2015), pp. 664–680, doi:10.1002/nla.1979

work page doi:10.1002/nla.1979 2015
[40]

Treibig, G

J. Treibig, G. Hager, and G. Wellein , LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments , in Proceedings of PSTI2010, the First International Works hop on Parallel Software Tools and Tool Infrastructures, San Dieg o CA, 2010, doi:10.1109/ICPPW.2010.38. https://github.com/RRZE-HPC/likwid, retrieved on June 19, 2019

work page doi:10.1109/icppw.2010.38 2010
[41]

Trottenberg, C

U. Trottenberg, C. Oosterlee, and A. Sch ¨uller, Multigrid, Elsevier Academic Press, London, 2001

work page 2001
[42]

R. S. V arga, Matrix iterative analysis , Springer, Berlin, 2nd ed., 2009

work page 2009
[43]

Williams, A

S. Williams, A. W aterman, and D. P atterson , Rooﬂine: An insightful visual performance model for multicore architectures, Commun. ACM, 52 (2009), pp. 65–76, doi:10.1145/1498765.1 498785

work page doi:10.1145/1498765.1 2009
[44]

Witte, D

J. Witte, D. Arndt, and G. Kanschat , Fast tensor product Schwarz smoothers , 2019. in prepa- ration. 20

work page 2019

[1] [1]

Adams, M

M. Adams, M. Brezina, J. Hu, and R. Tuminaro , Parallel multigrid smoothing: polynomial versus Gauss–Seidel, J. Comput. Phys., 188 (2003), pp. 593–610, doi:10.1016/S0 021-9991(03)00194-3

work page doi:10.1016/s0 2003

[2] [2]

Ainsworth, G

M. Ainsworth, G. Andriamaro, and O. Davydov , Bernstein–B´ ezier ﬁnite elements of arbi- trary order and optimal assembly procedures , SIAM J. Sci. Comput., 33 (2011), pp. 3087–3109, doi:10.1137/11082539X

work page doi:10.1137/11082539x 2011

[3] [3]

Alzetta, D

G. Alzetta, D. Arndt, W. Bangerth, V. Boddu, B. Brands, D. Dav ydov, R. Gassmoeller, T. Heister, L. Heltai, K. Kormann, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Tur- cksin, and D. Wells , The deal.II library, version 9.0 , J. Numer. Math., 26 (2018), pp. 173–184, doi:10.1515/jnma-2018-0054, www.dealii.org

work page doi:10.1515/jnma-2018-0054 2018

[4] [4]

A Geometric Heuristic for Rectilinear Crossing Minimization

D. Appel ¨o, T. Hagstrom, and A. V argas , Hermite methods for the scalar wave equation , SIAM J. Sci. Comput., 40 (2018), pp. A3902–A3927, doi:10.1137/1 8M1171072

work page doi:10.1137/1 2018

[5] [5]

Arnold, F

D. Arnold, F. Brezzi, B. Cockburn, and L. D. Marini , Uniﬁed analysis of discontinu- ous Galerkin methods for elliptic problems , SIAM J. Numer. Anal., 39 (2002), pp. 1749–1779, doi:10.1137/S0036142901384162

work page doi:10.1137/s0036142901384162 2002

[6] [6]

Bangerth, C

W. Bangerth, C. Burstedde, T. Heister, and M. Kronbichler , Algorithms and data structures for massively parallel generic ﬁnite element codes , ACM Trans. Math. Softw., 38 (2011), pp. 14:1– 14:28, doi:10.1145/2049673.2049678

work page doi:10.1145/2049673.2049678 2011

[7] [7]

Beuchler, V

S. Beuchler, V. Pill wein, and S. Zaglmayr , Sparsity optimized high order ﬁnite element functions for H(div) on simplices , Numer. Math., 122 (2012), pp. 197–225, doi:10.1007/s0021 1-012-0461-0

work page doi:10.1007/s0021 2012

[8] [8]

Burstedde, L

C. Burstedde, L. C. Wilcox, and O. Ghattas , p4est: Scalable algorithms for parallel adap- tive mesh reﬁnement on forests of octrees , SIAM J. Sci. Comput., 33 (2011), pp. 1103–1133, doi:10.1137/10079163, http://p4est.org

work page doi:10.1137/10079163 2011

[9] [9]

Cockburn, J

B. Cockburn, J. Gopalakrishnan, and R. Lazarov , Uniﬁed hybridization of discontinuous Galerkin, mixed, and continuous Galerkin methods for secon d order elliptic equations , SIAM J. Numer. Anal., 47 (2009), pp. 1139–1365, doi:10.1137/070706616

work page doi:10.1137/070706616 2009

[10] [11]

M. O. Deville, P. F. Fischer, and E. H. Mund , High-order methods for incompressible ﬂuid ﬂow , vol. 9, Cambridge University Press, 2002

work page 2002

[11] [12]

Epshteyn and B

Y. Epshteyn and B. Rivi `ere, Estimation of penalty parameters for symmetric interior pe nalty Galerkin methods, J. Comput. Appl. Math., 206 (2007), pp. 843–872, doi:10.10 16/j.cam.2006.08.029

work page 2007

[12] [13]

N. Fehn, W. A. W all, and M. Kronbichler , Eﬃciency of high-performance discontinuous Galerkin spectral element methods for under-resolved turb ulent incompressible ﬂows , Int. J. Numer. Meth. Fluids, 88 (2018), pp. 32–54, doi:10.1002/ﬂd.4511

work page 2018

[13] [14]

N. Fehn, W. A. W all, and M. Kronbichler , Robust and eﬃcient discontinuous Galerkin meth- ods for under-resolved turbulent incompressible ﬂows , J. Comput. Phys., 372 (2018), pp. 667–693, doi:10.1016/j.jcp.2018.06.037

work page doi:10.1016/j.jcp.2018.06.037 2018

[14] [15]

Fischer, T

P. Fischer, T. Rathnayake, S. Dutta, V. Dobrev, T. Kolev, J.- S. Camier, M. Kronbichler, K. ´Swirydowicz, T. W arburton, J. Brown, and M. Min , Running faster in HPC applications , in preparation, (2019). http://ceed.exascaleproject.org

work page 2019

[15] [16]

Gholami, D

A. Gholami, D. Malhotra, H. Sundar, and G. Biros , FFT, FMM, or multigrid? A comparative study of state-of-the-art Poisson solvers for uniform and n onuniform grids in the unit cube , SIAM J. Sci. Comput., 38 (2016), pp. C280–C306, doi:10.1137/15M10 10798

work page doi:10.1137/15m10 2016

[16] [17]

W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith , Performance modeling and tuning of an unstructured mesh CFD application , in SC’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, 2000, pp. 34–34, doi:10.1109/SC.2000. 10059

work page doi:10.1109/sc.2000 2000

[17] [18]

Hager and G

G. Hager and G. Wellein , Introduction to High Performance Computing for Scientists and Engi- neers, CRC Press, Boca Raton, 2011

work page 2011

[18] [19]

P. W. Hemker, W. Hoffmann, and M. H. V an Raalte , Two-level Fourier analysis of a multigrid approach for discontinuous Galerkin discretization , SIAM J. Sci. Comput., 25 (2003), pp. 1018–1041, doi:10.1137/S1064827502405100

work page doi:10.1137/s1064827502405100 2003

[19] [20]

Coclite, S

F. Hindenlang, G. Gassner, C. Altmann, A. Beck, M. Staudenma ier, and C.-D. Munz , Explicit discontinuous Galerkin methods for unsteady prob lems, Comput. Fluids, 61 (2012), pp. 86– 93, doi:10.1016/j.compﬂuid.2012.03.006. 19

work page doi:10.1016/j.comp 2012

[20] [21]

Huismann, J

I. Huismann, J. Stiller, and J. Fr ¨ohlich, Factorizing the factorization – a spectral-element solver for elliptic equations with linear operation count , J. Comput. Phys., 346 (2017), pp. 437–448, doi:10.1016/j.jcp.2017.06.012

work page doi:10.1016/j.jcp.2017.06.012 2017

[21] [22]

R. C. Kirby , Fast inversion of the simplicial Bernstein mass matrix , Numer. Math., 135 (2017), pp. 73–95, doi:10.1007/s00211-016-0795-0

work page doi:10.1007/s00211-016-0795-0 2017

[22] [23]

M. G. Knepley, J. Brown, K. Rupp, and B. F. Smith , Achieving high performance with uniﬁed residual evaluation , arXiv preprint cs.MS, 1309.1204 (2013), pp. 1–4

work page internal anchor Pith review Pith/arXiv arXiv 2013

[23] [24]

Kopriva, Implementing spectral methods for partial diﬀerential equa tions, Springer, Berlin, 2009

D. Kopriva, Implementing spectral methods for partial diﬀerential equa tions, Springer, Berlin, 2009

work page 2009

[24] [25]

Kronbichler and K

M. Kronbichler and K. Kormann , A generic interface for parallel cell-based ﬁnite element o perator application, Comput. Fluids, 63 (2012), pp. 135–147, doi:10.1016/j.co mpﬂuid.2012.04.012

work page doi:10.1016/j.co 2012

[25] [26]

Kronbichler and K

M. Kronbichler and K. Kormann , Fast matrix-free evaluation of discontinuous Galerkin ﬁni te element operators, ACM Trans. Math. Softw., in press (2019), pp. 1–37, doi:10. 1145/3325864

work page 2019

[26] [27]

Kronbichler and K

M. Kronbichler and K. Ljungkvist , Multigrid for matrix-free high-order ﬁnite element computations on graphics processors , ACM Trans. Parallel Comput., 6 (2019), pp. 2:1–2:32, doi:10.1145/3322813

work page doi:10.1145/3322813 2019

[27] [28]

Kronbichler, S

M. Kronbichler, S. Schoeder, C. M ¨uller, and W. A. W all , Comparison of implicit and explicit hybridizable discontinuous Galerkin methods for the acous tic wave equation , Int. J. Numer. Meth. Eng., 106 (2016), pp. 712–739, doi:10.1002/nme.5137

work page doi:10.1002/nme.5137 2016

[28] [29]

Kronbichler and W

M. Kronbichler and W. A. W all , A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers , SIAM J. Sci. Comput., 40 (2018), pp. A3423–A3448, doi:10.1137/16M110455X

work page doi:10.1137/16m110455x 2018

[29] [30]

Lehrenfeld and J

C. Lehrenfeld and J. Sch ¨oberl, High order exactly divergence-free hybrid discontinuous G alerkin methods for unsteady incompressible ﬂows , Comput. Meth. Appl. Mech. Eng., 307 (2016), pp. 339–361, doi:10.1016/j.cma.2016.04.025

work page doi:10.1016/j.cma.2016.04.025 2016

[30] [31]

J. W. Lottes and P. F. Fischer , Hybrid multigrid/Schwarz algorithms for the spectral elem ent method, J. Sci. Comput., 24 (2005), pp. 45–78, doi:10.1007/s10915 -004-4787

work page doi:10.1007/s10915 2005

[31] [32]

R. E. Lynch, J. R. Rice, and D. H. Thomas , Direct solution of partial diﬀerence equations by tensor product methods , Numer. Math., 6 (1964), pp. 185–199, doi:10.1007/BF01386 067

work page doi:10.1007/bf01386 1964

[32] [33]

S. A. Orszag , Spectral methods for problems in complex geometries , J. Comput. Phys., 37 (1980), pp. 70–92, doi:10.1016/0021-9991(80)90005-4

work page doi:10.1016/0021-9991(80)90005-4 1980

[33] [34]

Peraire and P.-O

J. Peraire and P.-O. Persson , The compact discontinuous Galerkin (CDG) method for ellipt ic problems, SIAM J. Sci. Comput., 30 (2008), pp. 1806–1824, doi:10.113 7/070685518

work page 2008

[34] [35]

Raissi, P

P. Persson , A sparse and high-order accurate line-based discontinuous Galerkin method for unstruc- tured meshes, J. Comput. Phys., (2013), pp. 414–429, doi:10.1016/j.jcp .2012.09.008

work page doi:10.1016/j.jcp 2013

[35] [36]

Piatkowski, S

M. Piatkowski, S. M ¨uthing, and P. Bastian , A stable and high-order accurate discontinuous Galerkin based splitting method for the incompressible Nav ier–Stokes equations , J. Comput. Phys., 356 (2018), pp. 220–239, doi:10.1016/j.jcp.2017.11.035

work page doi:10.1016/j.jcp.2017.11.035 2018

[36] [37]

Schoeder, K

S. Schoeder, K. Kormann, W. A. W all, and M. Kronbichler , Eﬃcient explicit time stepping of high order discontinuous Galerkin schemes for waves , SIAM J. Sci. Comput., 40 (2018), pp. C803– C826, doi:10.1137/18M1185399

work page doi:10.1137/18m1185399 2018

[37] [38]

Solomonoff , A fast algorithm for spectral diﬀerentiation , J

A. Solomonoff , A fast algorithm for spectral diﬀerentiation , J. Comput. Phys., 98 (1992), pp. 174– 177, doi:10.1016/0021-9991(92)90182-X

work page doi:10.1016/0021-9991(92)90182-x 1992

[38] [39]

Sundar, G

H. Sundar, G. Stadler, and G. Biros , Comparison of multigrid algorithms for high-order continuous ﬁnite element discretizations , Numer. Linear Algebra Appl., 22 (2015), pp. 664–680, doi:10.1002/nla.1979

work page doi:10.1002/nla.1979 2015

[39] [40]

Treibig, G

J. Treibig, G. Hager, and G. Wellein , LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments , in Proceedings of PSTI2010, the First International Works hop on Parallel Software Tools and Tool Infrastructures, San Dieg o CA, 2010, doi:10.1109/ICPPW.2010.38. https://github.com/RRZE-HPC/likwid, retrieved on June 19, 2019

work page doi:10.1109/icppw.2010.38 2010

[40] [41]

Trottenberg, C

U. Trottenberg, C. Oosterlee, and A. Sch ¨uller, Multigrid, Elsevier Academic Press, London, 2001

work page 2001

[41] [42]

R. S. V arga, Matrix iterative analysis , Springer, Berlin, 2nd ed., 2009

work page 2009

[42] [43]

Williams, A

S. Williams, A. W aterman, and D. P atterson , Rooﬂine: An insightful visual performance model for multicore architectures, Commun. ACM, 52 (2009), pp. 65–76, doi:10.1145/1498765.1 498785

work page doi:10.1145/1498765.1 2009

[43] [44]

Witte, D

J. Witte, D. Arndt, and G. Kanschat , Fast tensor product Schwarz smoothers , 2019. in prepa- ration. 20

work page 2019