pith. machine review for the scientific record.

arxiv: 2604.04644 · v1 · submitted 2026-04-06 · 🧮 math.NA · cs.NA

Architecture-aware h-to-p optimisation: spectral/hp element operators for mixed-element meshes

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords spectral element methods · mixed element meshes · GPU optimization · Helmholtz operator · tensorial expansions · hp finite elements · stiffness matrix evaluation · architecture-aware performance

The pith

Architecture-aware optimizations let spectral/hp operators on mixed-element meshes keep tetrahedral Helmholtz throughput within 2.5 times that of hexahedral elements on GPUs despite six times the floating-point operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends prior hexahedral-focused optimizations for spectral element methods on GPUs and vector CPUs to meshes that combine prismatic, pyramidic, and tetrahedral elements through tensorial expansions. It shows that standard operators such as mass and Helmholtz matrices reach best performance when their implementations are chosen according to element shape, polynomial order, and target architecture. A new evaluation technique for operations that include derivatives in the integrand, such as stiffness matrices, exploits the collocation properties of nodal tensorial expansions to maximize efficient operations. GPU measurements confirm that tetrahedral Helmholtz throughput stays at most 2.5 times slower than hexahedral throughput even though tetrahedra require six times more floating-point work.
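To make the idea of alternative implementation strategies concrete, here is a minimal sketch contrasting a dense standard-matrix evaluation with a sum-factorised one for a backward transform on a tensor-product quadrilateral. It is an illustration under stated assumptions, not the paper's kernels: the function names, the monomial basis, and the 2D setting are all hypothetical choices made to keep the example short.

```python
def matmul(A, B):
    """Dense product of two row-major list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def monomial_basis(points, P):
    """Hypothetical 1D basis matrix: B[i][p] = x_i ** p (illustration only)."""
    return [[x ** p for p in range(P)] for x in points]


def backward_stdmat(B1, B2, uhat):
    """StdMat-style: apply the dense Kronecker-product operator, entries formed on the fly."""
    Q1, P1 = len(B1), len(B1[0])
    Q2, P2 = len(B2), len(B2[0])
    return [[sum(B1[i][p] * B2[j][q] * uhat[p][q]
                 for p in range(P1) for q in range(P2))
             for j in range(Q2)] for i in range(Q1)]


def backward_sumfac(B1, B2, uhat):
    """SumFac-style: u = B1 * uhat * B2^T, one direction at a time."""
    tmp = matmul(B1, uhat)               # [Q1 x P2]
    return matmul(tmp, transpose(B2))    # [Q1 x Q2]


def multiply_counts(P1, P2, Q1, Q2):
    """Scalar multiplications of the dense product versus the two-stage product."""
    stdmat = Q1 * Q2 * P1 * P2
    sumfac = Q1 * P1 * P2 + Q1 * Q2 * P2
    return stdmat, sumfac
```

Both routines return the same values at the quadrature points; they differ in operation count and memory access pattern, which is why the better choice can change with polynomial order and hardware.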

Core claim

We extend earlier international efforts to optimise hexahedral-based spectral element methods on GPUs and vectorised CPUs to mixed element meshes additionally involving prismatic, pyramidic, and tetrahedral shapes using tensorial expansions. We demonstrate that common finite element operators (such as the mass and Helmholtz matrices) benefit from alternative implementation strategies depending on the element shape, choice of polynomial order, and system architecture in order to achieve optimal performance. In addition, we introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integrand such as the stiffness matrix.

What carries the argument

Shape- and architecture-dependent implementation strategies for tensorial expansions, together with a collocation-based method that maximises nodal operations when evaluating derivative inner products such as those in the stiffness matrix.
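The collocation property can be illustrated in one dimension. The sketch below is a generic textbook construction and not the paper's algorithm: it assumes a Lagrange basis whose nodes coincide with three Gauss-Lobatto-Legendre quadrature points, so the basis evaluation matrix is the identity and the stiffness action needs only the differentiation matrix and the quadrature weights. All function names are hypothetical.

```python
from math import prod


def lagrange_diff_matrix(x):
    """D[i][j] = derivative of the j-th Lagrange polynomial at node x[i]."""
    n = len(x)
    c = [prod(x[i] - x[k] for k in range(n) if k != i) for i in range(n)]
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                D[i][j] = (c[i] / c[j]) / (x[i] - x[j])
        # Rows sum to zero because the derivative of a constant vanishes.
        D[i][i] = -sum(D[i][j] for j in range(n) if j != i)
    return D


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def matvec_t(A, v):
    """Apply the transpose of A to v."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


def stiffness_action_general(B, D, w, u):
    """General path: y = B^T D^T W D B u (basis evaluated at quadrature points)."""
    du = matvec(D, matvec(B, u))
    f = [wi * di for wi, di in zip(w, du)]
    return matvec_t(B, matvec_t(D, f))


def stiffness_action_collocated(D, w, u):
    """Collocated path: B is the identity, so y = D^T W D u."""
    f = [wi * di for wi, di in zip(w, matvec(D, u))]
    return matvec_t(D, f)


# Three-point Gauss-Lobatto-Legendre rule on [-1, 1].
NODES = [-1.0, 0.0, 1.0]
WEIGHTS = [1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0]
```

With the identity as the basis matrix, the two routines agree and the collocated one skips two matrix-vector products. How the paper carries this over to prismatic, pyramidic, and tetrahedral expansions is not reproduced here.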

Load-bearing premise

The alternative implementation strategies for each element shape and architecture are correctly realized in code and the reported throughput numbers reflect true optimal performance without unaccounted overhead from mesh connectivity or data movement in mixed-element settings.

What would settle it

A direct measurement of Helmholtz operator throughput on tetrahedral versus hexahedral elements inside a single mixed mesh on the same GPU, checking whether the tetrahedral rate ever falls below 40 percent of the hexahedral rate.
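The 40 percent threshold is the reciprocal of the 2.5 times slowdown. A minimal sketch of that arithmetic follows. The function name is hypothetical, and reading the factor of six as floating-point operations per degree of freedom is an assumption, since the summary does not state how the operation count is normalised.

```python
def relative_rates(slowdown, flop_factor):
    """Return (throughput fraction, implied FLOP-rate ratio) of tets relative to hexes.

    slowdown    : how many times slower tetrahedral throughput is than hexahedral
    flop_factor : how many times more floating-point work tetrahedra need
                  (assumed here to be per degree of freedom)
    """
    throughput_fraction = 1.0 / slowdown
    flop_rate_ratio = flop_factor / slowdown
    return throughput_fraction, flop_rate_ratio


fraction, flop_ratio = relative_rates(2.5, 6.0)
```

Under that assumption, a slowdown of at most 2.5 with six times the work means the tetrahedral kernels sustain at least 2.4 times the floating-point rate of the hexahedral ones.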

Figures

Figures reproduced from arXiv: 2604.04644 by Boyang Xia, Chris D. Cantwell, David Moxey, Diego Renner, Jacques Y. Xing, Robert M. Kirby, Spencer J. Sherwin.

Figure 1
Representation of a high-order triangular element with three different coordinate systems. Elements may be curvilinear or benefit from reduced geometric information if regular straight-sided. The distribution of quadrature points is illustrated with equispaced lines.
Figure 2
Nektar++ execution space model hierarchy. Execution spaces are shown in green, while specific backends are shown in blue. Availability of backends depends on the host and/or device architecture. (J.Y. Xing et al.: Preprint submitted to Elsevier.)
Figure 3
Diagram showing the SumFac implementation on GPU. The four grey dots are stored contiguously in memory and accessed by different threads. For illustrative purposes, we suppose there are four threads in each work group. Each work group processes a different element group.
Figure 4
Diagram showing the SumFacTOP implementation on GPU. The four grey dots are stored contiguously in memory and accessed by different threads. For illustrative purposes, we suppose there are four threads in each work group. Each work group processes a different element.
Figure 5
(a) NVIDIA GH200 Grace Hopper Superchip and (b) Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the mass operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements. For different polynomial degree (P) the standard matrix (StdMat) approach is labelled in red, the vectorised sum-factorisation (SumFac) is labelled in blue and the sum-factor…
Figure 6
(a) NVIDIA GH200 Grace Hopper Superchip and (b) Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr), and tetrahedral (Tet) elements. For different polynomial degree (P) the standard matrix (StdMat) approach is labelled in red, the vectorised sum-factorisation (SumFac) is labelled in blue and the sum-…
Figure 7
Intel Xeon 6526Y throughput performance versus elemental degrees of freedom on mass operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the vectorised sum-factorisation (SumFac) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation whereas the dashed lines indicate the throughput for cur…
Figure 8
Intel Xeon 6526Y throughput performance versus elemental degrees of freedom on Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the vectorised sum-factorisation (SumFac) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation whereas the dashed lines indicate the throughput fo…
Figure 9
Intel Xeon 6526Y throughput performance versus elemental degrees of freedom on Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr), and tetrahedral (Tet) elements using the vectorised sum-factorisation (SumFac) implementation. The solid lines indicate throughput assuming collocated (coll) approach on regular elements whereas the dashed lines indicate the throughput for the traditio…
Figure 10
Field and Block data structure association. Field contains a series of blocks representing similar properties of elements that are associated with a contiguous block of memory on the device (i.e. GPU).
Figure 11
Schematic representation of Operator class hierarchy.
Figure 12
NVIDIA GH200 Grace Hopper Superchip throughput performance versus elemental degrees of freedom on mass operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the sum-factorisation threaded on output-point (SumFacTOP) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation whereas the dashed …
Figure 13
NVIDIA GH200 Grace Hopper Superchip throughput performance versus elemental degrees of freedom on Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the sum-factorisation threaded on output-point (SumFacTOP) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation whereas the da…
Figure 14
Throughput results for the AMD EPYC 9554 processor for (a) deformed and (b) regular elements for the Helmholtz operator on all element types using the SumFac and StdMat implementation strategies.
Original abstract

We extend earlier international efforts to optimise hexahedral-based spectral element methods on GPUs and vectorised CPUs to mixed element meshes additionally involving prismatic, pyramidic, and tetrahedral shapes using tensorial expansions. We demonstrate that common finite element operators (such as the mass and Helmholtz matrices) benefit from alternative implementation strategies depending on the element shape, choice of polynomial order, and system architecture in order to achieve optimal performance. In addition, we introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integrand such as the stiffness matrix. This approach seeks to maximise operations using the collocation properties of the nodal tensorial expansion associated with classical quadrature rules. Our GPU performance tests demonstrate that the throughput of the Helmholtz operator on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra having a factor of six greater floating-point operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript extends prior work on optimizing spectral/hp element operators for hexahedral meshes on GPUs and vectorized CPUs to mixed-element meshes that also include prisms, pyramids, and tetrahedra. It employs tensor-product expansions and demonstrates that mass and Helmholtz operators benefit from shape-specific kernel strategies depending on polynomial order and architecture. A new collocation-based interpretation is introduced for operations involving derivatives in the integrand, such as the stiffness matrix. GPU benchmarks are reported showing that Helmholtz operator throughput on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra requiring six times more floating-point operations.

Significance. If the empirical throughput ratios hold under the stated implementation choices, the work is significant for enabling practical high-order discretizations on complex geometries that require mixed meshes. It provides concrete evidence that the FLOPs penalty of non-hex elements can be largely offset by architecture-aware kernels, which could accelerate adoption of spectral/hp methods in applications such as CFD and structural mechanics where pure hexahedral meshes are infeasible.

major comments (2)
  1. [GPU performance tests] The central performance claim (Helmholtz throughput on tets at most 2.5× slower than hexes) is presented in the GPU tests; however, the manuscript does not report achieved memory bandwidth or arithmetic intensity for each element type, making it difficult to confirm that the 2.5× factor reflects optimal kernel realization rather than residual data-movement overhead in the mixed-mesh setting.
  2. [Stiffness matrix evaluation] The new collocation approach for the stiffness matrix (maximizing operations via nodal tensorial expansions and classical quadrature) is described conceptually, but the manuscript lacks a side-by-side comparison of its floating-point operation count versus a standard quadrature implementation for tetrahedral elements at representative polynomial orders (e.g., P=4–8).
minor comments (3)
  1. [Introduction] The abstract and introduction use the term 'tensorial expansions' without an early equation defining the basis functions for each element shape; a compact notation table would improve readability.
  2. [Figures] Several figure captions for the performance plots omit the exact polynomial orders and mesh sizes used in the timing runs.
  3. [Related work] The manuscript cites prior hexahedral-only GPU work but does not explicitly contrast the new mixed-element kernels against the most recent published tensor-product implementations for prisms or tets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and recommendation for minor revision. The comments are constructive and we address each major point below, indicating the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [GPU performance tests] The central performance claim (Helmholtz throughput on tets at most 2.5× slower than hexes) is presented in the GPU tests; however, the manuscript does not report achieved memory bandwidth or arithmetic intensity for each element type, making it difficult to confirm that the 2.5× factor reflects optimal kernel realization rather than residual data-movement overhead in the mixed-mesh setting.

    Authors: We agree that additional performance metrics would strengthen the interpretation of the reported throughput ratios. In the revised manuscript we will add a new table (or subsection in the GPU results) reporting measured memory bandwidth utilization and arithmetic intensity for the Helmholtz operator on each element type. These values are readily obtainable from the existing benchmark runs and will clarify that the kernels are operating near architectural limits rather than being limited by unoptimized data movement in the mixed-mesh setting. The 2.5× throughput factor itself remains an empirical, directly measured quantity; the added metrics will simply provide supporting context. revision: yes

  2. Referee: [Stiffness matrix evaluation] The new collocation approach for the stiffness matrix (maximizing operations via nodal tensorial expansions and classical quadrature) is described conceptually, but the manuscript lacks a side-by-side comparison of its floating-point operation count versus a standard quadrature implementation for tetrahedral elements at representative polynomial orders (e.g., P=4–8).

    Authors: We accept that a quantitative FLOPs comparison would make the advantage of the collocation interpretation more concrete. In the revision we will insert a short table (new Table X) that tabulates the floating-point operation counts for the collocation-based stiffness evaluation versus a conventional quadrature implementation on tetrahedral elements for polynomial degrees P=4 through P=8. The counts follow directly from the tensor-product structure and quadrature rules already detailed in the manuscript; no new algorithmic development is required. revision: yes
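The bandwidth and arithmetic-intensity metrics discussed in the first response are commonly interpreted with a roofline model. The following is a minimal sketch of that bookkeeping, assuming the classic roofline bound; the function names are hypothetical and the numbers in the usage note are placeholders, not measurements from the paper or specifications of any device.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def roofline_bound(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP rate: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bandwidth)


def bandwidth_utilisation(bytes_moved, seconds, peak_bandwidth):
    """Fraction of peak memory bandwidth achieved by a timed kernel."""
    return (bytes_moved / seconds) / peak_bandwidth


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """True when the memory roof lies below the compute roof at this intensity."""
    return intensity * peak_bandwidth < peak_flops
```

For example, with placeholder values of 1.2e9 operations, 4.0e8 bytes, a peak of 1e12 FLOP/s and 1e11 bytes/s, the intensity is 3 FLOP/byte and the bound is 3e11 FLOP/s, so such a kernel would sit on the memory roof. Reporting these quantities per element type is what would let a reader judge whether the measured throughput gap reflects hardware limits or residual data movement.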

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarks independent of derivation

Full rationale

The paper's central result is an empirical GPU throughput measurement (Helmholtz operator on tets at most 2.5x slower than hexes despite 6x FLOPs). This is obtained from implemented tensor-product kernels with shape-specific strategies (mass, stiffness via collocation) and reported benchmarks across polynomial orders and architectures. No load-bearing step reduces a claimed prediction or first-principles result to fitted inputs, self-citations, or ansatzes by construction. The performance ratio follows directly from the stated implementation choices and measured timings rather than any equation that equates to its own inputs. Self-citations to prior hexahedral optimization work are present but not required to establish the mixed-element comparison or the 2.5x factor.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard properties of tensorial spectral expansions and classical quadrature rules with no new free parameters or invented entities introduced.

axioms (1)
  • [standard math] Collocation properties of nodal tensorial expansions with classical quadrature rules allow efficient evaluation of derivative inner products
    Invoked to justify the new approach for stiffness-like operators.

pith-pipeline@v0.9.0 · 5488 in / 1176 out tokens · 49620 ms · 2026-05-10T19:17:54.593961+00:00 · methodology


