Architecture-aware h-to-p optimisation: spectral/hp element operators for mixed-element meshes
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
Architecture-aware optimizations let spectral/hp operators on mixed-element meshes keep tetrahedral Helmholtz throughput within 2.5 times that of hexahedral elements on GPUs despite six times the floating-point operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We extend earlier international efforts to optimise hexahedral-based spectral element methods on GPUs and vectorised CPUs to mixed element meshes additionally involving prismatic, pyramidic, and tetrahedral shapes using tensorial expansions. We demonstrate that common finite element operators (such as the mass and Helmholtz matrices) benefit from alternative implementation strategies depending on the element shape, choice of polynomial order, and system architecture in order to achieve optimal performance. In addition, we introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integr
What carries the argument
Shape- and architecture-dependent implementation strategies for tensorial expansions, together with a collocation-based method that maximises nodal operations when evaluating derivative inner products such as those in the stiffness matrix.
Load-bearing premise
The alternative implementation strategies for each element shape and architecture are correctly realized in code and the reported throughput numbers reflect true optimal performance without unaccounted overhead from mesh connectivity or data movement in mixed-element settings.
What would settle it
A direct measurement of Helmholtz operator throughput on tetrahedral versus hexahedral elements inside a single mixed mesh on the same GPU, checking whether the tetrahedral rate ever falls below 40 percent of the hexahedral rate.
Figures
read the original abstract
We extend earlier international efforts to optimise hexahedral-based spectral element methods on GPUs and vectorised CPUs to mixed element meshes additionally involving prismatic, pyramidic, and tetrahedral shapes using tensorial expansions. We demonstrate that common finite element operators (such as the mass and Helmholtz matrices) benefit from alternative implementation strategies depending on the element shape, choice of polynomial order, and system architecture in order to achieve optimal performance. In addition, we introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integrand such as the stiffness matrix. This approach seeks to maximise operations using the collocation properties of the nodal tensorial expansion associated with classical quadrature rules. Our GPU performance tests demonstrate that the throughput of the Helmholtz operator on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra having a factor of six greater floating-point operations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends prior work on optimizing spectral/hp element operators for hexahedral meshes on GPUs and vectorized CPUs to mixed-element meshes that also include prisms, pyramids, and tetrahedra. It employs tensor-product expansions and demonstrates that mass and Helmholtz operators benefit from shape-specific kernel strategies depending on polynomial order and architecture. A new collocation-based interpretation is introduced for operations involving derivatives in the integrand, such as the stiffness matrix. GPU benchmarks are reported showing that Helmholtz operator throughput on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra requiring six times more floating-point operations.
Significance. If the empirical throughput ratios hold under the stated implementation choices, the work is significant for enabling practical high-order discretizations on complex geometries that require mixed meshes. It provides concrete evidence that the FLOPs penalty of non-hex elements can be largely offset by architecture-aware kernels, which could accelerate adoption of spectral/hp methods in applications such as CFD and structural mechanics where pure hexahedral meshes are infeasible.
major comments (2)
- [GPU performance tests] The central performance claim (Helmholtz throughput on tets at most 2.5× slower than hexes) is presented in the GPU tests; however, the manuscript does not report achieved memory bandwidth or arithmetic intensity for each element type, making it difficult to confirm that the 2.5× factor reflects optimal kernel realization rather than residual data-movement overhead in the mixed-mesh setting.
- [Stiffness matrix evaluation] The new collocation approach for the stiffness matrix (maximizing operations via nodal tensorial expansions and classical quadrature) is described conceptually, but the manuscript lacks a side-by-side comparison of its floating-point operation count versus a standard quadrature implementation for tetrahedral elements at representative polynomial orders (e.g., P=4–8).
minor comments (3)
- [Introduction] The abstract and introduction use the term 'tensorial expansions' without an early equation defining the basis functions for each element shape; a compact notation table would improve readability.
- [Figures] Several figure captions for the performance plots omit the exact polynomial orders and mesh sizes used in the timing runs.
- [Related work] The manuscript cites prior hexahedral-only GPU work but does not explicitly contrast the new mixed-element kernels against the most recent published tensor-product implementations for prisms or tets.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation and recommendation for minor revision. The comments are constructive and we address each major point below, indicating the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [GPU performance tests] The central performance claim (Helmholtz throughput on tets at most 2.5× slower than hexes) is presented in the GPU tests; however, the manuscript does not report achieved memory bandwidth or arithmetic intensity for each element type, making it difficult to confirm that the 2.5× factor reflects optimal kernel realization rather than residual data-movement overhead in the mixed-mesh setting.
Authors: We agree that additional performance metrics would strengthen the interpretation of the reported throughput ratios. In the revised manuscript we will add a new table (or subsection in the GPU results) reporting measured memory bandwidth utilization and arithmetic intensity for the Helmholtz operator on each element type. These values are readily obtainable from the existing benchmark runs and will clarify that the kernels are operating near architectural limits rather than being limited by unoptimized data movement in the mixed-mesh setting. The 2.5× throughput factor itself remains an empirical, directly measured quantity; the added metrics will simply provide supporting context. revision: yes
-
Referee: [Stiffness matrix evaluation] The new collocation approach for the stiffness matrix (maximizing operations via nodal tensorial expansions and classical quadrature) is described conceptually, but the manuscript lacks a side-by-side comparison of its floating-point operation count versus a standard quadrature implementation for tetrahedral elements at representative polynomial orders (e.g., P=4–8).
Authors: We accept that a quantitative FLOPs comparison would make the advantage of the collocation interpretation more concrete. In the revision we will insert a short table (new Table X) that tabulates the floating-point operation counts for the collocation-based stiffness evaluation versus a conventional quadrature implementation on tetrahedral elements for polynomial degrees P=4 through P=8. The counts follow directly from the tensor-product structure and quadrature rules already detailed in the manuscript; no new algorithmic development is required. revision: yes
Circularity Check
No significant circularity: empirical benchmarks independent of derivation
full rationale
The paper's central result is an empirical GPU throughput measurement (Helmholtz operator on tets at most 2.5x slower than hexes despite 6x FLOPs). This is obtained from implemented tensor-product kernels with shape-specific strategies (mass, stiffness via collocation) and reported benchmarks across polynomial orders and architectures. No load-bearing step reduces a claimed prediction or first-principles result to fitted inputs, self-citations, or ansatzes by construction. The performance ratio follows directly from the stated implementation choices and measured timings rather than any equation that equates to its own inputs. Self-citations to prior hexahedral optimization work are present but not required to establish the mixed-element comparison or the 2.5x factor.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Collocation properties of nodal tensorial expansions with classical quadrature rules allow efficient evaluation of derivative inner products
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our GPU performance tests demonstrate that the throughput of the Helmholtz operator on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra having a factor of six greater floating-point operations.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integrand such as the stiffness matrix.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Chris D. Cantwell, David Moxey, Andrew Comerford, Alessandro Bolis, Gabriele Rocco, Gianmarco Mengaldo, Daniele De Grazia, Sergey Yakovlev, Jean-Eloi Lombard, Dirk Ekelschot, Bastien Jordi, Hui Xu, Yumnah Mohamied, Claus Eskilsson, Blake Nelson, Peter Vos, Cristian Biotto, Robert M. Kirby, and Spencer J. Sherwin. Nektar++: An open-source spectral/hp eleme...
work page 2015
-
[2]
Chris D. Cantwell, Spencer J. Sherwin, Roberty M. Kirby, and Paul H. J. Kelly. From h to p efficiently: Strategy selection for operator evaluation on hexahedral and tetrahedral elements.Computers and Fluids, 43(1):23–28, 2011
work page 2011
-
[3]
Chris D. Cantwell, Spencer J. Sherwin, Robert M. Kirby, and Paul H. J. Kelly. From h to p efficiently: Selecting the optimal spectral/hp discretisation in three dimensions.Mathematical Modelling of Natural Phenomena, 6(3):84–96, 2011
work page 2011
-
[4]
JanEichstädt,MartinVymazal,DavidMoxey,andJoaquimPeiró.Acomparisonoftheshared-memoryparallelprogrammingmodelsOpenMP, OpenACC and Kokkos in the context of implicit solvers for high-order FEM.Computer Physics Communications, page 107245, 2020
work page 2020
-
[5]
Paul Fischer, James Lottes, and Henry Tufo. Nek5000. Technical report, Argonne National Laboratory (ANL), Argonne, IL (United States), 2007
work page 2007
-
[6]
Stefano Markidis, Jing Gong, Michael Schliephake, Erwin Laure, Alistair Hart, David Henty, Katherine Heisey, and Paul Fischer. Openacc acceleration ofthe nek5000spectral element code.TheInternational Journal ofHigh PerformanceComputing Applications, 29(3):311–319, 2015
work page 2015
-
[7]
Strong scaling of openacc enabled nek5000 on several gpu based hpc systems
JonathanVincent,JingGong,MartinKarp,AdamPeplinski,NiclasJansson,ArturPodobas,AndreasJocksch,JieYao,FazleHussain,Stefano Markidis, et al. Strong scaling of openacc enabled nek5000 on several gpu based hpc systems. InInternational Conference on High Performance Computing in Asia-Pacific Region, pages 94–102, 2022
work page 2022
-
[8]
Openacc—firstexperienceswithreal-worldapplications
SandraWienke,PaulSpringer,ChristianTerboven,andDieteranMey. Openacc—firstexperienceswithreal-worldapplications. InEuropean Conference on Parallel Processing, pages 859–870. Springer, 2012
work page 2012
-
[9]
Nekrs, a gpu-accelerated spectral element navier–stokes solver.Parallel Computing, 114:102982, 2022
Paul Fischer, Stefan Kerkemeier, Misun Min, Yu-Hsiang Lan, Malachi Phillips, Thilina Rathnayake, Elia Merzari, Ananias Tomboulides, Ali Karakus, Noel Chalmers, et al. Nekrs, a gpu-accelerated spectral element navier–stokes solver.Parallel Computing, 114:102982, 2022
work page 2022
-
[10]
Occa:Aunifiedapproachtomulti-threadinglanguages.arXivpreprintarXiv:1403.0968, 2014
DavidSMedina,AmikSt-Cyr,andTimWarburton. Occa:Aunifiedapproachtomulti-threadinglanguages.arXivpreprintarXiv:1403.0968, 2014
-
[11]
Niclas Jansson, Martin Karp, Artur Podobas, Stefano Markidis, and Philipp Schlatter. Neko: A modern, portable, and scalable framework for high-fidelity computational fluid dynamics.Computers & Fluids, 275:106243, 2024
work page 2024
-
[12]
Mfem:Amodularfiniteelementmethodslibrary.Computers&MathematicswithApplications,81:42–74, 2021
Robert Anderson, Julian Andrej, Andrew Barker, Jamie Bramwell, Jean-Sylvain Camier, Jakub Cerveny, Veselin Dobrev, Yohann Dudouit, AaronFisher,TzanioKolev,etal. Mfem:Amodularfiniteelementmethodslibrary.Computers&MathematicswithApplications,81:42–74, 2021
work page 2021
-
[13]
Beckingsale, Jason Burmark, Rich Hornung, Holger Jones, William Killian, Adam J
David A. Beckingsale, Jason Burmark, Rich Hornung, Holger Jones, William Killian, Adam J. Kunen, Olga Pearce, Peter Robinson, Brian S. Ryujin, and Thomas R. W. Scogland. Raja: Portable performance for large-scale scientific applications. In2019 ieee/acm international workshop on performance, portability and productivity in hpc (p3hpc), pages 71–81. IEEE, 2019
work page 2019
-
[14]
WolfgangBangerth,RalfHartmann,andGuidoKanschat.deal.ii—ageneral-purposeobject-orientedfiniteelementlibrary.ACMTransactions on Mathematical Software (TOMS), 33(4):24–es, 2007
work page 2007
- [15]
- [16]
-
[17]
H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns.Journal of Parallel and Distributed Computing, 74(12):3202–3216, 2014
work page 2014
-
[18]
Afluxreconstructionapproachtohigh-orderschemesincludingdiscontinuousgalerkinmethods
HungTHuynh. Afluxreconstructionapproachtohigh-orderschemesincludingdiscontinuousgalerkinmethods. In18thAIAAcomputational fluid dynamics conference, page 4079, 2007
work page 2007
-
[19]
Hesthaven and Tim Warburton.Nodal Discontinuous Galerkin Methods
Jan S. Hesthaven and Tim Warburton.Nodal Discontinuous Galerkin Methods. Springer, 2008
work page 2008
-
[20]
The development of discontinuous galerkin methods
Bernardo Cockburn, George E Karniadakis, and Chi-Wang Shu. The development of discontinuous galerkin methods. InDiscontinuous Galerkin methods: theory, computation and applications, pages 3–50. Springer, 2000
work page 2000
-
[21]
Freddie D Witherden, Antony M Farrington, and Peter E Vincent. Pyfr: An open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach.Computer Physics Communications, 185(11):3028–3040, 2014
work page 2014
-
[22]
Freddie D Witherden, Peter E Vincent, Will Trojak, Yoshiaki Abe, Amir Akbarzadeh, Semih Akkurt, Mohammad Alhawwary, Lidia Caros, Tarik Dzanic, Giorgio Giangaspero, et al. Pyfr v2. 0.3: Towards industrial adoption of scale-resolving simulations.Computer Physics Communications, 311:109567, 2025
work page 2025
-
[23]
Mako templates for python.BibSonomy www
Michael Bayer. Mako templates for python.BibSonomy www. bibsonomy. org/bibtex/aa47d818a1c2f889b7456117003b3d42, 2012
work page 2012
-
[24]
Marius Kurz, Daniel Kempf, Marcel P Blind, Patrick Kopper, Philipp Offenhäuser, Anna Schwarz, Spencer Starr, Jens Keim, and Andrea Beck. Galæxi:Solvingcomplexcompressibleflowswithhigh-orderdiscontinuousgalerkinmethodsonaccelerator-basedsystems.Computer Physics Communications, 306:109388, 2025
work page 2025
-
[25]
Nico Krais, Andrea Beck, Thomas Bolemann, Hannes Frank, David Flad, Gregor Gassner, Florian Hindenlang, Malte Hoffmann, Thomas Kuhn,MatthiasSonntag,etal. Flexi:Ahighorderdiscontinuousgalerkinframeworkforhyperbolic–parabolicconservationlaws.Computers & Mathematics with Applications, 81:186–219, 2021
work page 2021
-
[26]
Esteban Ferrer, Gonzalo Rubio, Gerasimos Ntoukas, Wojciech Laskowski, Oscar A Mariño, Stefano Colombo, Andrés Mateo-Gabín, H Marbona, F Manrique de Lara, David Huergo, et al. Horses3d: A high-order discontinuous galerkin solver for flow simulations and multi- physics applications.Computer Physics Communications, 287:108700, 2023. J.Y. Xing et al.:Preprint...
work page 2023
-
[27]
A comparison of h-and p- refinement to capture wind turbine wakes.Physics of Fluids, 36(12), 2024
Hatem Kessasra, Marta Cordero-Gracia, Mariola Gómez, Eusebio Valero, Gonzalo Rubio, and Esteban Ferrer. A comparison of h-and p- refinement to capture wind turbine wakes.Physics of Fluids, 36(12), 2024
work page 2024
-
[28]
OxfordUniversityPress,second edition, 2005
GeorgeKarniadakisandSpencerSherwin.Spectral/hpElementMethodsforComputationalFluidDynamics. OxfordUniversityPress,second edition, 2005
work page 2005
-
[29]
GeorgeKarniadakisandSpencerSherwin.Spectral/ℎ𝑝ElementMethodsforComputationalFluidDynamics. OxfordUniversityPress,2013
work page 2013
-
[30]
Gianmarco Mengaldo, Daniele De Grazia, David Moxey, Peter E. Vincent, and Spencer J. Sherwin. Dealiasing techniques for high-order spectral element methods on regular and irregular grids.Journal of Computational Physics, 299:56–81, 2015
work page 2015
-
[31]
Michael G. Duffy. Quadrature over a pyramid or cube of integrands with a singularity at a vertex.SIAM Journal on Numerical Analysis, 19(6):1260–1262, 1982
work page 1982
-
[32]
H Carter Edwards, Christian R Trott, and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns.Journal of parallel and distributed computing, 74(12):3202–3216, 2014
work page 2014
-
[33]
Starpu: a unified platform for task scheduling on heterogeneous multicore architectures
Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. Starpu: a unified platform for task scheduling on heterogeneous multicore architectures. InEuropean Conference on Parallel Processing, pages 863–874. Springer, 2009
work page 2009
-
[34]
DavidMoxey,RomanAmici,andMikeKirby. Efficientmatrix-freehigh-orderfiniteelementevaluationforsimplicialelements.SIAMJournal on Scientific Computing, 42(3):C97–C123, 2020
work page 2020
-
[35]
xtensor-stack. xsimd: C++ wrappers for simd intrinsics and parallelized, optimized mathematical functions.https://github.com/ xtensor-stack/xsimd, 2025
work page 2025
-
[36]
David Moxey, Chris D. Cantwell, Robert M. Kirby, and Spencer J. Sherwin. Optimizing the performance of the spectral/hp element method with collective linear algebra operations.Computer Methods in Applied Mechanics and Engineering, 310:628–645, 2016
work page 2016
-
[37]
David Moxey, Chris D. Cantwell, Yan Bao, Andrea Cassinelli, Giacomo Castiglioni, Sehun Chun, Emilia Juda, Ehsan Kazemi, Kilian Lack- hove,JulianMarcon,GianmarcoMengaldo,DouglasSerson,MichaelTurner,HuiXu,JoaquimPeiró,RobertM.Kirby,andSpencerJ.Sherwin. Nektar++: Enhancing the capability and application of high-fidelity spectral/hp element methods.Computer P...
work page 2020
-
[38]
Libxsmm:acceleratingsmallmatrixmultiplicationsbyruntimecode generation
AlexanderHeinecke,GregHenry,MaxwellHutchinson,andHansPabst. Libxsmm:acceleratingsmallmatrixmultiplicationsbyruntimecode generation. InSC’16:ProceedingsoftheInternationalConferenceforHighPerformanceComputing,Networking,StorageandAnalysis,pages 981–991. IEEE, 2016
work page 2016
-
[39]
oneapi math library (onemath).https://github.com/uxlfoundation/oneMath
UXL Unified Acceleration Foundation. oneapi math library (onemath).https://github.com/uxlfoundation/oneMath
-
[40]
JedBrown,AhmadAbdelfattah,ValeriaBarra,NatalieBeams,Jean-SylvainCamier,VeselinDobrev,YohannDudouit,LeilaGhaffari,Tzanio Kolev,DavidMedina,etal. libCEED:Fastalgebraforhigh-orderelement-baseddiscretizations.JournalofOpenSourceSoftware,6(63):2945, 2021
work page 2021
-
[41]
Paul Fischer, Misun Min, Thilina Rathnayake, Som Dutta, Tzanio Kolev, Veselin Dobrev, Jean-Sylvain Camier, Martin Kronbichler, Tim Warburton,KasiaŚwirydowicz,etal. Scalabilityofhigh-performancepdesolvers.TheInternationalJournalofHighPerformanceComputing Applications, 34(5):562–586, 2020. J.Y. Xing et al.:Preprint submitted to ElsevierPage 19 of 19 Archite...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.