arxiv: 2604.04644 · v1 · submitted 2026-04-06 · 🧮 math.NA · cs.NA

Architecture-aware h-to-p optimisation: spectral/hp element operators for mixed-element meshes

Jacques Y. Xing, Boyang Xia, Diego Renner, Chris D. Cantwell, David Moxey, Robert M. Kirby, Spencer J. Sherwin

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 🧮 math.NA cs.NA

keywords spectral element methods · mixed element meshes · GPU optimization · Helmholtz operator · tensorial expansions · hp finite elements · stiffness matrix evaluation · architecture-aware performance

The pith

Architecture-aware optimizations let spectral/hp operators on mixed-element meshes keep tetrahedral Helmholtz throughput within 2.5 times that of hexahedral elements on GPUs despite six times the floating-point operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends prior hexahedral-focused optimizations for spectral element methods on GPUs and vector CPUs to meshes that combine prismatic, pyramidic, and tetrahedral elements through tensorial expansions. It shows that standard operators such as mass and Helmholtz matrices reach best performance when their implementations are chosen according to element shape, polynomial order, and target architecture. A new evaluation technique for operations that include derivatives in the integrand, such as stiffness matrices, exploits the collocation properties of nodal tensorial expansions to maximize efficient operations. GPU measurements confirm that tetrahedral Helmholtz throughput stays at most 2.5 times slower than hexahedral throughput even though tetrahedra require six times more floating-point work.
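To make the contrast between implementation strategies concrete, the sketch below evaluates a tensor-product expansion at quadrature points in two ways: by applying the full Kronecker-product operator directly (in the spirit of the StdMat label used in the figure captions) and by applying one small one-dimensional product per direction (in the spirit of SumFac). It is a minimal two-dimensional illustration in plain Python, not the paper's kernels; the function names, the monomial basis, and the sample values are all assumptions chosen for readability.

```python
# Hypothetical sketch: names and data are illustrative, not taken from the paper.

def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def full_operator_eval(b1, b2, coeffs):
    """Apply the Kronecker-product operator directly: O(Q1*Q2*P1*P2) work."""
    return [[sum(b1[i][p] * b2[j][q] * coeffs[p][q]
                 for p in range(len(b1[0]))
                 for q in range(len(b2[0])))
             for j in range(len(b2))]
            for i in range(len(b1))]


def sum_factorised_eval(b1, b2, coeffs):
    """Apply one direction at a time: u = B1 * U * B2^T."""
    return matmul(matmul(b1, coeffs), transpose(b2))


# Assumed 1D bases: modes (1, x) sampled at a few points per direction.
B1 = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]   # 3 points x 2 modes
B2 = [[1.0, -0.5], [1.0, 0.5]]               # 2 points x 2 modes
U = [[2.0, 1.0], [3.0, 4.0]]                 # coefficients U[p][q]

u_full = full_operator_eval(B1, B2, U)
u_fac = sum_factorised_eval(B1, B2, U)
```

Both routes return the same values at every point; they differ only in how the work is organised, which is the kind of choice the paper reports as depending on element shape, polynomial order, and architecture.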

Core claim

What carries the argument

Shape- and architecture-dependent implementation strategies for tensorial expansions, together with a collocation-based method that maximises nodal operations when evaluating derivative inner products such as those in the stiffness matrix.

Load-bearing premise

The alternative implementation strategies for each element shape and architecture are correctly realized in code and the reported throughput numbers reflect true optimal performance without unaccounted overhead from mesh connectivity or data movement in mixed-element settings.

What would settle it

A direct measurement of Helmholtz operator throughput on tetrahedral versus hexahedral elements inside a single mixed mesh on the same GPU, checking whether the tetrahedral rate ever falls below 40 percent of the hexahedral rate.
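The 40 percent threshold follows from the headline figure by simple arithmetic, sketched below. The function names are hypothetical, and the implied floating-point rate assumes that the slowdown and the factor of six refer to the same unit of work, which the summary above does not state explicitly.

```python
# Hypothetical helpers; only the figures 2.5 and 6 come from the text above.

def min_relative_throughput(max_slowdown):
    """Smallest tet/hex throughput ratio compatible with a given slowdown."""
    return 1.0 / max_slowdown


def implied_flop_rate_ratio(flop_factor, slowdown):
    """Tet/hex floating-point rate if tets do flop_factor times the work
    per unit of output while running 'slowdown' times slower (assumption)."""
    return flop_factor / slowdown


def claim_holds(hex_throughput, tet_throughput, max_slowdown=2.5):
    """True if the measured tet throughput is within the claimed slowdown."""
    return tet_throughput * max_slowdown >= hex_throughput


threshold = min_relative_throughput(2.5)            # fraction of hex throughput
flop_rate_gain = implied_flop_rate_ratio(6.0, 2.5)  # tet vs hex flop rate
```

Under these assumptions, a measured tetrahedral rate below `threshold` times the hexahedral rate would contradict the claim, and `claim_holds` expresses that test for any pair of measurements taken in the same units.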

Figures

Figures reproduced from arXiv: 2604.04644 by Boyang Xia, Chris D. Cantwell, David Moxey, Diego Renner, Jacques Y. Xing, Robert M. Kirby, Spencer J. Sherwin.

**Figure 1.** Representation of a high-order triangular element with three different coordinate systems. Elements may be curvilinear or benefit from reduced geometric information if regular straight-sided. The distribution of quadrature points is illustrated with equispaced lines.

**Figure 2.** Nektar++ execution space model hierarchy. Execution spaces are shown in green, while specific backends are shown in blue. Availability of backends depends on the host and/or device architecture. J.Y. Xing et al.: Preprint submitted to Elsevier.

**Figure 3.** Diagram showing the SumFac implementation on GPU. The four grey dots are stored contiguously in memory and accessed by different threads. For illustrative purposes, we suppose there are four threads in each work group. Each work group processes a different element group.

**Figure 4.** Diagram showing the SumFacTOP implementation on GPU. The four grey dots are stored contiguously in memory and accessed by different threads. For illustrative purposes, we suppose there are four threads in each work group. Each work group processes a different element.

**Figure 5.** (a) NVIDIA GH200 Grace Hopper Superchip and (b) Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the mass operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements. For different polynomial degree (P) the standard matrix (StdMat) approach is labelled in red, the vectorised sum-factorisation (SumFac) is labelled in blue and the sum-factor…

**Figure 6.** (a) NVIDIA GH200 Grace Hopper Superchip and (b) Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr), and tetrahedral (Tet) elements. For different polynomial degree (P) the standard matrix (StdMat) approach is labelled in red, the vectorised sum-factorisation (SumFac) is labelled in blue and the sum-…

**Figure 7.** Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the mass operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the vectorised sum-factorisation (SumFac) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation, whereas the dashed lines indicate the throughput for cur…

**Figure 8.** Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the vectorised sum-factorisation (SumFac) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation, whereas the dashed lines indicate the throughput fo…

**Figure 9.** Intel Xeon 6526Y throughput performance versus elemental degrees of freedom for the Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr), and tetrahedral (Tet) elements using the vectorised sum-factorisation (SumFac) implementation. The solid lines indicate throughput assuming the collocated (coll) approach on regular elements, whereas the dashed lines indicate the throughput for the traditio…

**Figure 10.** Field and Block data structure association. Field contains a series of blocks representing similar properties of elements that are associated with a contiguous block of memory on the device (i.e. GPU).

**Figure 11.** Schematic representation of the Operator class hierarchy.

**Figure 12.** NVIDIA GH200 Grace Hopper Superchip throughput performance versus elemental degrees of freedom for the mass operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the sum-factorisation threaded on output-point (SumFacTOP) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation, whereas the dashed …

**Figure 13.** NVIDIA GH200 Grace Hopper Superchip throughput performance versus elemental degrees of freedom for the Helmholtz operator for hexahedral (Hex), prismatic (Prism), pyramidic (Pyr) and tetrahedral (Tet) elements using the sum-factorisation threaded on output-point (SumFacTOP) implementation. The solid lines indicate throughput assuming regular (Regular) elements that have an affine transformation, whereas the da…

**Figure 14.** Throughput results for the AMD EPYC 9554 processor for (a) deformed and (b) regular elements for the Helmholtz operator on all element types using the SumFac and StdMat implementation strategies.

Original abstract

We extend earlier international efforts to optimise hexahedral-based spectral element methods on GPUs and vectorised CPUs to mixed element meshes additionally involving prismatic, pyramidic, and tetrahedral shapes using tensorial expansions. We demonstrate that common finite element operators (such as the mass and Helmholtz matrices) benefit from alternative implementation strategies depending on the element shape, choice of polynomial order, and system architecture in order to achieve optimal performance. In addition, we introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integrand such as the stiffness matrix. This approach seeks to maximise operations using the collocation properties of the nodal tensorial expansion associated with classical quadrature rules. Our GPU performance tests demonstrate that the throughput of the Helmholtz operator on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra having a factor of six greater floating-point operations.
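The collocation property mentioned in the abstract can be illustrated in one dimension. Assuming a nodal Lagrange basis whose nodes coincide with the quadrature points, the basis matrix at those points is the identity, so a stiffness-type inner product reduces to a derivative matrix, a diagonal weight scaling, and a transposed derivative matrix. The sketch uses the three-point Gauss-Lobatto-Legendre rule and is a textbook illustration only; it is not the paper's algorithm for prisms, pyramids, or tetrahedra, and the name `apply_stiffness` is hypothetical.

```python
# Three-point Gauss-Lobatto-Legendre rule on [-1, 1].
nodes = [-1.0, 0.0, 1.0]
weights = [1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0]

# D[i][j] = derivative of the j-th Lagrange polynomial at node i.
D = [[-1.5, 2.0, -0.5],
     [-0.5, 0.0, 0.5],
     [0.5, -2.0, 1.5]]


def apply_stiffness(u):
    """Return K u with K = D^T W D (hypothetical helper).

    Because the basis is nodal at the quadrature points, no separate
    interpolation step is needed: u already holds the point values.
    """
    n = len(u)
    du = [sum(D[i][j] * u[j] for j in range(n)) for i in range(n)]    # D u
    wdu = [w * d for w, d in zip(weights, du)]                        # W D u
    return [sum(D[i][j] * wdu[i] for i in range(n)) for j in range(n)]  # D^T W D u


def energy(u):
    """u^T K u, the discrete integral of (du/dx)^2 over [-1, 1]."""
    return sum(ui * ki for ui, ki in zip(u, apply_stiffness(u)))


linear = list(nodes)                   # u(x) = x
quadratic = [x * x for x in nodes]     # u(x) = x^2
```

As a sanity check, `energy(linear)` approximates the integral of 1 over [-1, 1] and `energy(quadratic)` approximates the integral of 4x^2, both of which this quadrature rule integrates exactly.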

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper extends GPU and vector-CPU optimizations for spectral/hp operators from pure hexahedral meshes to mixed-element ones and shows that, on GPUs, tets incur at most a 2.5x throughput penalty versus hexes on the Helmholtz operator despite 6x the FLOPs.

The main takeaway is that different element shapes benefit from different kernel strategies on GPUs and vector CPUs, and the authors give a collocation-based route to handle derivative inner products without full quadrature. They apply this to mass and Helmholtz operators on hexes, tets, prisms, and pyramids using tensor-product expansions. The practical result is that real engineering meshes no longer force a big performance hit when you move away from all-hex grids.

The 2.5x slowdown figure for tets is the clearest evidence they provide, and it sits well below the 6x floating-point work once the shape-specific kernels are in place. That part is useful and directly addresses a gap left by the earlier hexahedral-only papers they cite. The implementation choices look thoughtful: they avoid a one-size-fits-all approach and instead pick the better route per shape and order. The stress-test note is right that the numbers are internally consistent with the stated operation counts and do not rest on circular fitting.

Soft spots are limited. The claims rest on benchmark throughput rather than formal error bounds or exhaustive comparison against other libraries, so a reader still needs to check whether the reported numbers include all data-movement costs in a true mixed-mesh setting. More explicit pseudocode or small code snippets for the collocation derivative step would make the new interpretation easier to reproduce.

Overall the work is for people building or tuning high-order finite-element codes for CFD or structural problems on GPUs. It is worth sending to a serious referee because the performance data are concrete, the extension to mixed elements is new, and the central claim holds up without obvious contradictions.

Referee Report

2 major / 3 minor

Summary. The manuscript extends prior work on optimizing spectral/hp element operators for hexahedral meshes on GPUs and vectorized CPUs to mixed-element meshes that also include prisms, pyramids, and tetrahedra. It employs tensor-product expansions and demonstrates that mass and Helmholtz operators benefit from shape-specific kernel strategies depending on polynomial order and architecture. A new collocation-based interpretation is introduced for operations involving derivatives in the integrand, such as the stiffness matrix. GPU benchmarks are reported showing that Helmholtz operator throughput on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra requiring six times more floating-point operations.

Significance. If the empirical throughput ratios hold under the stated implementation choices, the work is significant for enabling practical high-order discretizations on complex geometries that require mixed meshes. It provides concrete evidence that the FLOPs penalty of non-hex elements can be largely offset by architecture-aware kernels, which could accelerate adoption of spectral/hp methods in applications such as CFD and structural mechanics where pure hexahedral meshes are infeasible.

major comments (2)

[GPU performance tests] The central performance claim (Helmholtz throughput on tets at most 2.5× slower than hexes) is presented in the GPU tests; however, the manuscript does not report achieved memory bandwidth or arithmetic intensity for each element type, making it difficult to confirm that the 2.5× factor reflects optimal kernel realization rather than residual data-movement overhead in the mixed-mesh setting.
[Stiffness matrix evaluation] The new collocation approach for the stiffness matrix (maximizing operations via nodal tensorial expansions and classical quadrature) is described conceptually, but the manuscript lacks a side-by-side comparison of its floating-point operation count versus a standard quadrature implementation for tetrahedral elements at representative polynomial orders (e.g., P=4–8).
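The first major comment asks for arithmetic intensity and achieved bandwidth. A minimal roofline-style sketch of how those two quantities bound a kernel's attainable rate is given here; the function names and every number in it are placeholders in arbitrary units, not measurements from the paper.

```python
# Roofline-style bound; all values below are placeholders, not paper data.

def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def roofline_bound(intensity, peak_flops, peak_bandwidth):
    """Attainable flop rate: limited by compute or by memory traffic."""
    return min(peak_flops, intensity * peak_bandwidth)


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """True when the bandwidth ceiling lies below the compute ceiling."""
    return intensity * peak_bandwidth < peak_flops


PEAK_FLOPS = 100.0      # placeholder compute ceiling
PEAK_BANDWIDTH = 10.0   # placeholder memory ceiling

# Two hypothetical kernels moving the same data, one doing 6x the work.
light = arithmetic_intensity(flops=40.0, bytes_moved=20.0)
heavy = arithmetic_intensity(flops=240.0, bytes_moved=20.0)

light_rate = roofline_bound(light, PEAK_FLOPS, PEAK_BANDWIDTH)
heavy_rate = roofline_bound(heavy, PEAK_FLOPS, PEAK_BANDWIDTH)
```

With these placeholder values the lighter kernel is limited by memory traffic and the heavier one by compute, which is why reporting both metrics per element type would help separate kernel quality from data-movement overhead.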

minor comments (3)

[Introduction] The abstract and introduction use the term 'tensorial expansions' without an early equation defining the basis functions for each element shape; a compact notation table would improve readability.
[Figures] Several figure captions for the performance plots omit the exact polynomial orders and mesh sizes used in the timing runs.
[Related work] The manuscript cites prior hexahedral-only GPU work but does not explicitly contrast the new mixed-element kernels against the most recent published tensor-product implementations for prisms or tets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and recommendation for minor revision. The comments are constructive and we address each major point below, indicating the revisions we will make to strengthen the manuscript.

Referee: [GPU performance tests] The central performance claim (Helmholtz throughput on tets at most 2.5× slower than hexes) is presented in the GPU tests; however, the manuscript does not report achieved memory bandwidth or arithmetic intensity for each element type, making it difficult to confirm that the 2.5× factor reflects optimal kernel realization rather than residual data-movement overhead in the mixed-mesh setting.

Authors: We agree that additional performance metrics would strengthen the interpretation of the reported throughput ratios. In the revised manuscript we will add a new table (or subsection in the GPU results) reporting measured memory bandwidth utilization and arithmetic intensity for the Helmholtz operator on each element type. These values are readily obtainable from the existing benchmark runs and will clarify that the kernels are operating near architectural limits rather than being limited by unoptimized data movement in the mixed-mesh setting. The 2.5× throughput factor itself remains an empirical, directly measured quantity; the added metrics will simply provide supporting context.
revision: yes
Referee: [Stiffness matrix evaluation] The new collocation approach for the stiffness matrix (maximizing operations via nodal tensorial expansions and classical quadrature) is described conceptually, but the manuscript lacks a side-by-side comparison of its floating-point operation count versus a standard quadrature implementation for tetrahedral elements at representative polynomial orders (e.g., P=4–8).

Authors: We accept that a quantitative FLOPs comparison would make the advantage of the collocation interpretation more concrete. In the revision we will insert a short table (new Table X) that tabulates the floating-point operation counts for the collocation-based stiffness evaluation versus a conventional quadrature implementation on tetrahedral elements for polynomial degrees P=4 through P=8. The counts follow directly from the tensor-product structure and quadrature rules already detailed in the manuscript; no new algorithmic development is required.
revision: yes
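As a rough illustration of how operation counts follow from tensor-product structure, the sketch below counts multiply-add floating-point operations for a backward transform on a hexahedron, comparing a dense elemental matrix with three one-dimensional contractions. This is the standard hexahedral count only; it is not the tetrahedral collocation-versus-quadrature table the authors promise, and the choice of Q = P + 2 quadrature points per direction, like the function names, is an assumption.

```python
# Hexahedral backward-transform counts; function names are hypothetical.

def dense_matrix_flops(p, q):
    """Dense (q^3 x (p+1)^3) matrix-vector product, 2 flops per entry."""
    n = p + 1
    return 2 * q**3 * n**3


def sum_factorisation_flops(p, q):
    """Three successive 1D contractions, 2 flops per multiply-add.

    Stage 1 produces q*n*n values, stage 2 q*q*n values, stage 3 q*q*q
    values; each value costs n multiply-adds.
    """
    n = p + 1
    return 2 * (q * n**3 + q**2 * n**2 + q**3 * n)


# Assumed quadrature order: q = p + 2 points per direction.
counts = {p: (dense_matrix_flops(p, p + 2), sum_factorisation_flops(p, p + 2))
          for p in range(4, 9)}

for p, (dense, factored) in counts.items():
    print(f"P={p}: dense={dense}, sum-factorised={factored}, "
          f"ratio={dense / factored:.1f}")
```

The same bookkeeping, applied to the collapsed-coordinate expansions and quadrature rules in the manuscript, is what a tetrahedral version of the table would need to spell out.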

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarks independent of derivation

The paper's central result is an empirical GPU throughput measurement (Helmholtz operator on tets at most 2.5x slower than hexes despite 6x FLOPs). This is obtained from implemented tensor-product kernels with shape-specific strategies (mass, stiffness via collocation) and reported benchmarks across polynomial orders and architectures. No load-bearing step reduces a claimed prediction or first-principles result to fitted inputs, self-citations, or ansatzes by construction. The performance ratio follows directly from the stated implementation choices and measured timings rather than any equation that equates to its own inputs. Self-citations to prior hexahedral optimization work are present but not required to establish the mixed-element comparison or the 2.5x factor.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard properties of tensorial spectral expansions and classical quadrature rules with no new free parameters or invented entities introduced.

axioms (1)

standard math: Collocation properties of nodal tensorial expansions with classical quadrature rules allow efficient evaluation of derivative inner products. Invoked to justify the new approach for stiffness-like operators.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag describes the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Our GPU performance tests demonstrate that the throughput of the Helmholtz operator on tetrahedral elements is at most 2.5 times slower than on hexahedral elements, despite tetrahedra having a factor of six greater floating-point operations."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We introduce a new approach/interpretation to efficiently evaluate more complex operations involving inner products with the derivative of the expansions as part of the integrand such as the stiffness matrix."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
