RECURSUM: Automated Code Generation for Recurrence Relations Exceeds Expert Optimization via LayeredCodegen

Rub\'en Dar\'io Guerrero

arxiv: 2604.18585 · v1 · submitted 2026-01-22 · 💻 cs.PL

RECURSUM: Automated Code Generation for Recurrence Relations Exceeds Expert Optimization via LayeredCodegen

Rub\'en Dar\'io Guerrero This is my paper

Pith reviewed 2026-05-16 12:37 UTC · model grok-4.3

classification 💻 cs.PL

keywords automated code generationrecurrence relationsperformance optimizationdomain-specific languagescientific computingC++ codegenHermite coefficients

0 comments

The pith

RECURSUM generates C++ for recurrence relations that runs 9.8 times faster than expert hand-written code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RECURSUM, a Python DSL that lets users specify recurrence relations in 10-30 lines using einsum-inspired notation and generates over 650 lines of optimized C++ code. It introduces the LayeredCodegen backend whose architectural changes produce 9.8x speedup over expert hand-written implementations and 1.9x over template metaprogramming on McMurchie-Davidson Hermite coefficients. The system is tested on 24 recurrence types across polynomials, numerical quadrature, and quantum chemistry integrals, with generated code matching expert accuracy to within 3.3 percent. The work shows that automated generation can eliminate the dual-expertise requirement of domain knowledge plus low-level C++ optimization.

Core claim

LayeredCodegen achieves 9.8x speedup over expert hand-written implementations and 1.9x over template metaprogramming for McMurchie-Davidson Hermite coefficients by applying zero-copy output parameters (70-80 percent of the gain), guaranteed function inlining (15-20 percent), and exact-sized stack buffers (5-10 percent), while the generated code matches expert baselines within 3.3 percent across 24 recurrence types spanning mathematics, numerical analysis, and quantum chemistry.

What carries the argument

The LayeredCodegen backend, which replaces return-by-value and MAX-sized arrays with zero-copy parameters, forced inlining, and exact stack allocation to remove compiler and cache overheads in generated C++.

Load-bearing premise

The reported speedups result from fair comparisons against truly optimal expert implementations and that the three architectural effects generalize beyond the 24 tested recurrences without measurement bias.

What would settle it

Independent re-implementation of the expert McMurchie-Davidson baselines followed by identical runtime measurements on the same hardware and compiler flags would confirm or refute the claimed 9.8x factor.

Figures

Figures reproduced from arXiv: 2604.18585 by Rub\'en Dar\'io Guerrero.

**Figure 1.** Figure 1: LayeredCodegen automatic code generation achieves optimal performance for Hermite expansion coefficients. Performance comparison of four implementations for computing Hermite expansion coefficients E i,j t across shell pairs (ss, sp, pp, sd, pd, dd, ff, gg) with increasing total angular momentum L (0–8). The LayeredCodegen implementation (pink diamonds, solid line) generated by the automatic code genera… view at source ↗

**Figure 2.** Figure 2: LayeredCodegen achieves significant speedup over alternative code generation strategies for the ss shell pair. Speedup comparison for the ss shell pair showing LayeredCodegen performance relative to three alternative implementations: hand-written Layered, TMP (template metaprogramming), and Symbolic (SymPy-generated). The automatically generated LayeredCodegen implementation achieves 9.75× speedup over … view at source ↗

**Figure 3.** Figure 3: LayeredCodegen maintains constant performance while competing implementations exhibit exponential scaling with angular momentum. Performance scaling of Hermite expansion coefficient computation as a function of total angular momentum L = nA + nB (0–8). Data points represent execution times for representative shell pairs at each L value. Four implementations are compared: TMP (blue circles), Layered (oran… view at source ↗

**Figure 4.** Figure 4: Coulomb auxiliary integrals reveal dramatic performance differences: TMP and LayeredCodegen dominate while Layered implementation exhibits severe overhead. Performance comparison of three implementations for computing Coulomb auxiliary integrals R (m) tuv as a function of total angular momentum Ltotal (0–8): TMP (blue circles), Layered (orange squares), and LayeredCodegen (green diamonds). The implementa… view at source ↗

**Figure 5.** Figure 5: Coulomb auxiliary integrals reveal similar scaling behavior but dramatically different absolute performance across implementations. Total execution time as a function of problem size (number of integrals computed) for three implementations: TMP (blue circles), Layered (orange squares), and LayeredCodegen (green diamonds). Both axes use logarithmic scales. The number of integrals follows the tetrahedral s… view at source ↗

**Figure 6.** Figure 6: RECURSUM-accelerated J and K matrix construction demonstrates efficient scaling and K’s computational advantage. Performance comparison of Coulomb (J) and Exchange (K) matrix construction for alkane chains (CH4 through C4H10) with 6-31G basis set. Red bars show J matrix construction times, blue bars show K matrix construction times. Numbers above bars indicate K’s speedup factor relative to J (2.0–2.4×). … view at source ↗

**Figure 7.** Figure 7: J and K matrix algorithms exhibit O(N4 ) scaling with measured exponents matching theoretical predictions when Schwarz screening is disabled. Loglog plots showing computational scaling for (left) Coulomb (J) matrix and (right) Exchange (K) matrix construction as a function of basis shell count. Measured data points (circles for J, squares for K) fitted with power law curves (dashed lines) reveal scaling … view at source ↗

**Figure 8.** Figure 8: RECURSUM’s 9.8× Hermite coefficient speedup translates directly to 9.8× faster J and K matrix construction. Comparison of J (left) and K (right) matrix construction times using hand-written Hermite coefficient evaluation (gray bars, baseline) versus RECURSUM LayeredCodegen-generated code (colored bars). Green annotations show 9.8× speedup across all alkane systems. Both J and K algorithms spend ∼80% of exe… view at source ↗

**Figure 9.** Figure 9: Four-panel summary of J/K matrix performance demonstrating RECURSUM’s impact on practical quantum chemistry calculations. (A) Direct performance comparison showing K matrix’s 2.0–2.4× computational advantage over J matrix across alkane series. (B) Computational scaling analysis with power law fits revealing N4.02 (J) and N4.17 (K) exponents matching theoretical O(N4 ) complexity (all benchmarks performed… view at source ↗

read the original abstract

Automated code generation can systematically exceed expert hand-optimization for recurrence relations-computational primitives ubiquitous in orthogonal polynomials, special functions, numerical integration, and molecular integral evaluation. We present RECURSUM, a Python-based domain-specific language generating optimized C++ for arbitrary recurrence relations via three backends: template metaprogramming for compile-time evaluation, a novel LayeredCodegen backend with architectural optimizations, and runtime loop-based evaluation. The DSL uses einsum-inspired notation to specify recurrences, validity constraints, and base cases in 10-30 lines of Python, generating 650+ lines of production C++. LayeredCodegen achieves 9.8x speedup over expert hand-written implementations and 1.9x over template metaprogramming for McMurchie-Davidson Hermite coefficients. Architecture analysis reveals three quantifiable effects: (1) zero-copy output parameters eliminate return-by-value overhead (70-80% of speedup), (2) guaranteed function inlining eliminates compiler-refused overhead (15-20%), (3) exact-sized stack buffers achieve 100% cache efficiency vs 27% for MAX-sized arrays (5-10%). We validate on 24 recurrence types spanning pure mathematics (Legendre, Chebyshev, Hermite, Laguerre polynomials), numerical analysis (Clenshaw, Golub-Welsch), and quantum chemistry (McMurchie-Davidson, Rys quadrature, Boys function). Production benchmarks show speedups propagate to complete algorithms, with generated code matching expert baselines within 3.3%. RECURSUM demonstrates that systematic code generation serves as the performance ceiling for recurrence algorithms. By eliminating the dual expertise barrier (domain knowledge + C++ metaprogramming), the framework democratizes high-performance scientific computing-establishing a paradigm where automated generation systematically exceeds manual optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RECURSUM's LayeredCodegen delivers a practical DSL for recurrence code with quantified architectural gains, but the 9.8x expert speedup claim needs clearer baseline details to rule out measurement artifacts.

read the letter

The main thing to know is that this paper gives a Python DSL for writing recurrence relations in a compact einsum-style notation and then emitting optimized C++ through three backends, with the new LayeredCodegen one claiming to beat hand-written expert code by up to 9.8x on McMurchie-Davidson Hermite coefficients while staying within 3.3% accuracy on 24 different recurrence types from polynomials to quantum chemistry integrals. Generating 650 lines of production C++ from 10-30 lines of spec is a real reduction in the expertise barrier. The breakdown of where the speed comes from—zero-copy output parameters for most of the gain, guaranteed inlining, and exact-sized stack buffers for cache efficiency—is useful and concrete. That kind of architecture analysis is what makes the work worth looking at rather than just another code generator. The coverage across pure math, numerical methods, and molecular integrals shows the approach is not narrowly scoped. The central argument that systematic generation can serve as the performance ceiling holds up on its own terms if the numbers check out. The soft spot is the benchmark comparison itself. The headline speedups rest on the assumption that the expert implementations were already written with equivalent constraints on inlining, buffer sizing, and return semantics. If the hand-written baselines used MAX-sized arrays or return-by-value patterns, then a large fraction of the reported gain is just the application of standard C++ idioms rather than the generator discovering something the experts missed. The abstract gives no hardware specs, compiler flags, or error bars, so the evidence for outperformance is only partially supported right now. This paper is for people who build or maintain high-performance libraries for special functions and integrals. A reader who cares about automating the last mile of performance in scientific code would get concrete value from the DSL design and the optimization breakdown. It deserves a serious referee because the core engineering contribution is grounded and the empirical claims are falsifiable once the benchmark details are supplied. Send it to review with a request for expanded methodology and preferably released code artifacts.

Referee Report

2 major / 2 minor

Summary. The paper presents RECURSUM, a Python DSL for specifying recurrence relations (via einsum-like notation) that generates optimized C++ via three backends: template metaprogramming, a novel LayeredCodegen backend, and runtime loops. It claims LayeredCodegen produces code that achieves 9.8x speedup over expert hand-written implementations and 1.9x over template metaprogramming for McMurchie-Davidson Hermite coefficients, with the generated code matching expert baselines within 3.3% across 24 recurrence types from mathematics, numerical analysis, and quantum chemistry; the speedups are attributed to zero-copy outputs, guaranteed inlining, and exact-sized stack buffers.

Significance. If the performance claims are substantiated with fair, reproducible benchmarks, the work would be significant for high-performance scientific computing: it offers a systematic route to optimal implementations of recurrence primitives that appear in orthogonal polynomials, special functions, and molecular integrals, while lowering the barrier of dual expertise in domain science and C++ metaprogramming. The explicit decomposition of the three architectural effects is a concrete contribution that could guide future code-generation systems.

major comments (2)

[Abstract] Abstract: The headline claim of a 9.8x speedup over expert hand-written implementations for McMurchie-Davidson Hermite coefficients rests on the unverified assumption that the expert baselines were measured under equivalent constraints (zero-copy output parameters, guaranteed inlining, and exact-sized stack buffers). No description is given of the expert code's implementation style, compiler directives, or measurement protocol, so it is impossible to determine whether the reported factor reflects genuine superiority of LayeredCodegen or simply the application of standard C++ idioms that any expert could also have used.
[Abstract] Abstract and validation section: The manuscript reports speedups and a 3.3% accuracy match but supplies no information on hardware platform, compiler version and flags, number of timing repetitions, statistical error bars, or how the 24 recurrence types were selected and timed. These omissions make the central empirical claim only partially supported and prevent independent reproduction.

minor comments (2)

[Abstract] The abstract states that the DSL generates '650+ lines of production C++' from '10-30 lines of Python'; a concrete example of input specification versus generated output for one recurrence would strengthen the presentation.
[Abstract] The three quantified effects (70-80%, 15-20%, 5-10%) are presented as additive; a short table or breakdown showing how these percentages were isolated would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the empirical validation. We agree that the manuscript requires additional details on baselines and experimental setup for full reproducibility and will incorporate these in a major revision.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claim of a 9.8x speedup over expert hand-written implementations for McMurchie-Davidson Hermite coefficients rests on the unverified assumption that the expert baselines were measured under equivalent constraints (zero-copy output parameters, guaranteed inlining, and exact-sized stack buffers). No description is given of the expert code's implementation style, compiler directives, or measurement protocol, so it is impossible to determine whether the reported factor reflects genuine superiority of LayeredCodegen or simply the application of standard C++ idioms that any expert could also have used.

Authors: We agree that the manuscript lacks sufficient description of the expert baselines. The expert implementations referenced are standard published codes for McMurchie-Davidson coefficients from the quantum chemistry literature. In the revised manuscript we will add a dedicated subsection in the validation section that describes the expert code structure, confirms the compiler directives and optimization flags applied to both the generated and expert code, and explains the measurement protocol. This will show that LayeredCodegen systematically enforces the zero-copy, inlining, and exact-buffer properties rather than relying on ad-hoc expert application of the same idioms. revision: yes
Referee: [Abstract] Abstract and validation section: The manuscript reports speedups and a 3.3% accuracy match but supplies no information on hardware platform, compiler version and flags, number of timing repetitions, statistical error bars, or how the 24 recurrence types were selected and timed. These omissions make the central empirical claim only partially supported and prevent independent reproduction.

Authors: We acknowledge that the current text omits these critical experimental details. The revised manuscript will include a new 'Benchmarking Methodology' subsection that specifies the hardware platform, compiler version and flags, number of timing repetitions with statistical error bars, and the selection criteria plus timing protocol for the 24 recurrence types. These additions will fully substantiate the performance claims and enable independent reproduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks

full rationale

The paper's central claims derive from direct empirical benchmarks of generated C++ code against hand-written expert implementations and template metaprogramming across 24 recurrence types. No mathematical derivation chain exists that reduces by construction to fitted parameters, self-definitions, or self-citation load-bearing premises. The three architectural effects (zero-copy parameters, guaranteed inlining, exact-sized buffers) are post-hoc explanations of measured speedups rather than inputs that force the reported ratios. Results remain falsifiable via independent re-implementation and timing under equivalent compiler constraints, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical performance measurements of generated code rather than mathematical axioms or new postulated entities. No free parameters or invented entities are introduced.

pith-pipeline@v0.9.0 · 5639 in / 1031 out tokens · 23624 ms · 2026-05-16T12:37:58.418326+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Abramowitz, I

M. Abramowitz, I. A. Stegun, Handbook of mathematical functions with formulas, graphs, and mathematical tables, US Government Printing Of- fice, Washington, DC, 1964

work page 1964
[2]

F. W. Olver, D. W. Lozier, R. F. Boisvert, C. W. Clark, NIST handbook of mathematical functions, Cambridge University Press, Cambridge, UK, 2010

work page 2010
[3]

G. B. Arfken, H. J. Weber, F. E. Harris, Mathematical methods for physi- cists: A comprehensive guide, 7th Edition, Academic Press, Waltham, MA, 2013

work page 2013
[4]

E. F. Valeev, Libint: A library for the evaluation of molecular integrals of many-body operators over gaussian functions,http://libint.valeyev. net/, version 2.0 (2014)

work page 2014
[5]

B. P. Pritchard, E. Chow, Simint: A vectorized approach to molecular integral evaluation, Computer Physics Communications 221 (2017) 160– 171

work page 2017
[6]

Guennebaud, B

G. Guennebaud, B. Jacob, et al., Eigen v3,http://eigen.tuxfamily.org (2010)

work page 2010
[7]

Iglberger, G

K. Iglberger, G. Hager, J. Treibig, G. Wellein, High performance matrix- vector multiplication with BLAZE, in: 2012 International Conference on High Performance Computing & Simulation (HPCS), IEEE, 2012, pp. 632– 639

work page 2012
[8]

Dionne, Boost.hana: Your standard library for metaprogramming, https://www.boost.org/doc/libs/release/libs/hana/(2017)

L. Dionne, Boost.hana: Your standard library for metaprogramming, https://www.boost.org/doc/libs/release/libs/hana/(2017)

work page 2017
[9]

Ragan-Kelley, C

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amaras- inghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, in: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013, pp. 519–530

work page 2013
[10]

Kjolstad, S

F. Kjolstad, S. Kamil, J. Ragan-Kelley, D. I. Levin, S. Sueda, D. Chen, E. Vouga, D. M. Kaufman, G. Kanwar, W. Matusik, et al., The tensor algebra compiler, Proceedings of the ACM on Programming Languages 1 (OOPSLA) (2017) 1–29

work page 2017
[11]

Püschel, J

M. Püschel, J. M. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, et al., SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE93 (2) (2005) 232–275. 68

work page 2005
[12]

Baghdadi, J

R. Baghdadi, J. Ray, M. B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, S. Amarasinghe, Tiramisu: A polyhedral compiler for expressing fast and portable code, in: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE, 2019, pp. 193–205

work page 2019
[13]

Meurer, C

A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rock- lin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, et al., SymPy: symbolic computing in Python, PeerJ Computer Science 3 (2017) e103

work page 2017
[14]

J. Rys, M. Dupuis, H. F. King, Evaluation of two-electron repulsion in- tegrals over gaussian functions, The Journal of Computational Chemistry 4 (2) (1983) 154–157

work page 1983
[15]

Dupuis, J

M. Dupuis, J. Rys, H. F. King, Evaluation of molecular integrals over gaussian basis functions, The Journal of Chemical Physics 65 (1) (1976) 111–116

work page 1976
[16]

C. W. Clenshaw, A note on the summation of chebyshev series, Mathemat- ics of Computation 9 (51) (1955) 118–120

work page 1955
[17]

G. H. Golub, J. H. Welsch, Calculation of gauss quadrature rules, Mathe- matics of Computation 23 (106) (1969) 221–230

work page 1969
[18]

B. D. Shizgal, Spectral Methods in Chemistry and Physics: Applications to Kinetic Theory and Quantum Mechanics, Springer., 2015

work page 2015
[19]

I. S. Ufimtsev, T. J. Martínez, Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation, Journal of Chemical Theory and Computation 4 (2) (2008) 222–231.doi:10.1021/ct700268q

work page doi:10.1021/ct700268q 2008
[20]

I. S. Ufimtsev, T. J. Martínez, Quantum chemistry on graphical process- ing units. 2. Direct self-consistent-field implementation, Journal of Chem- ical Theory and Computation 5 (4) (2009) 1004–1015.doi:10.1021/ ct800526s

work page 2009
[21]

I. S. Ufimtsev, T. J. Martínez, Quantum chemistry on graphical processing units. 3. Analytical energy gradients, geometry optimization, and first prin- ciples molecular dynamics, Journal of Chemical Theory and Computation 5 (10) (2009) 2619–2628.doi:10.1021/ct9003004

work page doi:10.1021/ct9003004 2009
[22]

A. V. Titov, I. S. Ufimtsev, N. Luehr, T. J. Martínez, Generating efficient quantum chemistry codes for novel architectures, Journal of Chemical The- ory and Computation 9 (1) (2013) 213–221.doi:10.1021/ct300321a

work page doi:10.1021/ct300321a 2013
[23]

Y. Wang, D. Hait, K. G. Johnson, O. J. Fajen, J. H. Zhang, R. D. Guerrero, T. J. Martínez, Extending GPU-accelerated Gaussian integrals in the TeraChem software package to f type orbitals: Implementation and applications, The Journal of Chemical Physics 161 (2024) 174118. doi:10.1063/5.0233523. 69

work page doi:10.1063/5.0233523 2024
[24]

Google Inc., Google benchmark: A microbenchmark support library, https://github.com/google/benchmark(2015)

work page 2015
[25]

T. L. Veldhuizen, Arrays in blitz++, Computing in Object-Oriented Par- allel Environments (1998) 223–230. 70 0 1 2 3 4 5 6 7 8 Total Angular Momentum Ltotal 100 101 102 103 104 105 Time (nanoseconds) Coulomb Hermite Integrals: TMP vs Layered vs LayeredCodegen TMP Layered LayeredCodegen Figure 4:Coulomb auxiliary integrals reveal dramatic performance diffe...

work page 1998

[1] [1]

Abramowitz, I

M. Abramowitz, I. A. Stegun, Handbook of mathematical functions with formulas, graphs, and mathematical tables, US Government Printing Of- fice, Washington, DC, 1964

work page 1964

[2] [2]

F. W. Olver, D. W. Lozier, R. F. Boisvert, C. W. Clark, NIST handbook of mathematical functions, Cambridge University Press, Cambridge, UK, 2010

work page 2010

[3] [3]

G. B. Arfken, H. J. Weber, F. E. Harris, Mathematical methods for physi- cists: A comprehensive guide, 7th Edition, Academic Press, Waltham, MA, 2013

work page 2013

[4] [4]

E. F. Valeev, Libint: A library for the evaluation of molecular integrals of many-body operators over gaussian functions,http://libint.valeyev. net/, version 2.0 (2014)

work page 2014

[5] [5]

B. P. Pritchard, E. Chow, Simint: A vectorized approach to molecular integral evaluation, Computer Physics Communications 221 (2017) 160– 171

work page 2017

[6] [6]

Guennebaud, B

G. Guennebaud, B. Jacob, et al., Eigen v3,http://eigen.tuxfamily.org (2010)

work page 2010

[7] [7]

Iglberger, G

K. Iglberger, G. Hager, J. Treibig, G. Wellein, High performance matrix- vector multiplication with BLAZE, in: 2012 International Conference on High Performance Computing & Simulation (HPCS), IEEE, 2012, pp. 632– 639

work page 2012

[8] [8]

Dionne, Boost.hana: Your standard library for metaprogramming, https://www.boost.org/doc/libs/release/libs/hana/(2017)

L. Dionne, Boost.hana: Your standard library for metaprogramming, https://www.boost.org/doc/libs/release/libs/hana/(2017)

work page 2017

[9] [9]

Ragan-Kelley, C

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amaras- inghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, in: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013, pp. 519–530

work page 2013

[10] [10]

Kjolstad, S

F. Kjolstad, S. Kamil, J. Ragan-Kelley, D. I. Levin, S. Sueda, D. Chen, E. Vouga, D. M. Kaufman, G. Kanwar, W. Matusik, et al., The tensor algebra compiler, Proceedings of the ACM on Programming Languages 1 (OOPSLA) (2017) 1–29

work page 2017

[11] [11]

Püschel, J

M. Püschel, J. M. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, et al., SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE93 (2) (2005) 232–275. 68

work page 2005

[12] [12]

Baghdadi, J

R. Baghdadi, J. Ray, M. B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, S. Amarasinghe, Tiramisu: A polyhedral compiler for expressing fast and portable code, in: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE, 2019, pp. 193–205

work page 2019

[13] [13]

Meurer, C

A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rock- lin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, et al., SymPy: symbolic computing in Python, PeerJ Computer Science 3 (2017) e103

work page 2017

[14] [14]

J. Rys, M. Dupuis, H. F. King, Evaluation of two-electron repulsion in- tegrals over gaussian functions, The Journal of Computational Chemistry 4 (2) (1983) 154–157

work page 1983

[15] [15]

Dupuis, J

M. Dupuis, J. Rys, H. F. King, Evaluation of molecular integrals over gaussian basis functions, The Journal of Chemical Physics 65 (1) (1976) 111–116

work page 1976

[16] [16]

C. W. Clenshaw, A note on the summation of chebyshev series, Mathemat- ics of Computation 9 (51) (1955) 118–120

work page 1955

[17] [17]

G. H. Golub, J. H. Welsch, Calculation of gauss quadrature rules, Mathe- matics of Computation 23 (106) (1969) 221–230

work page 1969

[18] [18]

B. D. Shizgal, Spectral Methods in Chemistry and Physics: Applications to Kinetic Theory and Quantum Mechanics, Springer., 2015

work page 2015

[19] [19]

I. S. Ufimtsev, T. J. Martínez, Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation, Journal of Chemical Theory and Computation 4 (2) (2008) 222–231.doi:10.1021/ct700268q

work page doi:10.1021/ct700268q 2008

[20] [20]

I. S. Ufimtsev, T. J. Martínez, Quantum chemistry on graphical process- ing units. 2. Direct self-consistent-field implementation, Journal of Chem- ical Theory and Computation 5 (4) (2009) 1004–1015.doi:10.1021/ ct800526s

work page 2009

[21] [21]

I. S. Ufimtsev, T. J. Martínez, Quantum chemistry on graphical processing units. 3. Analytical energy gradients, geometry optimization, and first prin- ciples molecular dynamics, Journal of Chemical Theory and Computation 5 (10) (2009) 2619–2628.doi:10.1021/ct9003004

work page doi:10.1021/ct9003004 2009

[22] [22]

A. V. Titov, I. S. Ufimtsev, N. Luehr, T. J. Martínez, Generating efficient quantum chemistry codes for novel architectures, Journal of Chemical The- ory and Computation 9 (1) (2013) 213–221.doi:10.1021/ct300321a

work page doi:10.1021/ct300321a 2013

[23] [23]

Y. Wang, D. Hait, K. G. Johnson, O. J. Fajen, J. H. Zhang, R. D. Guerrero, T. J. Martínez, Extending GPU-accelerated Gaussian integrals in the TeraChem software package to f type orbitals: Implementation and applications, The Journal of Chemical Physics 161 (2024) 174118. doi:10.1063/5.0233523. 69

work page doi:10.1063/5.0233523 2024

[24] [24]

Google Inc., Google benchmark: A microbenchmark support library, https://github.com/google/benchmark(2015)

work page 2015

[25] [25]

T. L. Veldhuizen, Arrays in blitz++, Computing in Object-Oriented Par- allel Environments (1998) 223–230. 70 0 1 2 3 4 5 6 7 8 Total Angular Momentum Ltotal 100 101 102 103 104 105 Time (nanoseconds) Coulomb Hermite Integrals: TMP vs Layered vs LayeredCodegen TMP Layered LayeredCodegen Figure 4:Coulomb auxiliary integrals reveal dramatic performance diffe...

work page 1998