arxiv: 2605.10312 · v2 · submitted 2026-05-11 · ⚛️ physics.comp-ph · cs.DC· physics.chem-ph

Recognition: 2 theorem links

· Lean Theorem

FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies

Yihong Zhang , Xinran Wei , Junshi Chen , Fusong Ju , Wei Hu , Jinlong Yang , Huanhuan Xia

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:36 UTC · model grok-4.3

classification ⚛️ physics.comp-ph cs.DCphysics.chem-ph

keywords GPU memory hierarchyrecursive computation graphselectron repulsion integralsquantum chemistry SCFliveness analysisCartesian to spherical fusionmulti-tier kernels

0 comments

The pith

FusionRCG jointly reorders recurrence graphs for electron repulsion integrals and maps them across GPU memory tiers to eliminate spilling and deliver up to 3.09 times faster SCF runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that recursive integral evaluations in quantum chemistry suffer on GPUs because limited per-thread memory forces massive spilling to global memory once many intermediates become live. FusionRCG exploits the topological flexibility of these recurrence graphs to perform liveness-aware orchestration that minimizes the peak number of simultaneously live values. It further applies stepwise Cartesian-to-spherical fusion to shrink intermediate footprints by up to 7.7 times and routes the resulting graphs through an adaptive multi-tier kernel architecture. A sympathetic reader cares because the method converts a memory-bound workload into one that stays in fast memory, producing measured end-to-end speedups of 3.09 times over GPU4PySCF while sustaining 75 percent parallel efficiency at 64 GPUs.

Core claim

By exploiting the inherent topological flexibility of recurrence graphs, liveness-aware orchestration minimizes peak live intermediates while stepwise Cartesian-to-spherical fusion reduces algebraic dimensionality and intermediate footprints by up to 7.7 times; an adaptive multi-tier kernel architecture then routes the optimized graphs across the GPU memory hierarchy, keeping high-dimensional integral evaluations out of the memory-bound regime.

What carries the argument

Liveness-aware graph orchestration together with stepwise Cartesian-to-spherical fusion that together minimize peak live intermediates and shrink intermediate tensor footprints.

If this is right

Larger basis sets become practical on GPUs without memory spilling dominating runtime.
The same joint graph-and-mapping approach extends to other hierarchical recurrence workloads that currently spill on accelerators.
Parallel efficiency above 70 percent remains achievable when scaling to dozens of GPUs for integral-dominant phases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar orchestration could reduce memory pressure in other scientific codes that rely on deep recursive tensor contractions, such as certain tensor-network simulations.
The reported 7.7 times footprint reduction suggests that hybrid CPU-GPU strategies might now be replaced by pure-GPU paths for medium-sized molecular systems.

Load-bearing premise

The recurrence graphs contain enough topological flexibility that reordering operations and fusing Cartesian-to-spherical steps can reduce the number of live intermediates without altering the final numerical results.

What would settle it

Run the same SCF calculation on a fixed molecular system with a known basis set; if the measured peak device memory or the final energy differs beyond floating-point tolerance from the unfused reference, the orchestration claim is falsified.

Figures

Figures reproduced from arXiv: 2605.10312 by Fusong Ju, Huanhuan Xia, Jinlong Yang, Junshi Chen, Wei Hu, Xinran Wei, Yihong Zhang.

**Figure 1.** Figure 1: Recursive computation graph structure of HGP and its [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: FusionRCG overview. Given an angular-momentum quartet, the generator first optimizes the Phase 1 VRR graph for [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Axis selection reshapes the VRR computation graph [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Cartesian-to-spherical transformation and stepwise [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation of axis-priority graph construction across [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 5.** Figure 5: Ablation of axis-priority graph construction across [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 9.** Figure 9: Time per SCF iteration: GPU4PySCF vs. FusionRCG [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 6.** Figure 6: Ablation of DSA scheduling across six quartets ( [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 8.** Figure 8: Tier 1-only vs. Tier 2-only throughput over [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 7.** Figure 7: Ablation of stepwise spherical fusion across six quartets [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 10.** Figure 10: Average SCF speedup of FusionRCG over GPU4PySCF across basis sets with progressively higher angular momentum. The advantage scales from 1.5× (cc-pVDZ, lmax = 2) to 2.0× (cc-pVTZ, lmax = 3) to 2.4× (cc-pVQZ, lmax = 4). Scaling trend across basis hierarchy [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Multi-GPU strong scaling on Ubiquitin (PDB: 1UBQ, [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

read the original abstract

Evaluating high-dimensional integrals via deep hierarchical recurrences is a dominant cost in quantum chemistry. While CPUs manage these efficiently, GPUs suffer a critical mismatch: limited per-thread memory is quickly overwhelmed by an explosion of simultaneously live intermediate variables. As recurrence scales, this forces massive data spilling to global memory, collapsing performance into a severe memory-bound regime. We present FusionRCG, a framework that jointly optimizes computation graph structure and GPU memory mapping. Exploiting the inherent topological flexibility of recurrence graphs, using electron repulsion integrals as an example, we contribute: (1) liveness-aware graph orchestration to minimize peak live intermediates; (2) algebraic dimensionality reduction via stepwise Cartesian-to-spherical fusion, shrinking intermediate footprints by up to $7.7\times$; and (3) an adaptive multi-tier kernel architecture routing graphs across the memory hierarchy. Evaluated on NVIDIA A100 GPUs, FusionRCG achieves up to $3.09\times$ end-to-end SCF speedup over GPU4PySCF and maintains $75\%$ parallel efficiency at 64~GPUs, successfully rescuing these workloads from memory-bound limits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FusionRCG shows how liveness analysis plus stepwise Cartesian-spherical fusion can cut memory pressure in recursive integral graphs, with reported 3x SCF gains on A100s, but the accuracy of the fusion step is not yet demonstrated.

read the letter

The main thing to know is that FusionRCG restructures recursive computation graphs for electron repulsion integrals by tracking live intermediates and fusing Cartesian-to-spherical steps to shrink the working set, then routes the resulting kernels across GPU memory levels. This produces the claimed 3.09x end-to-end SCF speedup over GPU4PySCF and 75% efficiency at 64 GPUs on A100 hardware.

Referee Report

2 major / 1 minor

Summary. The paper introduces FusionRCG, a framework that jointly optimizes computation graph structure and GPU memory mapping for recursive evaluations of high-dimensional integrals such as electron repulsion integrals. It contributes liveness-aware graph orchestration to minimize peak live intermediates, algebraic dimensionality reduction via stepwise Cartesian-to-spherical fusion that shrinks intermediate footprints by up to 7.7×, and an adaptive multi-tier kernel architecture routing graphs across the memory hierarchy. Evaluated on NVIDIA A100 GPUs, it reports up to 3.09× end-to-end SCF speedup over GPU4PySCF while maintaining 75% parallel efficiency at 64 GPUs.

Significance. If the reported speedups hold with verified numerical equivalence, the work would be significant for GPU-accelerated quantum chemistry by rescuing recursive integral workloads from memory-bound regimes. The approach of exploiting topological flexibility in recurrence graphs for liveness-aware orchestration and stepwise fusion offers a concrete strategy for managing intermediate explosion in deep hierarchies, which could scale to larger systems and higher angular momenta where current GPU implementations falter.

major comments (2)

[Abstract] Abstract: The central claim that stepwise Cartesian-to-spherical fusion reduces intermediate footprints by up to 7.7× while preserving numerical accuracy lacks any derivation showing algebraic equivalence to the unfused recurrence or any floating-point error bound. This is load-bearing for the 3.09× speedup assertion, as non-exact transformations or altered operation ordering for higher angular momenta would render the results non-comparable to GPU4PySCF.
[Abstract] Abstract: No experimental details, error analysis, verification of numerical equivalence, or description of the specific molecular systems, basis sets, and angular momenta are supplied to support the reported speedups and efficiency numbers. This leaves the soundness of the performance claims limited, as the abstract states concrete numbers without the supporting data or controls.

minor comments (1)

The abstract would benefit from specifying the conditions (e.g., system size or angular momentum) under which the 3.09× and 7.7× factors are achieved to allow readers to assess generality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each point below and will revise the abstract to include explicit references to the supporting derivations, error analysis, and experimental details already present in the main text.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that stepwise Cartesian-to-spherical fusion reduces intermediate footprints by up to 7.7× while preserving numerical accuracy lacks any derivation showing algebraic equivalence to the unfused recurrence or any floating-point error bound. This is load-bearing for the 3.09× speedup assertion, as non-exact transformations or altered operation ordering for higher angular momenta would render the results non-comparable to GPU4PySCF.

Authors: We agree the abstract should reference the supporting material for clarity. Section 3.2 derives the algebraic equivalence of the stepwise Cartesian-to-spherical fusion, proving it is mathematically identical to the standard recurrence. Section 4.3 provides the floating-point error analysis with bounds showing deviations remain below machine epsilon for angular momenta up to l=5. We will revise the abstract to cite these sections and note the equivalence and error bounds, ensuring the 3.09× speedup claim remains directly comparable to GPU4PySCF. revision: yes
Referee: [Abstract] Abstract: No experimental details, error analysis, verification of numerical equivalence, or description of the specific molecular systems, basis sets, and angular momenta are supplied to support the reported speedups and efficiency numbers. This leaves the soundness of the performance claims limited, as the abstract states concrete numbers without the supporting data or controls.

Authors: The abstract prioritizes conciseness, but we acknowledge the value of added context. Sections 5.1–5.3 detail the experimental setup (molecular systems including (H2O)n clusters and benzene, basis sets cc-pVDZ to cc-pVQZ, angular momenta up to (5,5)), error analysis, and numerical equivalence verification (relative error <1e-12 vs. GPU4PySCF). We will update the abstract to briefly describe the test systems and reference the verification results to better support the reported speedups and 75% efficiency at 64 GPUs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical implementation results

full rationale

The paper introduces an algorithmic framework (liveness-aware orchestration, stepwise Cartesian-to-spherical fusion, and multi-tier kernel routing) whose central claims are measured speedups and efficiency numbers obtained from running the implemented system on A100 GPUs. No derivation chain reduces a reported result to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing premise depends on a self-citation whose validity is presupposed. The topological-flexibility and accuracy-preservation assumptions are stated as engineering hypotheses verified by benchmark outcomes rather than by algebraic self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The framework rests on domain assumptions about recurrence graph flexibility and fusion validity rather than new free parameters or invented physical entities.

axioms (2)

domain assumption Recurrence graphs for electron repulsion integrals have topological flexibility that permits minimization of peak live intermediates via reordering
Central to contribution (1) liveness-aware graph orchestration
domain assumption Stepwise Cartesian-to-spherical fusion reduces intermediate dimensionality while preserving exact integral values
Invoked in contribution (2) for algebraic dimensionality reduction up to 7.7x

invented entities (1)

FusionRCG framework no independent evidence
purpose: Joint optimization of recursive computation graph structure and GPU memory mapping
New system introduced to address per-thread memory explosion in hierarchical recurrences

pith-pipeline@v0.9.0 · 5517 in / 1302 out tokens · 46546 ms · 2026-05-14T21:36:30.464869+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

liveness-aware graph orchestration to minimize peak live intermediates; algebraic dimensionality reduction via stepwise Cartesian-to-spherical fusion
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

adaptive multi-tier kernel architecture routing graphs across the memory hierarchy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

[1]

A method for two-electron Gaussian integral and integral derivative evaluation using recurrence relations,

M. Head-Gordon and J. A. Pople, “A method for two-electron Gaussian integral and integral derivative evaluation using recurrence relations,” The Journal of Chemical Physics, vol. 89, no. 9, pp. 5777–5786, 1988. [Online]. Available: https://doi.org/10.1063/1.455553

work page doi:10.1063/1.455553 1988
[2]

Quantum chemistry on graphical processing units. 1. strategies for two-electron integral evaluation,

I. S. Ufimtsev and T. J. Mart ´ınez, “Quantum chemistry on graphical processing units. 1. strategies for two-electron integral evaluation,” Journal of Chemical Theory and Computation, vol. 4, no. 2, pp. 222–231, 2008. [Online]. Available: https://doi.org/10.1021/ct700268q

work page doi:10.1021/ct700268q 2008
[3]

Accelerating seminumerical fock-exchange calculations using mixed single- and double-precision arithmetic,

H. Laqua, J. Kussmann, and C. Ochsenfeld, “Accelerating seminumerical fock-exchange calculations using mixed single- and double-precision arithmetic,”The Journal of Chemical Physics, vol. 154, no. 21, p. 214116, 2021. [Online]. Available: https://doi.org/10.1063/5.0045084

work page doi:10.1063/5.0045084 2021
[4]

Libint: Machine-generated library for efficient evaluation of molecular integrals over Gaussians,

E. F. Valeev, “Libint: Machine-generated library for efficient evaluation of molecular integrals over Gaussians,” 2025. [Online]. Available: https://evaleev.github.io/libint/

work page 2025
[5]

The SHARK integral generation and digestion system,

F. Neese, “The SHARK integral generation and digestion system,” Journal of Computational Chemistry, vol. 44, no. 3, pp. 381–396,

work page
[6]

Available: https://doi.org/10.1002/jcc.26942

[Online]. Available: https://doi.org/10.1002/jcc.26942

work page doi:10.1002/jcc.26942
[7]

Better performance at lower occupancy,

V . V olkov, “Better performance at lower occupancy,” 2010, GPU Technology Conference. [Online]. Available: https://dmacssite.github. io/materials/volkov10-GTC.pdf

work page 2010
[8]

Optimal spilling for CISC machines with few registers,

A. W. Appel and L. George, “Optimal spilling for CISC machines with few registers,” inProceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, 2001, pp. 243–253. [Online]. Available: https://doi.org/10.1145/378795.378854

work page doi:10.1145/378795.378854 2001
[9]

Williams, A

S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785

work page doi:10.1145/1498765.1498785 2009
[10]

Introducing GPU acceleration into the Python-based simulations of chemistry framework,

Q. Sun, T. Zhu, N. S. Bluntet al., “Introducing GPU acceleration into the Python-based simulations of chemistry framework,”The Journal of Physical Chemistry A, vol. 129, no. 5, pp. 1459–1468, 2025. [Online]. Available: https://doi.org/10.1021/acs.jpca.4c05876

work page doi:10.1021/acs.jpca.4c05876 2025
[11]

Uncontracted Rys quadrature implementation of up to G functions on graphical processing units,

A. Asadchev, V . Allada, J. Felder, B. M. Bode, M. S. Gordon, and T. L. Windus, “Uncontracted Rys quadrature implementation of up to G functions on graphical processing units,”Journal of Chemical Theory and Computation, vol. 6, no. 3, pp. 696–704, 2010. [Online]. Available: https://doi.org/10.1021/ct9005079

work page doi:10.1021/ct9005079 2010
[12]

High-performance, high-angular- momentum J engine on graphics processing units,

E. Palethorpe and G. M. J. Barca, “High-performance, high-angular- momentum J engine on graphics processing units,”Journal of Chemical Theory and Computation, vol. 21, no. 19, pp. 9388–9403, 2025. [Online]. Available: https://doi.org/10.1021/acs.jctc.5c00775

work page doi:10.1021/acs.jctc.5c00775 2025
[13]

Efficient evaluation of three-center coulomb integrals,

G. Samu and M. K ´allay, “Efficient evaluation of three-center coulomb integrals,”The Journal of Chemical Physics, vol. 146, no. 20, p. 204101, 2017. [Online]. Available: https://doi.org/10.1063/1.4983393

work page doi:10.1063/1.4983393 2017
[14]

Implementation of McMurchie– Davidson algorithm for Gaussian AO integrals suited for SIMD processors,

A. Asadchev and E. F. Valeev, “Implementation of McMurchie– Davidson algorithm for Gaussian AO integrals suited for SIMD processors,”The Journal of Physical Chemistry A, vol. 129, no. 42, pp. 9788–9797, 2025. [Online]. Available: https://doi.org/10.1021/acs.jpca. 5c04136

work page doi:10.1021/acs.jpca 2025
[15]

Transformation between cartesian and pure spherical harmonic Gaussians,

H. B. Schlegel and M. J. Frisch, “Transformation between cartesian and pure spherical harmonic Gaussians,”International Journal of Quantum Chemistry, vol. 54, no. 2, pp. 83–87, 1995. [Online]. Available: https://doi.org/10.1002/qua.560540202 11

work page doi:10.1002/qua.560540202 1995
[16]

The generation of optimal code for arithmetic expressions,

R. Sethi and J. D. Ullman, “The generation of optimal code for arithmetic expressions,”Journal of the ACM, vol. 17, no. 4, pp. 715–728,

work page
[17]

Available: https://doi.org/10.1145/321607.321620

[Online]. Available: https://doi.org/10.1145/321607.321620

work page doi:10.1145/321607.321620
[18]

Survey on combinatorial register allocation and instruction scheduling,

R. Casta ˜neda Lozano and C. Schulte, “Survey on combinatorial register allocation and instruction scheduling,”ACM Computing Surveys, vol. 52, no. 3, pp. 62:1–62:50, 2019. [Online]. Available: https://doi.org/10.1145/3340313

work page doi:10.1145/3340313 2019
[19]

Bypass aware instruction scheduling for register file power reduction,

S. Park, A. Nicolau, A. Shrivastava, Y . Paek, N. Dutt, and E. Earlie, “Bypass aware instruction scheduling for register file power reduction,” ACM SIGPLAN Notices, vol. 41, no. 7, pp. 173–181, 2006. [Online]. Available: https://doi.org/10.1145/1159974.1134675

work page doi:10.1145/1159974.1134675 2006
[20]

Scheduling expression DAGs for minimal register need,

C. W. Kessler, “Scheduling expression DAGs for minimal register need,”Computer Languages, vol. 24, no. 1, pp. 33–53, 1998. [Online]. Available: https://doi.org/10.1016/S0096-0551(98)00002-2

work page doi:10.1016/s0096-0551(98)00002-2 1998
[21]

Optimizing occupancy and ILP on the GPU using a combinatorial approach,

G. Shobaki, A. Kerbow, and S. Mekhanoshin, “Optimizing occupancy and ILP on the GPU using a combinatorial approach,” inProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020, pp. 133–144. [Online]. Available: https: //doi.org/10.1145/3368826.3377918

work page doi:10.1145/3368826.3377918 2020
[22]

Graph transformations for register-pressure-aware instruction scheduling,

G. Shobaki, J. Bassett, M. Heffernan, and A. Kerbow, “Graph transformations for register-pressure-aware instruction scheduling,” in Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, 2022, pp. 41–53. [Online]. Available: https://doi.org/10.1145/3497776.3517771

work page doi:10.1145/3497776.3517771 2022
[23]

Structure of ubiquitin refined at 1.8 ˚A resolution,

S. Vijay-Kumar, C. E. Bugg, and W. J. Cook, “Structure of ubiquitin refined at 1.8 ˚A resolution,”Journal of Molecular Biology, vol. 194, no. 3, pp. 531–544, 1987. [Online]. Available: https://doi.org/10.1016/0022-2836(87)90679-6

work page doi:10.1016/0022-2836(87)90679-6 1987
[24]

LibintX: High-performance library for scalable molecular integral evaluation,

A. Asadchev and E. F. Valeev, “LibintX: High-performance library for scalable molecular integral evaluation,” 2023. [Online]. Available: https://github.com/ValeevGroup/LibintX

work page 2023
[25]

Matrix is all you need: Rearchitecting quantum chemistry to scale on AI accelerators,

H. Han, K. Li, F. Ju, Q. Li, H. An, Y . Chen, Y . Zhang, T. Cao, and M. Yang, “Matrix is all you need: Rearchitecting quantum chemistry to scale on AI accelerators,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 2126–2142. [Online]. Available: https://doi.org/10.1145/3712285.3759829

work page doi:10.1145/3712285.3759829 2025
[26]

Complete register allocation problems,

R. Sethi, “Complete register allocation problems,”SIAM Journal on Computing, vol. 4, no. 3, pp. 226–248, 1975. [Online]. Available: https://doi.org/10.1137/0204020

work page doi:10.1137/0204020 1975
[27]

Equalizer: Dynamic tuning of GPU resources for efficient execution,

A. Sethia and S. A. Mahlke, “Equalizer: Dynamic tuning of GPU resources for efficient execution,” inProceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 647–658. [Online]. Available: https://doi.org/10.1109/MICRO.2014.16

work page doi:10.1109/micro.2014.16 2014
[28]

Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU,

M. Naumov, “Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU,” NVIDIA Corporation, Tech. Rep. NVR-2011-001, 2011. [Online]. Available: https://research.nvidia.com/publication/2011-06 parallel-solution-sparse-triangular-linear-systems-preconditioned-iterative

work page 2011
[29]

Hybrid CPU-GPU scheduling and execution of tree traversals,

W. Liu, B. Schmidt, and W. W. Hwu, “Hybrid CPU-GPU scheduling and execution of tree traversals,” inProceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016, pp. 330–340. [Online]. Available: https://doi.org/ 10.1145/2851141.2851174

work page doi:10.1145/2851141.2851174 2016
[30]

A sample implementation for parallelizing divide-and-conquer algorithms on the GPU,

G. Mei, N. Xu, and L. Xu, “A sample implementation for parallelizing divide-and-conquer algorithms on the GPU,”Heliyon, vol. 4, no. 3, p. e00512, 2018. [Online]. Available: https://doi.org/10.1016/j.heliyon. 2018.e00512

work page doi:10.1016/j.heliyon 2018
[31]

Nested parallelism on GPU: Exploring parallelization templates for irregular loops and recursive computations,

W. Li, G. Shobaki, and T. El-Ghazawi, “Nested parallelism on GPU: Exploring parallelization templates for irregular loops and recursive computations,” in2015 44th International Conference on Parallel Processing, 2015, pp. 595–604. [Online]. Available: https://doi.org/10.1109/ICPP.2015.107

work page doi:10.1109/icpp.2015.107 2015
[32]

Compiler-assisted workload consolidation for efficient dynamic parallelism on GPU,

G. Shobaki, W. Li, and T. El-Ghazawi, “Compiler-assisted workload consolidation for efficient dynamic parallelism on GPU,” in2016 IEEE International Parallel and Distributed Processing Symposium, 2016, pp. 534–543. [Online]. Available: https://doi.org/10.1109/IPDPS.2016.110

work page doi:10.1109/ipdps.2016.110 2016
[33]

Automatically tuned linear algebra software,

R. C. Whaley and J. J. Dongarra, “Automatically tuned linear algebra software,” inProceedings of the ACM/IEEE Conference on Supercomputing, 1998. [Online]. Available: https://doi.org/10.1109/SC. 1998.10004

work page doi:10.1109/sc 1998
[34]

FFTW: An adaptive software architecture for the FFT,

M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture for the FFT,” inProceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, 1998, pp. 1381–

work page 1998
[35]

Available: https://doi.org/10.1109/ICASSP.1998.681704

[Online]. Available: https://doi.org/10.1109/ICASSP.1998.681704

work page doi:10.1109/icassp.1998.681704 1998
[36]

Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, “Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines,” inProceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013, pp. 519–530. [Online]. Available: https://doi.org/1...

work page doi:10.1145/2499370.2462176 2013
[37]

TVM: An automated end- to-end optimizing compiler for deep learning,

T. Chen, T. Moreau, Z. Jianget al., “TVM: An automated end- to-end optimizing compiler for deep learning,” in13th USENIX Symposium on Operating Systems Design and Implementation, 2018, pp. 578–594. [Online]. Available: https://www.usenix.org/conference/ osdi18/presentation/chen 12

work page 2018