Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
Pith reviewed 2026-05-18 08:59 UTC · model grok-4.3
The pith
Neptune enables better GPU performance for attention by fusing reduction operators through dependency breaking and algebraic corrections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. Applying Neptune's advanced operator fusion to a plain attention operator generates operators equivalent to FlashAttention and FlashDecoding. On ten attention-based benchmarks across four GPU architectures, Neptune-generated kernels achieve an average speedup of 1.35× over the next best alternative.
What carries the argument
Algebraic correction expressions built after intentionally breaking dependencies in reduction operator sequences.
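As an illustration only, the best-known instance of this idea is the online-softmax rescaling that FlashAttention relies on. The sketch below is plain Python, not Neptune output, and the function names are invented for this example. The softmax denominator normally depends on the final maximum. The single-pass version breaks that dependency by using the maximum seen so far, then multiplies the partial sum by a correction factor whenever the maximum changes.

```python
import math

def softmax_denominator_two_pass(scores):
    """Reference: the maximum is fully known before any exponent is taken."""
    m = max(scores)
    return m, sum(math.exp(s - m) for s in scores)

def softmax_denominator_online(scores):
    """Single pass: the running sum does not wait for the final maximum.

    Whenever the running maximum changes, the partial sum is rescaled by
    exp(old_max - new_max). That factor is the algebraic correction.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # On the first step this is exp(-inf) == 0.0, which discards the empty sum.
        correction = math.exp(running_max - new_max)
        running_sum = running_sum * correction + math.exp(s - new_max)
        running_max = new_max
    return running_max, running_sum
```

Both functions return the same maximum and, up to floating-point rounding, the same denominator. The second never needs to revisit earlier elements, which is what makes fusion into one loop possible.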
If this is right
- Generates operators equivalent to FlashAttention and FlashDecoding from plain attention code.
- Delivers an average 1.35× speedup over the next best alternative among Triton, TVM, and FlexAttention on ten attention benchmarks.
- Achieves up to 2.65× speedup on NVIDIA GPUs and up to 3.32× on AMD GPUs.
- Applies effectively to deep learning workloads with complex reduction computations.
Where Pith is reading between the lines
- This technique could potentially extend to fusion of other types of operators beyond reductions in ML models.
- High-level scheduling templates might become a standard way to guide compilers without low-level manual tuning.
- Such dependency-breaking with corrections might simplify the development of custom kernels for new hardware.
Load-bearing premise
The algebraic correction expressions always yield mathematically equivalent results to the original dependent computations for the reduction sequences involved.
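For the softmax-weighted sum at the heart of attention, the premise can be stated concretely. The derivation below is the standard online-softmax identity, given as a worked example under the assumption that the corrections take this familiar form. The symbols $s_i$ (scores), $v_i$ (values), $m_k$, $\ell_k$, and $o_k$ are notation chosen here, not taken from the paper.

```latex
% Prefix quantities after k elements:
\begin{align*}
m_k    &= \max_{i \le k} s_i, &
\ell_k &= \sum_{i \le k} e^{s_i - m_k}, &
o_k    &= \sum_{i \le k} e^{s_i - m_k}\, v_i .
\end{align*}

% Recurrences with the correction factor c_k = e^{m_{k-1} - m_k}:
\begin{align*}
m_k    &= \max(m_{k-1}, s_k), \\
\ell_k &= c_k\, \ell_{k-1} + e^{s_k - m_k}, \\
o_k    &= c_k\, o_{k-1} + e^{s_k - m_k}\, v_k .
\end{align*}

% Why the correction is exact in real arithmetic:
\begin{align*}
c_k\, \ell_{k-1}
  = e^{m_{k-1} - m_k} \sum_{i \le k-1} e^{s_i - m_{k-1}}
  = \sum_{i \le k-1} e^{s_i - m_k}.
\end{align*}

% Final result after n elements:
\begin{align*}
\frac{o_n}{\ell_n}
  = \frac{\sum_{i \le n} e^{s_i - m_n}\, v_i}{\sum_{i \le n} e^{s_i - m_n}}
  = \sum_{i \le n} \operatorname{softmax}(s)_i\, v_i .
\end{align*}
```

The identity holds exactly over the reals. In floating point the two sides agree only up to rounding, so equivalence claims for fused kernels are normally stated within a tolerance.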
What would settle it
Observing a discrepancy in the numerical output between the Neptune fused kernel and the original attention computation on any of the benchmarks would indicate the corrections do not preserve correctness.
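A minimal sketch of such a check, assuming a single query vector and a FlashAttention-style tiled loop standing in for the fused kernel. Every name here (`attention_reference`, `attention_tiled`, `check_equivalence`) and the tolerance of 1e-9 are hypothetical choices for this illustration. A real test would compare the compiler's generated GPU kernel against a trusted reference instead.

```python
import math
import random

def attention_reference(q, keys, values):
    """Unfused attention for one query: softmax(q . K^T) applied to V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_tiled(q, keys, values, tile=4):
    """Tile-by-tile attention that rescales its accumulators per tile."""
    dim = len(values[0])
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale old partials
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

def check_equivalence(seq_len=37, dim=8, tile=4, scale=1.0, seed=0, tol=1e-9):
    """Return (passed, max_abs_diff) for one random input configuration."""
    rng = random.Random(seed)
    q = [rng.gauss(0.0, scale) for _ in range(dim)]
    keys = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(seq_len)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
    ref = attention_reference(q, keys, values)
    out = attention_tiled(q, keys, values, tile)
    diff = max(abs(x - y) for x, y in zip(ref, out))
    return diff <= tol, diff
```

Sweeping `tile` and `scale` varies the reduction order and the dynamic range of the scores, the two axes along which a flawed correction would be most likely to show up.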
Original abstract
Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. This paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. Applying Neptune's advanced operator fusion to a plain attention operator generates operators equivalent to FlashAttention and FlashDecoding. On ten attention-based benchmarks, Neptune, starting from a plain attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have an average speedup of $1.35\times$ over the next best alternative, with up to $2.65\times$ speedup on Nvidia GPUs and up to $3.32\times$ on AMD GPUs, demonstrating its effectiveness for deep learning workloads.
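The FlashDecoding-style result mentioned in the abstract can be pictured as the same correction applied across independent chunks of the reduction axis, which is where the extra parallelism comes from. The following is a hedged, host-side sketch in plain Python with scalar values. The helper names `partial_state`, `merge`, and `split_softmax_weighted_sum` are invented for this example and say nothing about how Neptune structures its generated kernels.

```python
import math

def partial_state(scores, values):
    """Reduce one chunk on its own: (local max, local sum, local weighted sum)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))

def merge(a, b):
    """Combine two partial states by rescaling both to their shared maximum."""
    m = max(a[0], b[0])
    ca, cb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * ca + b[1] * cb, a[2] * ca + b[2] * cb

def split_softmax_weighted_sum(scores, values, num_splits):
    """Split the reduction axis into chunks, reduce each, then merge the results."""
    n = len(scores)
    chunk = -(-n // num_splits)  # ceiling division
    states = [partial_state(scores[i:i + chunk], values[i:i + chunk])
              for i in range(0, n, chunk)]
    state = states[0]
    for other in states[1:]:
        state = merge(state, other)
    return state[2] / state[1]
```

Because each call to `partial_state` touches only its own chunk, the chunks could be processed concurrently. Only the short `merge` step needs results from more than one chunk.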
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Neptune, a tensor compiler for advanced operator fusion on sequences of reduction operators with loop-carried dependencies (e.g., attention). It achieves fusion by intentionally breaking dependencies and compensating via constructed algebraic correction expressions, claiming to produce FlashAttention- and FlashDecoding-equivalent operators from plain attention code plus a high-level scheduling template. On ten attention benchmarks across four NVIDIA and AMD GPUs, Neptune reports an average 1.35× speedup over the next-best compiler (Triton, TVM, FlexAttention), with peaks of 2.65× (NVIDIA) and 3.32× (AMD).
Significance. If the algebraic corrections are shown to preserve mathematical equivalence across the targeted reduction patterns, the technique would offer a principled route to automatic fusion of complex reductions that current compilers handle poorly, with clear practical value for attention-heavy workloads. The cross-architecture speedups are a positive signal, but the absence of any derivation details, equivalence arguments, or numerical validation in the manuscript limits the strength of the contribution until those elements are supplied.
major comments (2)
- Abstract: the central performance claims rest on the assertion that the algebraic correction expressions 'allow the kernel to produce the correct result' and yield FlashAttention-equivalent operators. No derivation of the corrections, symbolic equivalence argument, machine-checked proof, or numerical stress-test protocol across input ranges and reduction orders is supplied, leaving the equivalence claim unverified and directly undermining the reported speedups.
- Experimental section (implied by benchmark results): the reported average 1.35× speedup (and per-architecture maxima) on ten benchmarks is presented without error bars, full experimental protocol, or reproducibility artifacts, making it impossible to assess whether the gains are robust or sensitive to floating-point ordering, overflow, or unhandled edge cases in the correction expressions.
minor comments (2)
- Abstract: the high-level scheduling template is mentioned but never characterized; a short description of its structure and how it exposes fusion opportunities would improve clarity.
- Consider adding a table that lists the ten benchmarks together with the exact speedup numbers against each baseline for quick reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
- Referee: Abstract: the central performance claims rest on the assertion that the algebraic correction expressions 'allow the kernel to produce the correct result' and yield FlashAttention-equivalent operators. No derivation of the corrections, symbolic equivalence argument, machine-checked proof, or numerical stress-test protocol across input ranges and reduction orders is supplied, leaving the equivalence claim unverified and directly undermining the reported speedups.
  Authors: The manuscript explains the construction of algebraic correction expressions that compensate for intentionally broken loop-carried dependencies in reduction sequences, enabling generation of FlashAttention-equivalent operators from plain attention code. We acknowledge that the current presentation provides only a high-level description without a full step-by-step symbolic derivation or explicit equivalence argument in the main text or appendix. In the revised manuscript we will add a dedicated subsection (or appendix) that derives the correction expressions for the targeted attention reduction patterns and presents a symbolic argument establishing mathematical equivalence to the unfused computation. We will also include a numerical validation protocol together with results across representative input ranges and reduction orders to confirm that results match within floating-point tolerance. A machine-checked proof lies outside the scope of this systems paper, but the added algebraic argument and empirical checks will directly address the verification concern.
  Revision: yes
- Referee: Experimental section (implied by benchmark results): the reported average 1.35× speedup (and per-architecture maxima) on ten benchmarks is presented without error bars, full experimental protocol, or reproducibility artifacts, making it impossible to assess whether the gains are robust or sensitive to floating-point ordering, overflow, or unhandled edge cases in the correction expressions.
  Authors: We agree that the experimental reporting can be improved to allow readers to evaluate robustness. The revised manuscript will add error bars derived from multiple independent runs, a detailed experimental protocol section (including hardware configurations, software versions, benchmark construction, and measurement methodology), and explicit reproducibility artifacts such as a public code repository link and scripts. We will further include results from targeted stress tests on the correction expressions that examine sensitivity to floating-point ordering, potential overflow conditions, and edge-case inputs.
  Revision: yes
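One way the promised error bars could be computed is sketched below, using only the Python standard library. The helper names and the idea of pairing baseline and candidate runs are assumptions made for illustration, not the authors' stated methodology. Timing an asynchronous GPU kernel would also need a device synchronization before each clock read, which this host-only sketch omits.

```python
import statistics
import time

def time_callable(fn, repeats=5):
    """Run fn several times and return the wall-clock samples in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples

def summarize_runs(times_ms):
    """Return (mean, sample standard deviation) of repeated timings."""
    return statistics.mean(times_ms), statistics.stdev(times_ms)

def speedup_with_spread(baseline_ms, candidate_ms):
    """Return (mean, min, max) of per-run speedup ratios over paired runs."""
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return statistics.mean(ratios), min(ratios), max(ratios)
```

Reporting the minimum and maximum ratio alongside the mean lets a reader see whether a headline average is driven by a few outlier runs.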
Circularity Check
No circularity: performance claims rest on external benchmark comparisons
The paper's central results are empirical speedups measured by executing Neptune-generated kernels on ten fixed attention benchmarks and comparing runtimes against independent external systems (Triton, TVM, FlexAttention). No equations, fitted parameters, or self-citations are shown that would make any reported speedup equivalent to an internal input by construction. The algebraic correction step is presented as a mechanism to restore equivalence after dependency breaking, but the performance numbers themselves are obtained from direct, externally verifiable execution rather than from any self-referential derivation or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Algebraic correction expressions can be constructed that restore exact equivalence after selected dependencies are broken inside reduction loops.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
  Ada-MK fuses LLM operators into persistent MegaKernels via MLIR DAG search and 3D shared-memory modeling, delivering up to 23.6% higher single-batch throughput than TensorRT-LLM on NVIDIA L20.