pith · machine review for the scientific record

arxiv: 2604.09558 · v1 · submitted 2026-02-11 · 💻 cs.DC · cs.LG · cs.PL

Recognition: 2 Lean theorem links

VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:26 UTC · model grok-4.3

classification 💻 cs.DC · cs.LG · cs.PL
keywords data movement · tensor compilation · memory · operators · virtual · average

The pith

VTC eliminates unnecessary data movement in DNN compilation using virtual tensors tracked by index mappings, achieving up to 1.93x speedup and 60% memory savings on NVIDIA GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a compiler framework called VTC for deep neural networks. It creates virtual tensors that represent data without actually moving it around in memory. Instead of copying tensors to and from global memory for each operation, the system uses index mappings to describe how data should be accessed. This virtual approach works alongside existing compute kernels and can handle any combination of tensor operators. A new algorithm decides when creating these virtual tensors will save time and memory. Tests on various neural networks show faster inference and lower memory use compared to current compilers.
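The index-mapping idea in the summary above can be sketched in a few lines of Python (an illustrative toy, not the paper's implementation): a transpose is represented as a view whose reads are redirected through a mapping into the original row-major buffer, so no bytes are moved when the view is created.

```python
# Illustrative sketch of a "virtual tensor": instead of materializing
# transpose(src) in memory, record an index mapping into src's buffer.

class PhysicalTensor:
    """A row-major flat buffer plus its 2D shape."""
    def __init__(self, data, shape):
        self.data, self.shape = data, shape

    def at(self, i, j):
        return self.data[i * self.shape[1] + j]

class VirtualTranspose:
    """Represents transpose(src) with zero data movement: reads are
    redirected through the index mapping (i, j) -> (j, i)."""
    def __init__(self, src):
        self.src = src
        self.shape = (src.shape[1], src.shape[0])

    def at(self, i, j):
        return self.src.at(j, i)  # index mapping, no copy

# A 2x3 physical tensor [[1, 2, 3], [4, 5, 6]] stored row-major.
t = PhysicalTensor([1, 2, 3, 4, 5, 6], (2, 3))
v = VirtualTranspose(t)  # O(1): no bytes moved
```

A downstream kernel that accepts an `at(i, j)` accessor can consume `v` exactly as it would a physical tensor, which is the interoperability the paper's claim hinges on.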

Core claim

VTC proposes the concept of virtual tensors, which track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory; these virtual tensors can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions.

Load-bearing premise

That virtual tensors can seamlessly interoperate with existing computation kernels without overhead and that the data movement elimination algorithm can automatically identify profitable strategies for arbitrary operator compositions in real workloads.
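The profitability half of this premise can be made concrete with a toy cost model (ours, not the paper's; every parameter name here is hypothetical): virtualizing a data-movement operator eliminates a global-memory copy, but taxes every downstream read that must now go through the index mapping, so the trade only pays off when the eliminated copy is large relative to the added read overhead.

```python
# Toy profitability test for virtualizing one data-movement operator.
# This is an illustrative cost model, not the paper's algorithm.

def profitable(tensor_bytes, n_downstream_reads,
               copy_bw_bytes_per_s, mapping_overhead_s_per_read):
    """Virtualize only if the eliminated copy outweighs the extra
    per-read cost of indirecting through the index mapping."""
    copy_cost = 2 * tensor_bytes / copy_bw_bytes_per_s  # read + write
    extra_read_cost = n_downstream_reads * mapping_overhead_s_per_read
    return copy_cost > extra_read_cost

# A 1 MiB tensor read 10 times downstream: virtualizing wins.
print(profitable(1 << 20, 10, 1e9, 1e-6))
# A 1 KiB tensor read a million times: materializing wins.
print(profitable(1024, 10**6, 1e9, 1e-6))
```

The paper's algorithm must make this call automatically for arbitrary operator compositions, which is why the reviewer flags it as load-bearing.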

Figures

Figures reproduced from arXiv: 2604.09558 by Ahan Gupta, Charith Mendis, Janardhan Kulkarni, Jiachen Yuan, Muyan Hu, Ofer Dekel, Taeksang Kim, Vikram Adve, Vima Gupta, Xin Xu.

Figure 1. Trend of compute/memory ratio for NVIDIA GPUs.
Figure 2. Motivating example: computation graph and latency.
Figure 3. An example of data movement operator elimination.
Figure 4. The overview of VTC.
Figure 5. Comparison between contiguous Split (performed on the first axis) and incontiguous Split (performed on other axes) on a 2D matrix stored in row-major order.
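Figure 5's contiguity distinction can be checked directly for a row-major matrix held in a flat buffer (a minimal sketch, not the paper's code): a Split on the first axis selects one contiguous run of flat offsets, so the resulting virtual tensor is just a base offset, while a Split on any other axis selects a strided, incontiguous set.

```python
# Flat-offset view of Split on a rows x cols row-major matrix, taking
# the second half of a split at index k on each axis. Illustrative only.

def split_axis0_offsets(rows, cols, k):
    """Flat offsets of the half covering rows [k, rows)."""
    return [k * cols + t for t in range((rows - k) * cols)]

def split_axis1_offsets(rows, cols, k):
    """Flat offsets of the half covering columns [k, cols)."""
    return [r * cols + c for r in range(rows) for c in range(k, cols)]

def is_contiguous(offsets):
    """True if the offsets form one unbroken run."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))

# 4x4 matrix split at k=2: first-axis split is one contiguous block,
# second-axis split skips two elements between every pair of rows.
print(is_contiguous(split_axis0_offsets(4, 4, 2)))
print(is_contiguous(split_axis1_offsets(4, 4, 2)))
```

The contiguous case needs only a pointer adjustment; the incontiguous case is what forces either a physical copy or an index mapping of the kind VTC introduces.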
Figure 6. (a) The computation graph. (b) The corresponding …
Figure 7. An example of a computation graph and conflicting …
Figure 8. Several valid points-to graphs of Figure …
Figure 9. End-to-end inference latency comparison on a single NVIDIA A100 GPU and H100 GPU with batch sizes 1 and 16.
Figure 10. Breakdown of latency proportions for data movement and computation.
Figure 12. VTC generates 5 kernels (k1–k5) for the EfficientViT attention block with batch size 16. All the unframed data movement operators are eliminated with virtual tensors. The figure on the right shows the virtual tensor strategy represented by a points-to graph.
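The points-to graph mentioned in Figure 12 can be modeled minimally as a map from each virtual tensor to its source tensor and index mapping (an assumed reading of the figure, not the paper's data structure): resolving a read follows points-to edges, composing mappings along the way, until it bottoms out at physical storage.

```python
# Minimal points-to model: virtual tensors chain to physical buffers.
# Names and the Reverse operator here are illustrative, not from the paper.

physical = {"A": [1, 2, 3, 4]}

# virtual name -> (source name, index mapping into the source)
points_to = {
    "A_rev":  ("A",     lambda i: 3 - i),  # Reverse, virtualized
    "A_rev2": ("A_rev", lambda i: 3 - i),  # Reverse again: identity overall
}

def resolve(name, i):
    """Follow points-to edges, composing index mappings, until a
    physical tensor is reached; then read the actual element."""
    while name in points_to:
        src, f = points_to[name]
        i = f(i)
        name = src
    return physical[name][i]

print(resolve("A_rev", 0))   # last element of A
print(resolve("A_rev2", 1))  # double reverse: original element
```

Because mappings compose, an arbitrary chain of virtualized data-movement operators still costs one indexed read at the end, which is the mechanism behind "eliminating" the unframed operators in the figure.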
Original abstract

With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a subset of tensor operators and consequently miss important opportunities for reducing data movement in contemporary DNN workloads, including large language models. We introduce VTC, a novel tensor compilation framework that for the first time eliminates all unnecessary data movement by targeting the full spectrum of data movement operators. VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory, which can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions. We also introduce a novel data movement elimination algorithm to automatically identify a profitable virtual tensor creation strategy. Evaluation on a variety of DNNs shows that VTC can outperform existing ML compilers by up to 1.93x (1.28x on average) on NVIDIA GPUs with up to 60% (17.5% on average) inference memory savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on interoperability of virtual tensors with existing kernels and effectiveness of the elimination algorithm; no explicit free parameters or fitted values mentioned.

axioms (1)
  • domain assumption: existing computation kernels can interoperate with virtual tensor index mappings without modification or overhead.
    Invoked when claiming seamless operation with current kernels.
invented entities (1)
  • virtual tensor (no independent evidence)
    purpose: track data movement via index mappings to avoid physical transfers.
    Core new concept introduced to eliminate data movement.

pith-pipeline@v0.9.0 · 5517 in / 1128 out tokens · 179509 ms · 2026-05-16T03:26:39.472438+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. XLA: Optimizing compiler for TensorFlow. https://www.tensorflow.org/xla, 2017.
  2. NVIDIA TensorRT: An SDK with an optimizer for high-performance deep learning inference. https://developer.nvidia.com/tensorrt, 2024.
  3. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  4. Lars Ole Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Copenhagen, 1994.
  5. Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
  6. Marc Auslander and Martin Hopkins. An overview of the PL.8 compiler. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, pages 22–31, 1982.
  7. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
  8. Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17302–17313, 2023.
  9. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
  10. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  11. Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.
  12. ONNX Runtime developers. ONNX Runtime. https://onnxruntime.ai/, 2024. Version 1.20.1.
  13. Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. Hidet: Task-mapping programming paradigm for deep learning tensor programs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 370–384, 2023.
  14. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  15. Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, et al. Optimal kernel orchestration for tensor programs with Korch. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
  16. Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Data movement is all you need: A case study on optimizing transformers. Proceedings of Machine Learning and Systems, 3:711–732, 2021.
  17. Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47–62, 2019.
  18. Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.
  19. Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3(71-104):3, 2014.
  20. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  21. Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
  22. Thirimadura Charith Yasendra Mendis. Towards automated construction of compiler optimizations. PhD thesis, Massachusetts Institute of Technology, 2020.
  23. Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. DNNFusion: accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 883–898, 2021.
  24. Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, and Bin Ren. SmartMem: Layout transformation elimination and adaptation for efficient DNN execution on mobile. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024.
  25. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023.
  26. Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
  27. Rya Sanovar, Srikant Bharadwaj, Renee St Amant, Victor Rühle, and Saravan Rajmohan. LeanAttention: Hardware-aware scalable attention mechanism for the decode-phase of transformers. arXiv preprint arXiv:2405.10480, 2024.
  28. Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 701–718, 2023.
  29. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
  30. Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
  31. Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. PET: Optimizing tensor programs with partially equivalent transformations and automated corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 37–54, 2021.
  32. Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 21–38, 2025.
  33. Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, et al. vTensor: Flexible virtual tensor management for efficient LLM serving. arXiv preprint arXiv:2407.15309, 2024.
  34. Jingling Xue. VTensor: Using virtual tensors to build a layout-oblivious AI programming framework. Journal of Computer Science and Technology, 38(5):1074–1097, 2023.
  35. Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3:255–268, 2021.
  36. Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005, 2025.
  37. Chen Zhang, Lingxiao Ma, Jilong Xue, Yining Shi, Ziming Miao, Fan Yang, Jidong Zhai, Zhi Yang, and Mao Yang. Cocktailer: Analyzing and optimizing dynamic control flow in deep learning. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 681–699, 2023.
  38. Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
  39. Yifan Zhao, Hashim Sharif, Vikram Adve, and Sasa Misailovic. Felix: Optimizing tensor programs with gradient descent. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 367–381, 2024.
  40. Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating high-performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 863–879, 2020.
  41. Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, et al. EINNET: Optimizing tensor programs with derivation-based transformations. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 739–755, 2023.
  42. Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 859–873, 2020.