Recognition: 2 theorem links
Lean Theorem · VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
Pith reviewed 2026-05-16 03:26 UTC · model grok-4.3
The pith
VTC eliminates unnecessary data movement in DNN compilation using virtual tensors that track data movement via index mappings, achieving speedups of up to 1.93x (1.28x on average) and inference memory savings of up to 60% (17.5% on average) on NVIDIA GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VTC proposes virtual tensors that track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory; these virtual tensors can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions.
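To make the mechanism concrete, here is a minimal Python sketch of the virtual-tensor idea, assuming a virtual tensor is nothing more than physical storage plus a logical-to-physical index function; the VirtualTensor name and its fields are illustrative, not VTC's actual API.

```python
import numpy as np

class VirtualTensor:
    """A view over physical storage defined by an index mapping."""
    def __init__(self, storage, shape, index_map):
        self.storage = storage      # physical buffer, never copied
        self.shape = shape          # logical shape seen by consumers
        self.index_map = index_map  # logical index -> physical index

    def __getitem__(self, idx):
        # Every read is redirected through the mapping; the data-movement
        # operator (here, a transpose) is never materialized.
        return self.storage[self.index_map(idx)]

# A 2D transpose becomes a pure index remapping: no bytes move.
a = np.arange(6).reshape(2, 3)
a_t = VirtualTensor(a, shape=(3, 2), index_map=lambda i: (i[1], i[0]))
assert a_t[(2, 1)] == a[1, 2]
```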
Load-bearing premise
That virtual tensors can seamlessly interoperate with existing computation kernels without overhead and that the data movement elimination algorithm can automatically identify profitable strategies for arbitrary operator compositions in real workloads.
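A hedged sketch of the interoperation half of this premise, reusing the VirtualTensor from the sketch above: an existing elementwise kernel only performs indexed reads, so it runs unchanged over either a physical array or a virtual view. The kernel below is an illustrative stand-in, not a VTC kernel, and the premise under scrutiny is precisely that this indirection costs nothing in practice.

```python
import numpy as np

def relu_kernel(x, out_shape):
    # An "existing" compute kernel: it only needs indexed reads on its
    # input, so a VirtualTensor works unmodified in place of an ndarray.
    out = np.empty(out_shape)
    for idx in np.ndindex(*out_shape):
        out[idx] = max(0.0, float(x[idx]))
    return out

# Runs over the virtual transpose a_t from the previous sketch, no copy.
y = relu_kernel(a_t, out_shape=a_t.shape)
```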
Original abstract
With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a subset of tensor operators and consequently miss important opportunities for reducing data movement in contemporary DNN workloads, including large language models. We introduce VTC, a novel tensor compilation framework that for the first time eliminates all unnecessary data movement by targeting the full spectrum of data movement operators. VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory, which can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions. We also introduce a novel data movement elimination algorithm to automatically identify a profitable virtual tensor creation strategy. Evaluation on a variety of DNNs shows that VTC can outperform existing ML compilers by up to 1.93x (1.28x on average) on NVIDIA GPUs with up to 60% (17.5% on average) inference memory savings.
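The abstract's claim that virtual tensors "handle arbitrary tensor operator compositions" comes down to index maps composing like ordinary functions. A minimal sketch, assuming each data-movement operator is a pure logical-to-physical remapping (all names are illustrative):

```python
import numpy as np

def transpose_map(idx):           # logical (i, j) -> physical (j, i)
    return (idx[1], idx[0])

def slice_rows_map(start):        # logical (i, j) -> (i + start, j)
    return lambda idx: (idx[0] + start, idx[1])

def compose(f, g):                # apply g's remap first, then f's
    return lambda idx: f(g(idx))

a = np.arange(12).reshape(3, 4)
# view = transpose(a)[1:, :] expressed as one composite index map: the
# intermediate transposed tensor is never materialized in memory.
view_map = compose(transpose_map, slice_rows_map(1))
assert a[view_map((0, 0))] == a.T[1:, :][0, 0]
```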
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing computation kernels can interoperate with virtual tensor index mappings without modification or overhead.
invented entities (1)
- virtual tensor · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "virtual tensor opportunity graph (VTOG) and global greedy algorithm" (a selection sketch follows below)
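The second linked passage names VTC's virtual tensor opportunity graph (VTOG) and a global greedy algorithm. Below is a minimal sketch of one plausible reading of that selection step, assuming each virtualization opportunity carries an estimated benefit and that some pairs conflict; this is an assumption about the algorithm's shape, not VTC's published pseudocode.

```python
def greedy_select(opportunities, conflicts):
    """opportunities: {name: estimated benefit}; conflicts: set of pairs
    of opportunities that cannot both be virtualized."""
    chosen = set()
    # Visit opportunities in order of decreasing estimated benefit and
    # accept each one that does not conflict with anything already chosen.
    for name in sorted(opportunities, key=opportunities.get, reverse=True):
        if all((name, c) not in conflicts and (c, name) not in conflicts
               for c in chosen):
            chosen.add(name)
    return chosen

# Toy VTOG: virtualizing the reshape conflicts with virtualizing the slice.
ops = {"transpose": 5.0, "reshape": 3.0, "slice": 2.0}
print(greedy_select(ops, conflicts={("reshape", "slice")}))
# -> {'transpose', 'reshape'}
```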
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] XLA: Optimizing compiler for TensorFlow. https://www.tensorflow.org/xla, 2017.
- [2] NVIDIA TensorRT: An SDK with an optimizer for high-performance deep learning inference. https://developer.nvidia.com/tensorrt, 2024.
- [3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- [4] Lars Ole Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Copenhagen, 1994.
- [5] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
- [6] Marc Auslander and Martin Hopkins. An overview of the PL.8 compiler. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, pages 22–31, 1982.
- [7] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
- [8] Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17302–17313, 2023.
- [9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
- [10] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- [11] Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.
- [12] ONNX Runtime developers. ONNX Runtime. https://onnxruntime.ai/, 2024. Version 1.20.1.
- [13] Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. Hidet: Task-mapping programming paradigm for deep learning tensor programs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 370–384, 2023.
- [14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [15] Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, et al. Optimal kernel orchestration for tensor programs with Korch. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
- [16] Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Data movement is all you need: A case study on optimizing transformers. Proceedings of Machine Learning and Systems, 3:711–732, 2021.
- [17] Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47–62, 2019.
- [18] Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.
- [19] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3(71-104):3, 2014.
- [20] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- [21] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
- [22] Thirimadura Charith Yasendra Mendis. Towards automated construction of compiler optimizations. PhD thesis, Massachusetts Institute of Technology, 2020.
- [23] Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. DNNFusion: accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 883–898, 2021.
- [24] Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, and Bin Ren. SmartMem: Layout transformation elimination and adaptation for efficient DNN execution on mobile. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024.
- [25] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023.
- [26] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
- [27] Rya Sanovar, Srikant Bharadwaj, Renee St Amant, Victor Rühle, and Saravan Rajmohan. Lean Attention: Hardware-aware scalable attention mechanism for the decode-phase of transformers. arXiv preprint arXiv:2405.10480, 2024.
- [28] Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 701–718, 2023.
- [29] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [30] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
- [31] Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. PET: Optimizing tensor programs with partially equivalent transformations and automated corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 37–54, 2021.
- [32] Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 21–38, 2025.
- [33] Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, et al. vTensor: Flexible virtual tensor management for efficient LLM serving. arXiv preprint arXiv:2407.15309, 2024.
- [34] Jingling Xue. VTensor: Using virtual tensors to build a layout-oblivious AI programming framework. Journal of Computer Science and Technology, 38(5):1074–1097, 2023.
- [35] Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3:255–268, 2021.
- [36] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005, 2025.
- [37] Chen Zhang, Lingxiao Ma, Jilong Xue, Yining Shi, Ziming Miao, Fan Yang, Jidong Zhai, Zhi Yang, and Mao Yang. Cocktailer: Analyzing and optimizing dynamic control flow in deep learning. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 681–699, 2023.
- [38] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
- [39] Yifan Zhao, Hashim Sharif, Vikram Adve, and Sasa Misailovic. Felix: Optimizing tensor programs with gradient descent. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 367–381, 2024.
- [40] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating high-performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 863–879, 2020.
- [41] Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, et al. EINNET: Optimizing tensor programs with derivation-based transformations. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 739–755, 2023.
- [42] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 859–873, 2020.