pith · machine review for the scientific record

arxiv: 2604.09558 · v1 · submitted 2026-02-11 · 💻 cs.DC · cs.LG · cs.PL

Recognition: 2 Lean theorem links

VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:26 UTC · model grok-4.3

classification 💻 cs.DC · cs.LG · cs.PL
keywords data movement · tensor compilation · memory · operators · virtual · average

The pith

VTC eliminates unnecessary data movement in DNN compilation using virtual tensors tracked by index mappings, achieving up to 1.93x speedup and 60% memory savings on NVIDIA GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a compiler framework called VTC for deep neural networks. It creates virtual tensors that represent data without actually moving it around in memory. Instead of copying tensors to and from global memory for each operation, the system uses index mappings to describe how data should be accessed. This virtual approach works alongside existing compute kernels and can handle any combination of tensor operators. A new algorithm decides when creating these virtual tensors will save time and memory. Tests on various neural networks show faster inference and lower memory use compared to current compilers.
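The index-mapping idea in the summary above can be sketched in a few lines of Python (an illustrative toy, not the paper's implementation): a transpose is represented as a view whose reads are redirected through a mapping into the original row-major buffer, so no bytes are moved when the view is created.

```python
# Illustrative sketch of a "virtual tensor": instead of materializing
# transpose(src) in memory, record an index mapping into src's buffer.

class PhysicalTensor:
    """A row-major flat buffer plus its 2D shape."""
    def __init__(self, data, shape):
        self.data, self.shape = data, shape

    def at(self, i, j):
        return self.data[i * self.shape[1] + j]

class VirtualTranspose:
    """Represents transpose(src) with zero data movement: reads are
    redirected through the index mapping (i, j) -> (j, i)."""
    def __init__(self, src):
        self.src = src
        self.shape = (src.shape[1], src.shape[0])

    def at(self, i, j):
        return self.src.at(j, i)  # index mapping, no copy

# A 2x3 physical tensor [[1, 2, 3], [4, 5, 6]] stored row-major.
t = PhysicalTensor([1, 2, 3, 4, 5, 6], (2, 3))
v = VirtualTranspose(t)  # O(1): no bytes moved
```

A downstream kernel that accepts an `at(i, j)` accessor can consume `v` exactly as it would a physical tensor, which is the interoperability the paper's claim hinges on.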

Core claim

VTC proposes the concept of virtual tensors, which track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory; these virtual tensors can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions.

Load-bearing premise

That virtual tensors can seamlessly interoperate with existing computation kernels without overhead and that the data movement elimination algorithm can automatically identify profitable strategies for arbitrary operator compositions in real workloads.
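The profitability half of this premise can be made concrete with a toy cost model (ours, not the paper's; every parameter name here is hypothetical): virtualizing a data-movement operator eliminates a global-memory copy, but taxes every downstream read that must now go through the index mapping, so the trade only pays off when the eliminated copy is large relative to the added read overhead.

```python
# Toy profitability test for virtualizing one data-movement operator.
# This is an illustrative cost model, not the paper's algorithm.

def profitable(tensor_bytes, n_downstream_reads,
               copy_bw_bytes_per_s, mapping_overhead_s_per_read):
    """Virtualize only if the eliminated copy outweighs the extra
    per-read cost of indirecting through the index mapping."""
    copy_cost = 2 * tensor_bytes / copy_bw_bytes_per_s  # read + write
    extra_read_cost = n_downstream_reads * mapping_overhead_s_per_read
    return copy_cost > extra_read_cost

# A 1 MiB tensor read 10 times downstream: virtualizing wins.
print(profitable(1 << 20, 10, 1e9, 1e-6))
# A 1 KiB tensor read a million times: materializing wins.
print(profitable(1024, 10**6, 1e9, 1e-6))
```

The paper's algorithm must make this call automatically for arbitrary operator compositions, which is why the reviewer flags it as load-bearing.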

Figures

Figures reproduced from arXiv: 2604.09558 by Ahan Gupta, Charith Mendis, Janardhan Kulkarni, Jiachen Yuan, Muyan Hu, Ofer Dekel, Taeksang Kim, Vikram Adve, Vima Gupta, Xin Xu.

Figure 1. Trend of compute/memory ratio for NVIDIA GPUs.
Figure 2. Motivating example: computation graph and latency.
Figure 3. An example of data movement operator elimination.
Figure 4. The overview of VTC.
Figure 5. Comparison between contiguous Split (performed on the first axis) and incontiguous Split (performed on other axes) on a 2D matrix stored in row-major order.
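Figure 5's contiguity distinction can be checked directly for a row-major matrix held in a flat buffer (a minimal sketch, not the paper's code): a Split on the first axis selects one contiguous run of flat offsets, so the resulting virtual tensor is just a base offset, while a Split on any other axis selects a strided, incontiguous set.

```python
# Flat-offset view of Split on a rows x cols row-major matrix, taking
# the second half of a split at index k on each axis. Illustrative only.

def split_axis0_offsets(rows, cols, k):
    """Flat offsets of the half covering rows [k, rows)."""
    return [k * cols + t for t in range((rows - k) * cols)]

def split_axis1_offsets(rows, cols, k):
    """Flat offsets of the half covering columns [k, cols)."""
    return [r * cols + c for r in range(rows) for c in range(k, cols)]

def is_contiguous(offsets):
    """True if the offsets form one unbroken run."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))

# 4x4 matrix split at k=2: first-axis split is one contiguous block,
# second-axis split skips two elements between every pair of rows.
print(is_contiguous(split_axis0_offsets(4, 4, 2)))
print(is_contiguous(split_axis1_offsets(4, 4, 2)))
```

The contiguous case needs only a pointer adjustment; the incontiguous case is what forces either a physical copy or an index mapping of the kind VTC introduces.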
Figure 6. (a) The computation graph. (b) The corresponding …
Figure 7. An example of a computation graph and conflicting …
Figure 8. Several valid points-to graphs of Figure …
Figure 9. End-to-end inference latency comparison on a single NVIDIA A100 GPU and H100 GPU with batch sizes 1 and 16.
Figure 10. Breakdown of latency proportions for data movement and computation.
Figure 12. VTC generates 5 kernels (k1–k5) for the EfficientViT attention block with batch size 16. All the unframed data movement operators are eliminated with virtual tensors. The figure on the right shows the virtual tensor strategy represented by a points-to graph.
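The points-to graph mentioned in Figure 12 can be modeled minimally as a map from each virtual tensor to its source tensor and index mapping (an assumed reading of the figure, not the paper's data structure): resolving a read follows points-to edges, composing mappings along the way, until it bottoms out at physical storage.

```python
# Minimal points-to model: virtual tensors chain to physical buffers.
# Names and the Reverse operator here are illustrative, not from the paper.

physical = {"A": [1, 2, 3, 4]}

# virtual name -> (source name, index mapping into the source)
points_to = {
    "A_rev":  ("A",     lambda i: 3 - i),  # Reverse, virtualized
    "A_rev2": ("A_rev", lambda i: 3 - i),  # Reverse again: identity overall
}

def resolve(name, i):
    """Follow points-to edges, composing index mappings, until a
    physical tensor is reached; then read the actual element."""
    while name in points_to:
        src, f = points_to[name]
        i = f(i)
        name = src
    return physical[name][i]

print(resolve("A_rev", 0))   # last element of A
print(resolve("A_rev2", 1))  # double reverse: original element
```

Because mappings compose, an arbitrary chain of virtualized data-movement operators still costs one indexed read at the end, which is the mechanism behind "eliminating" the unframed operators in the figure.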
Original abstract

With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a subset of tensor operators and consequently miss important opportunities for reducing data movement in contemporary DNN workloads, including large language models. We introduce VTC, a novel tensor compilation framework that for the first time eliminates all unnecessary data movement by targeting the full spectrum of data movement operators. VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory, which can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions. We also introduce a novel data movement elimination algorithm to automatically identify a profitable virtual tensor creation strategy. Evaluation on a variety of DNNs shows that VTC can outperform existing ML compilers by up to 1.93x (1.28x on average) on NVIDIA GPUs with up to 60% (17.5% on average) inference memory savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on interoperability of virtual tensors with existing kernels and effectiveness of the elimination algorithm; no explicit free parameters or fitted values mentioned.

axioms (1)
  • domain assumption: existing computation kernels can interoperate with virtual tensor index mappings without modification or overhead.
    Invoked when claiming seamless operation with current kernels.
invented entities (1)
  • virtual tensor (no independent evidence)
    purpose: track data movement via index mappings to avoid physical transfers.
    Core new concept introduced to eliminate data movement.

pith-pipeline@v0.9.0 · 5517 in / 1128 out tokens · 179509 ms · 2026-05-16T03:26:39.472438+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. XLA: Optimizing compiler for TensorFlow. https://www.tensorflow.org/xla, 2017.
  2. NVIDIA TensorRT: An SDK with an optimizer for high-performance deep learning inference. https://developer.nvidia.com/tensorrt, 2024.
  3. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  4. Lars Ole Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Copenhagen, 1994.
  5. Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
  6. Marc Auslander and Martin Hopkins. An overview of the PL.8 compiler. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, pages 22–31, 1982.
  7. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
  8. Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17302–17313, 2023.
  9. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
  10. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  11. Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.
  12. ONNX Runtime developers. ONNX Runtime. https://onnxruntime.ai/, 2024. Version 1.20.1.
  13. Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. Hidet: Task-mapping programming paradigm for deep learning tensor programs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 370–384, 2023.
  14. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  15. Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, et al. Optimal kernel orchestration for tensor programs with Korch. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
  16. Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Data movement is all you need: A case study on optimizing transformers. Proceedings of Machine Learning and Systems, 3:711–732, 2021.
  17. Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47–62, 2019.
  18. Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.
  19. Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3(71-104):3, 2014.
  20. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  21. Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
  22. Thirimadura Charith Yasendra Mendis. Towards automated construction of compiler optimizations. PhD thesis, Massachusetts Institute of Technology, 2020.
  23. Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. DNNFusion: accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 883–898, 2021.
  24. Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, and Bin Ren. SmartMem: Layout transformation elimination and adaptation for efficient DNN execution on mobile. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024.
  25. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023.
  26. Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
  27. Rya Sanovar, Srikant Bharadwaj, Renee St Amant, Victor Rühle, and Saravan Rajmohan. LeanAttention: Hardware-aware scalable attention mechanism for the decode-phase of transformers. arXiv preprint arXiv:2405.10480, 2024.
  28. Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 701–718, 2023.
  29. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
  30. Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
  31. Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. PET: Optimizing tensor programs with partially equivalent transformations and automated corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 37–54, 2021.
  32. Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 21–38, 2025.
  33. Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, et al. vTensor: Flexible virtual tensor management for efficient LLM serving. arXiv preprint arXiv:2407.15309, 2024.
  34. Jingling Xue. VTensor: Using virtual tensors to build a layout-oblivious AI programming framework. Journal of Computer Science and Technology, 38(5):1074–1097, 2023.
  35. Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3:255–268, 2021.
  36. Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005, 2025.
  37. Chen Zhang, Lingxiao Ma, Jilong Xue, Yining Shi, Ziming Miao, Fan Yang, Jidong Zhai, Zhi Yang, and Mao Yang. Cocktailer: Analyzing and optimizing dynamic control flow in deep learning. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 681–699, 2023.
  38. Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
  39. Yifan Zhao, Hashim Sharif, Vikram Adve, and Sasa Misailovic. Felix: Optimizing tensor programs with gradient descent. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 367–381, 2024.
  40. Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating high-performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 863–879, 2020.
  41. Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, et al. EINNET: Optimizing tensor programs with derivation-based transformations. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 739–755, 2023.
  42. Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 859–873, 2020.