
arxiv: 2605.25954 · v1 · pith:2AQ2445Mnew · submitted 2026-05-25 · 💻 cs.LG · cs.AI

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

Pith reviewed 2026-06-29 23:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords tensor program optimization · LLM-guided optimization · chain-of-thought reasoning · step-level supervision · intermediate representation · TVM TIR · dataset construction

The pith

Step-TP supplies step-level chain-of-thought supervision so language models can learn reliable single-step tensor program transformations instead of copying final outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Step-TP, a post-training dataset that gives LLMs grounded, atomic decisions for tensor program optimization. It pairs a token-efficient intermediate representation that lowers deterministically to TVM TIR with atomic, composable optimization strategies and explicit state transitions. Structured chain-of-thought traces record each IR-to-IR change, and strategy filtering balances coverage against shortcut learning. The central claim is that this closed loop over intermediate states produces training signals that let models perform reliable multi-step optimization where prior end-to-end datasets do not.

Core claim

Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation, through a token-efficient verifiable IR, atomic composable strategies, structured CoT supervision with explicit IR-to-IR transitions, and strategy filtering.

What carries the argument

The closed reasoning loop that couples atomic optimization strategies with structured CoT traces and deterministic IR-to-IR state transitions.
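
To make the loop concrete, a step-level record might be sketched as follows. This is an illustrative assumption, not Step-TP's actual schema: the field names (`ir_before`, `strategy`, `reasoning`, `ir_after`), the toy string IR, and the `is_closed_loop` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One atomic optimization decision (hypothetical schema, not Step-TP's)."""
    ir_before: str   # program state the model sees
    strategy: str    # name of the single atomic strategy applied
    reasoning: str   # chain-of-thought text justifying the choice
    ir_after: str    # program state produced by the strategy


def is_closed_loop(root_ir: str, steps: List[Step]) -> bool:
    """Return True if every step starts from the state the previous one produced."""
    current = root_ir
    for step in steps:
        if step.ir_before != current:
            return False
        current = step.ir_after
    return True


# Toy IR strings stand in for real programs.
trajectory = [
    Step("matmul(i,j,k)", "tile", "Tiling improves locality.", "matmul(io,ii,j,k)"),
    Step("matmul(io,ii,j,k)", "reorder", "Move the reduction axis inward.", "matmul(io,j,ii,k)"),
]
broken = [trajectory[1]]  # skips the first step, so it does not start at the root

print(is_closed_loop("matmul(i,j,k)", trajectory))  # True
print(is_closed_loop("matmul(i,j,k)", broken))      # False
```

The check only confirms that states chain together; whether each transition is a legal application of its strategy would need a separate verifier.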

If this is right

  • Models trained on Step-TP can generate interpretable single-step decisions that compose into longer trajectories.
  • The token-efficient IR reduces context length while remaining verifiable against TVM TIR.
  • Strategy filtering prevents the model from learning to skip hard steps while still covering diverse optimization paths.
  • The dataset supports iterative refinement because each step is paired with an explicit state change and reasoning trace.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the loop works, it may generalize to other compiler or hardware-mapping tasks that currently rely on end-to-end imitation.
  • The same atomic-strategy design could be applied to search spaces in scheduling or memory layout where combinatorial explosion is the main barrier.
  • Testing whether the filtered dataset still covers rare but high-value transformations would be a direct next measurement.

Load-bearing premise

Atomic and composable optimization strategies paired with the chosen IR and filtering will produce training signals that let models make reliable single-step decisions without exploiting shortcuts or losing coverage.

What would settle it

An LLM trained on Step-TP is evaluated on held-out optimization trajectories; if it consistently fails to reach competitive performance on programs that require more than three steps or exhibits the same shortcut patterns seen in outcome-imitation baselines, the claim is falsified.
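
One way to operationalize this test is to bucket held-out results by trajectory length and compare success rates. The sketch below assumes a hypothetical list of `(num_steps, succeeded)` pairs; the three-step threshold comes from the text above, while the data and the `success_by_length` function are invented for illustration.

```python
from typing import Dict, Iterable, Tuple


def success_by_length(
    results: Iterable[Tuple[int, bool]], threshold: int = 3
) -> Dict[str, float]:
    """Success rate for trajectories at or below vs. above `threshold` steps."""
    counts = {"short": [0, 0], "long": [0, 0]}  # [successes, total] per bucket
    for num_steps, succeeded in results:
        bucket = "long" if num_steps > threshold else "short"
        counts[bucket][0] += int(succeeded)
        counts[bucket][1] += 1
    return {
        name: (wins / total if total else float("nan"))
        for name, (wins, total) in counts.items()
    }


# Hypothetical held-out results: (steps required, whether the model succeeded).
held_out = [(1, True), (2, True), (3, False), (4, True), (5, False), (6, False)]
rates = success_by_length(held_out)
print(rates)
```

A large gap between the two buckets would point toward the failure mode described above; what counts as "competitive" still has to be fixed against a baseline.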

Figures

Figures reproduced from arXiv: 2605.25954 by Chuan Wu, Da Zheng, Junwei Su, Mengfan Liu.

Figure 1: Comparison of the same tensor program represen…
Figure 2: Pipeline of dataset construction.
… attention, multi-group attention). Based on these building blocks, we further construct composite programs by sampling and assembling 2-5 components from the aforementioned categories, covering the majority of KernelBench [30] level-2 programs and additional randomly composed cases. To further enhance data diversity, we randomize the input and output shapes (e.g., with di…
Figure 3: Single-answer and multi-answer examples for Mat…
Figure 4: Token consumption of 6335 tensor programs across…
Original abstract

Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decision process, but existing datasets provide only end-to-end optimized program pairs using token-inefficient representations, lacking verifiable step-level supervision and interpretability. As a result, LLMs struggle to make reliable single-step decisions in large combinatorial optimization spaces. We introduce Step-TP, a post-training dataset for tensor program optimization that provides grounded, atomic, step-level supervision with structured chain-of-thought (CoT) reasoning. Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation. Its design is guided by four principles: (i) a token-efficient, verifiable intermediate representation (IR) that deterministically lowers to TVM TIR; (ii) atomic and composable optimization strategies that decompose complex trajectories into interpretable single-step decisions; (iii) structured CoT supervision coupled with explicit IR-to-IR state transitions; and (iv) strategy filtering to balance coverage while preventing shortcut exploitation. The dataset and implementation are available at a GitHub link, https://github.com/LIUMENGFAN-gif/StepTP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Step-TP, a post-training dataset for tensor program optimization. It supplies grounded, atomic, step-level supervision with structured chain-of-thought reasoning over intermediate program states. The dataset is constructed according to four design principles: a token-efficient verifiable IR that deterministically lowers to TVM TIR, atomic and composable optimization strategies, explicit IR-to-IR state transitions paired with CoT, and strategy filtering to balance coverage while avoiding shortcut exploitation. The central claim is that this construction forms a closed reasoning loop enabling reliable multi-step LLM-guided optimization rather than end-to-end outcome imitation.

Significance. If the described construction is sound and the released implementation functions as stated, Step-TP would constitute a useful addition to the resources available for LLM-guided program optimization. It directly targets the gap between existing end-to-end program-pair datasets and the need for interpretable, verifiable step-level signals in a combinatorial domain. The public release of both the dataset and the implementation code is a concrete strength for reproducibility.

major comments (2)
  1. [Abstract / design principles] Abstract (design principles section): the claim that the chosen IR 'deterministically lowers to TVM TIR' and thereby supplies 'verifiable supervision' is load-bearing for the entire contribution, yet the manuscript provides no verification, error rates, sample lowering traces, or correctness checks on the lowering procedure. Without such evidence the 'verifiable' property remains an untested assertion.
  2. [Abstract / design principles] Abstract (design principles section): the assertion that 'strategy filtering' simultaneously achieves coverage and prevents shortcut exploitation is central to the claim of reliable single-step decisions, but no coverage metrics, filtering statistics, or concrete examples of filtered versus retained trajectories are supplied to substantiate the balance.
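
A determinism check of the kind the first comment asks for could be as simple as lowering each program several times and comparing digests. The sketch below is an assumption-laden illustration: `toy_lower` is a stand-in for the paper's IR-to-TIR lowering (it does not call TVM), and `is_deterministic` and `flaky_lower` are hypothetical helpers.

```python
import hashlib
import itertools
from typing import Callable


def is_deterministic(lower: Callable[[str], str], ir: str, runs: int = 5) -> bool:
    """Lower `ir` repeatedly and report whether every output is byte-identical."""
    digests = {
        hashlib.sha256(lower(ir).encode("utf-8")).hexdigest() for _ in range(runs)
    }
    return len(digests) == 1


def toy_lower(ir: str) -> str:
    """Stand-in lowering: a pure function of its input (NOT the real TVM lowering)."""
    return "\n".join(f"stmt {i}: {tok}" for i, tok in enumerate(ir.split()))


_fresh_id = itertools.count()


def flaky_lower(ir: str) -> str:
    """Deliberately non-deterministic lowering: appends a fresh id on every call."""
    return toy_lower(ir) + f"\n# tmp_{next(_fresh_id)}"


print(is_deterministic(toy_lower, "tile reorder vectorize"))    # True
print(is_deterministic(flaky_lower, "tile reorder vectorize"))  # False
```

Byte-identical output demonstrates determinism only. Semantic correctness of the lowered program would need a separate check, for example comparing numerical outputs against a reference.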

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments identifying two areas where the manuscript lacks supporting evidence for central claims. We agree with both points and will revise the paper to supply the requested verification, metrics, and examples. This addresses the major revision recommendation.

Point-by-point responses
  1. Referee: [Abstract / design principles] Abstract (design principles section): the claim that the chosen IR 'deterministically lowers to TVM TIR' and thereby supplies 'verifiable supervision' is load-bearing for the entire contribution, yet the manuscript provides no verification, error rates, sample lowering traces, or correctness checks on the lowering procedure. Without such evidence the 'verifiable' property remains an untested assertion.

    Authors: We agree that the manuscript currently provides no explicit verification of the lowering procedure, error rates, or sample traces. This is a substantive gap given the centrality of the 'verifiable supervision' claim. In the revised version we will add: (1) sample IR-to-TVM-TIR lowering traces, (2) quantitative error rates from validation runs on held-out programs, and (3) a description of the automated correctness checks performed during dataset generation.

    revision: yes

  2. Referee: [Abstract / design principles] Abstract (design principles section): the assertion that 'strategy filtering' simultaneously achieves coverage and prevents shortcut exploitation is central to the claim of reliable single-step decisions, but no coverage metrics, filtering statistics, or concrete examples of filtered versus retained trajectories are supplied to substantiate the balance.

    Authors: We acknowledge the absence of coverage metrics, filtering statistics, and concrete examples in the current manuscript. The revision will incorporate: (1) coverage metrics showing strategy distribution before and after filtering, (2) filtering statistics (e.g., fraction of trajectories removed and primary removal criteria), and (3) side-by-side examples of filtered versus retained trajectories to illustrate how the procedure balances coverage against shortcut prevention.

    revision: yes
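
The promised coverage and filtering statistics could be computed along the following lines. This is a minimal sketch under stated assumptions: trajectories are modeled as lists of strategy names, the filter (dropping single-step trajectories) is a hypothetical placeholder rather than the paper's criterion, and `strategy_share` is an invented helper.

```python
from collections import Counter
from typing import Dict, List, Sequence


def strategy_share(trajectories: Sequence[List[str]]) -> Dict[str, float]:
    """Fraction of all steps that use each strategy."""
    counts = Counter(s for traj in trajectories for s in traj)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# Hypothetical trajectories, each a list of strategy names.
before = [["tile", "reorder"], ["tile"], ["fuse", "tile"], ["vectorize"]]
# Hypothetical filter (an assumption, not the paper's rule): drop one-step trajectories.
after = [t for t in before if len(t) > 1]

removed_fraction = 1 - len(after) / len(before)
lost = set(strategy_share(before)) - set(strategy_share(after))

print(removed_fraction)  # 0.5
print(sorted(lost))      # ['vectorize']
```

Reporting the set of strategies that vanish after filtering, alongside the share shift for those that remain, would speak directly to the coverage half of the referee's concern.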

Circularity Check

0 steps flagged

No significant circularity; dataset construction paper with no derivations or self-referential predictions

Full rationale

The paper introduces a new dataset (Step-TP) for LLM-guided tensor program optimization, guided by four explicit design principles for IR, atomic strategies, CoT supervision, and filtering. No mathematical derivations, equations, fitted parameters, or predictions are present in the abstract or described contribution. The central claim is the creation and release of step-level training data rather than any result that reduces to prior fitted values or self-citations by construction. This matches the default expectation for non-circular dataset papers; the reader's assessment of score 1.0 is consistent with minor self-citation tolerance but no load-bearing circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a dataset introduction rather than a derivation; it relies on standard assumptions about program lowering and LLM training but introduces no free parameters, new axioms, or invented entities.



Reference graph

Works this paper leans on

60 extracted references · 21 canonical work pages · 7 internal anchors

  1. [1]

    Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.

  2. [2]

    TensorFlow: a system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265–283

  3. [3]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  4. [4]

    Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A polyhedral compiler for expressing fast and portable code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 193–205

  5. [5]

    Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti.

  6. [6]

    Kevin: Multi-Turn RL for Generating CUDA Kernels. arXiv preprint arXiv:2507.11948 (2025)

  7. [7]

    Tyler A Chang and Benjamin K Bergen. 2024. Language model behavior: A comprehensive survey. Computational Linguistics 50, 1 (2024), 293–350

  8. [8]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594

  9. [9]

    Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. Advances in Neural Information Processing Systems 31 (2018)

  10. [10]

    Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, et al. 2023. Large language models for compiler optimization. arXiv preprint arXiv:2309.07062 (2023)

  11. [11]

    Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. 2024. Meta large language model compiler: Foundation models of compiler optimization. arXiv preprint arXiv:2407.02524 (2024)

  12. [12]

    Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)

  13. [13]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

  14. [14]

    Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, and Shuang Yang. 2025. Stark: Strategic team of agents for refining kernels. arXiv preprint arXiv:2510.16996 (2025)

  15. [15]

    Jingzhi Fang, Yanyan Shen, Yue Wang, and Lei Chen. 2021. ETO: Accelerating optimization of DNN operators by high-performance tensor program reuse. Proceedings of the VLDB Endowment 15, 2 (2021), 183–195

  16. [16]

    Pratik Fegade, Tianqi Chen, Phillip B Gibbons, and Todd C Mowry. 2024. ACRoBat: Optimizing auto-batching of dynamic deep learning at compile time. Proceedings of Machine Learning and Systems 6 (2024), 14–30

  17. [17]

    Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, et al. 2023. Tensorir: An abstraction for automatic tensorized program optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 804–817

  18. [18]

    Junfeng Gong, Zhiyi Wei, Junying Chen, Cheng Liu, and Huawei Li. 2025. From large to small: Transferring cuda optimization expertise via reasoning graph. arXiv preprint arXiv:2510.19873 (2025)

  19. [19]

    Hanpeng Hu, Junwei Su, Juntao Zhao, Yanghua Peng, Yibo Zhu, Haibin Lin, and Chuan Wu. 2024. CDMPP: A device-model agnostic framework for latency prediction of tensor programs. In Proceedings of the Nineteenth European Conference on Computer Systems. 1054–1074

  20. [20]

    Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 47–62

  21. [21]

    Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2019. Optimizing DNN computation with relaxed graph substitutions. Proceedings of Machine Learning and Systems 1 (2019), 27–39

  22. [22]

    Hyeonjin Kim, Sungwoo Ahn, Yunho Oh, Bogil Kim, Won Woo Ro, and William J Song. 2020. Duplo: Lifting redundant memory accesses of deep neural networks for gpu tensor cores. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 725–737

  23. [23]

    Lingcheng Kong, Jiateng Wei, Hanzhang Shen, and Huan Wang. 2025. Concur: Conciseness makes state-of-the-art kernel generation. arXiv preprint arXiv:2510.07356 (2025)

  24. [24]

    Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long. 2022. Automatic horizontal fusion for GPU kernels. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 14–27

  25. [25]

    Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. 2024. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060 (2024)

  26. [26]

    Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. 2025. Cuda-l1: Improving cuda optimization via contrastive reinforcement learning. arXiv preprint arXiv:2507.14111 (2025)

  27. [27]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  28. [28]

    Mengfan Liu, Wei Wang, and Chuan Wu. 2025. Optimizing distributed deployment of mixture-of-experts model inference in serverless computing. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications. IEEE, 1–10

  29. [29]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics 12 (2024), 157–173

  30. [30]

    Massinissa Merouani, Afif Boudaoud, and Riyadh Baghdadi. 2025. Looperset: A large-scale dataset for data-driven polyhedral compiler optimization. arXiv preprint arXiv:2510.10209 (2025)

  31. [31]

    Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867 (2018)

  32. [32]

    Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs write efficient GPU kernels? URL https://arxiv.org/abs/2502.10517 (2025)

  33. [33]

    Mangpo Phothilimthana, Sami Abu-El-Haija, Kaidi Cao, Bahare Fatemi, Michael Burrows, Charith Mendis, and Bryan Perozzi. 2023. Tpugraphs: A performance prediction dataset on large tensor computational graphs. Advances in Neural Information Processing Systems 36 (2023), 70355–70375

  34. [34]

    Guicheng Qi, Junwei Su, Liqi Yang, Tao Li, Tingwen Xie, Yerui Sun, Yuchen Xie, and Chuan Wu. 2026. HetAuto: Cross-Cluster Auto-Parallelism for Heterogeneous Distributed Training. In Proceedings of the 21st European Conference on Computer Systems. 759–779

  35. [35]

    Daniel Snider and Ruofan Liang. 2023. Operator fusion in XLA: analysis and evaluation. arXiv preprint arXiv:2301.13062 (2023)

  36. [36]

    Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, and Chris Shum. 2025. CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning. arXiv preprint arXiv:2512.02551 (2025)

  37. [37]

    Annabelle Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, and Hadi Esmaeilzadeh. 2025. REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  38. [38]

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19

  39. [39]

    Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018)

  40. [40]

    Vasily Volkov and James W Demmel. 2008. Benchmarking GPUs to tune dense linear algebra. In SC’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, 1–11

  41. [41]

    Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. 2021. PET: Optimizing tensor programs with partially equivalent transformations and automated corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 37–54

  42. [42]

    Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, et al. 2025. TileLang: A Composable Tiled Programming Model for AI Systems. arXiv preprint arXiv:2504.17577 (2025)

  43. [43]

    Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, et al. 2024. Ladder: Enabling efficient Low-Precision deep learning computing through hardware-aware tensor transformation. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 307–323

  44. [44]

    Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Zhijao Jia, Dheevatsa Mudigere, Ying Zhang, Anthony Kewitsch, and Manya Ghobadi. 2022. Topoopt: Optimizing the network topology for distributed dnn training. arXiv preprint arXiv:2202.00433 (2022)

  45. [45]

    Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, and Youngsuk Park.

  46. [46]

    Tritonrl: Training llms to think and code triton without cheating. arXiv preprint arXiv:2510.17891 (2025)

  47. [47]

    Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. 2025. Mirage: A Multi-Level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 21–38

  48. [48]

    Haofeng Xu, Junwei Su, Yukun Tian, Lansong Diao, Zhengping Qian, and Chuan Wu. 2026. GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control. arXiv preprint arXiv:2603.01501 (2026)

  49. [49]

    Zi Yang, Lei Qiu, Fang Lyu, Ming Zhong, Zhilei Chai, Haojie Zhou, Huimin Cui, and Xiaobing Feng. [n. d.]. IR-OptSet: An Optimization-Sensitive Dataset for Advancing LLM-Based IR Optimizer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  50. [50]

    Yi Zhai, Sijia Yang, Keyu Pan, Renwei Zhang, Shuo Liu, Chao Liu, Zichun Ye, Jianmin Ji, Jie Zhao, Yu Zhang, et al. 2024. Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 289–305

  51. [51]

    Yi Zhai, Yu Zhang, Shuo Liu, Xiaomeng Chu, Jie Peng, Jianmin Ji, and Yanyong Zhang. 2023. Tlp: A deep learning-based cost model for tensor program tuning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 833–845

  52. [52]

    Jie Zhao, Xiong Gao, Ruijie Xia, Zhaochuang Zhang, Deshi Chen, Lei Chen, Renwei Zhang, Zhen Geng, Bin Cheng, and Xuefeng Jin. 2022. Apollo: Automatic partition-based operator fusion through layer by layer optimization. Proceedings of Machine Learning and Systems 4 (2022), 1–19

  53. [53]

    Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, et al. 2021. AKG: automatic kernel generation for neural processing units using polyhedral transformations. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 1233–1248

  54. [54]

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating High-Performance tensor programs for deep learning. In 14th USENIX symposium on operating systems design and implementation (OSDI 20). 863–879

  55. [55]

    Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E Gonzalez, Ion Stoica, and Ameer Haj Ali. 2021. Tenset: A large-scale program performance dataset for learned tensor compilers. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)

  56. [56]

    Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, et al. 2023. EINNET: Optimizing tensor programs with Derivation-Based transformations. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 739–755

  57. [57]

    Size Zheng, Siyuan Chen, Siyuan Gao, Liancheng Jia, Guangyu Sun, Runsheng Wang, and Yun Liang. 2023. Tileflow: A framework for modeling fusion dataflow via tree-based analysis. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 1271–1288

  58. [58]

    Yuchen Zhong, Junwei Su, Chuan Wu, and Minjie Wang. 2025. Heta: Distributed Training of Heterogeneous Graph Neural Networks. Proceedings of the VLDB Endowment 18, 9 (2025), 2790–2803

  59. [59]

    Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, et al. 2022. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 233–248

  60. [60]

    Before you suggest each new transformation, you MUST identify what has been changed in the current IR compared to the root IR. For each new optimization, you MUST build it ON TOP OF these existing changes, namely ON TOP OF the current IR. The strategies MUST be used on the current IR! You MUST compare your modified parts in each transformed IR with th...