pith. machine review for the scientific record.

arxiv: 2604.15272 · v1 · submitted 2026-04-16 · 💻 cs.PL · cs.AI · cs.LG

Recognition: unknown

Prism: Symbolic Superoptimization of Tensor Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:00 UTC · model grok-4.3

classification 💻 cs.PL · cs.AI · cs.LG
keywords symbolic · prism · programs · search · tensor · times · best · optimization

The pith

Prism is the first symbolic superoptimizer for tensor programs: it combines sGraph, a compact representation of program families, with a two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2× speedup over prior superoptimizers on LLM workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tensor programs are sequences of operations, such as matrix multiplications, that form the backbone of modern machine learning models, especially large language models. Finding the fastest way to execute these programs on specific hardware is challenging because there are many possible implementations, each with different performance characteristics. Traditional approaches either use compilers that apply fixed rules or superoptimizers that search exhaustively but struggle to scale.

Prism addresses this by introducing sGraph, a symbolic representation that can stand for many different concrete programs at once by treating some parameters symbolically rather than fixing them early. The system performs a two-level search: it first generates symbolic graphs that cover families of possible programs, then uses symbolic reasoning to eliminate provably bad choices based on operator semantics, mathematical identities, and hardware limits. Equivalence between different representations is checked using e-graph rewriting, which efficiently tracks sets of equivalent expressions. Finally, promising symbolic graphs are turned into actual code through auto-tuning.

This combination allows Prism to explore the optimization space more intelligently than previous methods. On five standard workloads from large language models, it found implementations that ran up to 2.2 times faster than those of the best existing superoptimizers and 4.9 times faster than those of standard compilers, while also cutting the time spent on optimization itself by up to 3.4 times.
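To make the two-level structure concrete, here is a minimal sketch of the search skeleton in Python. It is not the authors' implementation: the `SGraph` class, the shared-memory constraint, and all candidate values are invented for illustration, and the paper's pruning reasons over operator semantics and algebraic identities as well as hardware limits.

```python
# Minimal sketch of a two-level symbolic search in the spirit of Prism.
# SGraph, the constraint, and the parameter values are illustrative only.
from dataclasses import dataclass
from itertools import product

@dataclass
class SGraph:
    """A program family: operators fixed, some execution parameters symbolic."""
    ops: tuple                 # e.g. ("matmul", "softmax")
    params: dict               # symbolic parameter -> candidate values

def survives_pruning(sg: SGraph, smem_limit: int = 48 * 1024) -> bool:
    """Level 1: symbolic reasoning rejects whole families at once.
    Toy hardware constraint: even the smallest fp32 tile must fit in
    shared memory."""
    smallest = min(sg.params["tile"])
    return smallest * smallest * 4 <= smem_limit

def instantiate(sg: SGraph):
    """Level 2: enumerate concrete programs from a surviving family."""
    names = list(sg.params)
    for values in product(*(sg.params[n] for n in names)):
        yield dict(zip(names, values))

families = [
    SGraph(("matmul", "softmax"), {"tile": (32, 64), "stages": (2, 3)}),
    SGraph(("matmul", "softmax"), {"tile": (4096,), "stages": (2,)}),  # pruned
]
for sg in families:
    if survives_pruning(sg):
        for candidate in instantiate(sg):
            pass  # auto-tune / benchmark each concrete candidate here
```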

Core claim

Prism achieves up to 2.2× speedup over the best superoptimizers and 4.9× over the best compiler-based approaches, while reducing end-to-end optimization time by up to 3.4× on five LLM workloads.

Load-bearing premise

That the symbolic reasoning over operator semantics, algebraic identities, and hardware constraints can correctly and completely prune provably suboptimal regions of the search space without excluding optimal implementations.
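To see why this premise is load-bearing, consider what "provably suboptimal" has to mean for pruning to be sound: a whole family may be discarded only if a valid lower bound on its cost already exceeds the cost of some known candidate. The sketch below uses an invented numeric cost model purely to illustrate the bound-versus-estimate distinction; Prism's actual reasoning is over semantics, identities, and hardware constraints, not a cost model like this.

```python
# Toy illustration of sound pruning. The cost model and bounds are invented.

def cost(tile: int, n: int = 4096) -> float:
    """Invented cost: a traffic term that falls with tile size plus an
    overhead term that grows with it."""
    return n * n / tile + 0.01 * tile * tile

def family_lower_bound(tiles: range, n: int = 4096) -> float:
    """A valid lower bound over the whole family: bound each term separately.
    Underestimating is safe; overestimating could prune away the optimum."""
    return n * n / max(tiles) + 0.01 * min(tiles) ** 2

best_known = cost(1024)          # some concrete candidate already benchmarked
tiny_tiles = range(2, 5)         # a symbolic family of tile sizes {2, 3, 4}
if family_lower_bound(tiny_tiles) >= best_known:
    print("family pruned: provably no better than the best known candidate")
```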

Figures

Figures reproduced from arXiv: 2604.15272 by Mengdi Wu, Oded Padon, Xiaoyu Jiang, Zhihao Jia.

Figure 1. Graph representations of a fused Softmax-Matmul operation. (a) The input computation graph. (b) A concrete μGraph with specific mappings and parallelization parameters. (c) The symbolic graph (sGraph), where mappings and dimensions are represented as symbolic variables. For a graph with a data dimension $d$ of original size $D$, the per-block, per-iteration size of $d$ is $D / \sigma(T, d)$, where $\sigma(T, d) = \prod_{p \in \mathcal{P}} m_{T,d,p}$ … view at source ↗
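The caption's formula is simple enough to compute directly. In the hypothetical encoding below (the mapping layout and names are ours, not the paper's), `mapping[(d, p)]` holds $m_{T,d,p}$, the factor by which dimension $d$ is split across parallel dimension $p$:

```python
# Per-block, per-iteration size of a data dimension, per the caption's formula.
from math import prod

def sigma(mapping: dict, d: str) -> int:
    """sigma(T, d) = product over parallel dims p of m_{T,d,p}."""
    return prod(m for (dim, _p), m in mapping.items() if dim == d)

def per_block_size(D: int, mapping: dict, d: str) -> int:
    """Per-block, per-iteration size of d: D / sigma(T, d)."""
    s = sigma(mapping, d)
    assert D % s == 0, "d must divide evenly across its parallel dimensions"
    return D // s

# A 4096-wide dimension split 8 ways over blocks (x) and 4 ways over
# iterations (i): each block sees 4096 / (8 * 4) = 128 elements per step.
print(per_block_size(4096, {("d", "x"): 8, ("d", "i"): 4}, "d"))  # 128
```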
Figure 2. Overview of the Prism pipeline. sGraph Generation (§3): exhaustive search builds sGraphs with symbolic mappings; dimension matching and expression-guided pruning eliminate invalid branches. Mapping Instantiation (§3.4): enumerates candidate concrete mapping assignments satisfying all constraints. sGraph Verification (§4): equivalence checking using rewrite axioms. Parameter Instantiation (§5): random sampl… view at source ↗
Figure 3. Tensor representation with parallelization dimensions. Symmetry breaking: different mapping assignments may yield functionally identical μGraphs when they differ only by a permutation of dimensions in $\mathcal{P}_g$. To eliminate redundant verification, Prism retains only the lexicographically smallest assignment within each equivalence class, reducing the number of candidates by up to a factor of $k!$ for $k = |\mathcal{P}_g|$ … view at source ↗
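The symmetry-breaking rule in this caption is easy to state in code. A hedged sketch under an invented encoding, in which an assignment is a tuple indexed by the parallel dimensions in $\mathcal{P}_g$, so permutation-equivalent assignments collapse to one lexicographic representative:

```python
# Keep only the lexicographically smallest member of each permutation class.
from itertools import permutations

def canonical(assignment: tuple) -> tuple:
    """Lexicographically smallest permutation-equivalent assignment."""
    return min(permutations(assignment))

candidates = [("a", "b", "c"), ("c", "a", "b"), ("b", "c", "a"), ("a", "c", "b")]
survivors = {canonical(c) for c in candidates}
print(survivors)  # {('a', 'b', 'c')}: up to k! = 6 candidates collapse to one
```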
Figure 4. Parallel operators used in sGraph verification. For each operator, we derive its output expression from its input expressions and the operator semantics. For most operators (e.g., elementwise or matmul), this is straightforward. The key operators are the InputLoader and OutputSaver in the block graph: the InputLoader applies part or repl according to the imap (partitioning the tensor along the mapped dimension, or… view at source ↗
Figure 5. An example of expression derivation. The kernel graph has a single CustomOp that applies elementwise exponentiation with imap {r ↔ x} and omap {r ↔ x}. The Input operator partitions the input variable $v_I$ along the row dimension over parallel dimension $x$, yielding part($v_I$, $r$, $x$). After applying Exp, the OutputSaver combines the result, producing the final expression comb(exp(part($v_I$, $r$, $x$)), $r$, $x$). view at source ↗
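A small sketch of the derivation the caption walks through, with expressions as nested tuples. Here `part`, `comb`, and `exp_op` are symbolic constructors invented to mirror the paper's notation; they build expression trees rather than executing tensor operations.

```python
# Deriving the caption's final expression step by step.
def part(expr, dim, pdim):   # InputLoader: partition expr along dim over pdim
    return ("part", expr, dim, pdim)

def comb(expr, dim, pdim):   # OutputSaver: recombine partitions along dim
    return ("comb", expr, dim, pdim)

def exp_op(expr):            # the CustomOp: elementwise exponentiation
    return ("exp", expr)

v_I = ("var", "v_I")
loaded = part(v_I, "r", "x")        # imap {r <-> x}
computed = exp_op(loaded)           # elementwise, so the layout is unchanged
final = comb(computed, "r", "x")    # omap {r <-> x}
print(final)  # ('comb', ('exp', ('part', ('var', 'v_I'), 'r', 'x')), 'r', 'x')
```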
Figure 6. Kernel performance and optimization time across 5 workloads. Upper: relative kernel execution time. Lower: total optimization time breakdown. Mirage's time includes graph generation and profiling; Prism's time includes symbolic graph generation and instantiation; TVM's time is the Ansor auto-tuning time (1000 trials). view at source ↗
Original abstract

This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to $2.2\times$ speedup over best superoptimizers and $4.9\times$ over best compiler-based approaches, while reducing end-to-end optimization time by up to $3.4\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach relies on domain assumptions about tensor operator semantics and algebraic identities for pruning; it introduces sGraph as a new representation, with no free parameters or fitted values mentioned in the abstract.

axioms (1)
  • domain assumption: Tensor operator semantics and algebraic identities can be symbolically modeled for equivalence verification and pruning of suboptimal programs.
    Invoked in the description of symbolic graph generation and e-graph rewriting for structured pruning; a toy sketch of the e-graph machinery follows this ledger.
invented entities (1)
  • sGraph · no independent evidence
    purpose: Compact symbolic hierarchical representation that encodes large classes of tensor programs by keeping some execution parameters symbolic.
    New representation introduced as the core of the two-level search approach.
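Since e-graph rewriting carries the equivalence-verification burden, here is a toy e-graph showing the mechanism the axiom relies on: merging two terms (one rewrite-axiom application) propagates, by congruence, to every expression built on top of them. This is a drastically simplified sketch in the spirit of equality saturation, with no pattern-based e-matching; all names are ours, not Prism's.

```python
# Toy e-graph: union-find over e-classes plus a hashcons table, with a
# congruence-closure rebuild. Illustration only.

class EGraph:
    def __init__(self):
        self.parent = []   # union-find: parent[i] is the parent of e-class i
        self.table = {}    # (op, canonical child classes) -> e-class id

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, op, *children):
        key = (op, tuple(self.find(c) for c in children))
        if key not in self.table:
            self.parent.append(len(self.parent))          # fresh e-class
            self.table[key] = len(self.parent) - 1
        return self.find(self.table[key])

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)
        # Rebuild congruence: nodes whose children became equal must merge.
        while True:
            merged = False
            canon = {}
            for (op, kids), cid in self.table.items():
                key = (op, tuple(self.find(k) for k in kids))
                if key in canon and self.find(canon[key]) != self.find(cid):
                    self.parent[self.find(cid)] = self.find(canon[key])
                    merged = True
                canon.setdefault(key, cid)
            if not merged:
                self.table = canon
                return

# Prove exp((a*b)*c) == exp(a*(b*c)) from one associativity application:
eg = EGraph()
a, b, c = eg.add("a"), eg.add("b"), eg.add("c")
ab_c = eg.add("*", eg.add("*", a, b), c)
a_bc = eg.add("*", a, eg.add("*", b, c))
e1 = eg.add("exp", ab_c)   # built before the merge
e2 = eg.add("exp", a_bc)
eg.merge(ab_c, a_bc)       # one associativity-axiom application
print(eg.find(e1) == eg.find(e2))  # True: congruence merged the exp nodes
```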

pith-pipeline@v0.9.0 · 5474 in / 1459 out tokens · 54652 ms · 2026-05-10T09:00:24.262403+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1]

    XLA: Optimizing Compiler for TensorFlow.https : //www

    2017. XLA: Optimizing Compiler for TensorFlow.https : //www. tensorflow.org/xla

  2. [2]

    Transformer related optimizations.https://github.com/NVIDIA/ FasterTransformer

    2020. Transformer related optimizations.https://github.com/NVIDIA/ FasterTransformer

  3. [3]

    Flash-Decoding for long-context inference.https : //crfm

    2023. Flash-Decoding for long-context inference.https : //crfm. stanford.edu/2023/10/12/flashdecoding.html

  4. [4]

    NVIDIA H100 Tensor Core GPU.https://www.nvidia.com/en- us/data- center/h100/

    2023. NVIDIA H100 Tensor Core GPU.https://www.nvidia.com/en- us/data- center/h100/

  5. [5]

    Murray, Benoit Steiner, Paul Tucker, Vijay Va- sudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

    Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv- ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Va- sudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

  6. [6]

    In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI)

    TensorFlow: A System for Large-Scale Machine Learning.. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI)

  7. [7]

    Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan- Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe

  8. [8]

    InProceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT)

    OpenTuner: An Extensible Framework for Program Autotun- ing. InProceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT). ACM

  9. [9]

    Sorav Bansal and Alex Aiken. 2006. Automatic Generation of Peephole Superoptimizers. InProceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems(San Jose, California, USA)(ASPLOS XII)

  10. [10]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dal- las Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, St...

  11. [11]

    Available: https://doi.org/10.48550/arXiv.1802.04799

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning.CoRRabs/1802.04799 (2018).http : //arxiv.org/abs/1802. 04799

  12. [12]

    Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to Optimize Tensor Programs. InAdvances in Neural Infor- mation Processing Systems 31

  13. [13]

    Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Co- hen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning.CoRRabs/1410.0759 (2014). http://arxiv.org/abs/1410.0759

  14. [14]

    Dense Linear Algebra on GPUs.https : //developer

    cuBLAS 2016. Dense Linear Algebra on GPUs.https : //developer. nvidia.com/cublas

  15. [15]

    Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. 2023. Flash-Decoding for Long-Context Inference

  16. [16]

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. 2024. FlashDe- coding++: Faster Large Language Model Inference on GPUs. arXiv:2311.01282 [cs.LG]

  17. [17]

    Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, and Zhihao Jia. 2024. Optimal Kernel Orchestration for Tensor Programs with Korch. InProceedings of the 29th ACM International Conference on Architectural Support for Pro- gramming Languages and Operating Sy...

  18. [18]

    Ganger, Tianqi Chen, and Zhihao Jia

    Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, and Zhihao Jia. 2025. GraphPipe: Improving Performance and Scalabil- ity of DNN Training with Graph Pipeline Parallelism. InProceedings of the 30th ACM International C...

  19. [19]

    Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Za- haria, and Alex Aiken. 2019. TASO: Optimizing Deep Learning Com- putation with Automatic Generation of Graph Substitutions. InPro- ceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada)(SOSP ’19). Association for Computing Machinery, New York, NY, US...

  20. [20]

    Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. InProceedings of the 2nd Conference on Systems and Machine Learning (SysML’19)

  21. [21]

    Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA Tensor Core Programmabil- ity, Performance & Precision. In2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. doi:10.1109/ipdpsw.2018.00091

  22. [22]

    Henry Massalin. 1987. Superoptimizer: a look at the smallest program. InACM SIGARCH Computer Architecture News, Vol. 15

  23. [23]

    Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan- Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines.ACM Trans. Graph.35, 4 (2016)

  24. [24]

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Ko- zlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. 2025. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131(2025)

  25. [25]

    Jongseok Park, Kyungmin Bin, Gibum Park, Sangtae Ha, and Kyung- han Lee. 2023. ASPEN: Breaking Operator Barriers for Efficient Paral- lelization of Deep Neural Networks. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 68625– 68638.https : //p...

  26. [26]

    Tensors and Dynamic neural networks in Python with strong GPU acceleration.https://pytorch.org

    PyTorch 2017. Tensors and Dynamic neural networks in Python with strong GPU acceleration.https://pytorch.org

  27. [27]

    Eric Schkufza, Rahul Sharma, and Alex Aiken. 2013. Stochastic super- optimization. InACM SIGPLAN Notices, Vol. 48

  28. [28]

    Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Zim- ing Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling Deep Learning Memory Access via Tile-graph. In17th USENIX Symposium on Operating Systems Design and Implementa- tion (OSDI 23). USENIX Association, Boston, MA, 701–718.https : //www.usenix.org/conference/osdi23/presentation/shi

  29. [29]

    NVIDIA TensorRT: Programmable Inference Acceler- ator.https://developer.nvidia.com/tensorrt

    TensorRT 2017. NVIDIA TensorRT: Programmable Inference Acceler- ator.https://developer.nvidia.com/tensorrt

  30. [30]

    McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken

    Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Patrick S. McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 2022. Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parall...

  31. [31]

    Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. 2021. PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. In15th USENIX Sym- posium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 37–54.https : ...

  32. [32]

    Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. Astra: A Multi-Agent System for GPU Kernel Performance Optimization. arXiv:2509.07506 [cs.DC]https://arxiv.org/abs/2509.07506

  33. [33]

    Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummenhofer. 2026. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization. arXiv:2603.12440 [cs.DC] https://arxiv.org/abs/2603.12440

  34. [34]

    Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. 2021. egg: Fast and Extensible Equality Saturation.Proc. ACM Program. Lang.5, POPL, Article 23 (Jan. 2021), 29 pages. doi:10.1145/3434304

  35. [35]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Syl- vain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush

  36. [36]

    Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.https://github.com/huggingface/transformers

  37. [37]

    Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia

  38. [38]

    InProceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

    Mirage: A Multi-Level Superoptimizer for Tensor Programs. InProceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association

  39. [39]

    Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. 2021. Equality Saturation for Ten- sor Graph Superoptimization.Proceedings of Machine Learning and Systems3 (March 2021), 255–268

  40. [40]

    Ansor: Generating high-performance tensor programs for deep learning,

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor : Generat- ing High-Performance Tensor Programs for Deep Learning.CoRR abs/2006.06762 (2020). arXiv:2006.06762https://arxiv.org/abs/2006. 06762

  41. [41]

    Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, and Zhihao Jia. 2023. EINNET: Optimizing Tensor Pro- grams with Derivation-Based Transformations. In17th USENIX Sym- posium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 739–755.https:/...