pith. machine review for the scientific record.

arxiv: 2604.15272 · v1 · submitted 2026-04-16 · 💻 cs.PL · cs.AI · cs.LG

Recognition: unknown

Prism: Symbolic Superoptimization of Tensor Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 09:00 UTC · model grok-4.3

classification 💻 cs.PL · cs.AI · cs.LG
keywords symbolic · prism · programs · search · tensor · times · best · optimization

The pith

Prism is the first symbolic superoptimizer for tensor programs: it combines sGraph, a compact representation of program families, with a two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2× speedup over prior superoptimizers on LLM workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tensor programs are sequences of operations, such as matrix multiplications, that form the backbone of modern machine learning models, especially large language models. Finding the fastest way to execute these programs on specific hardware is challenging because there are many possible implementations, each with different performance characteristics. Traditional approaches either use compilers that apply fixed rules or superoptimizers that search exhaustively but struggle to scale.

Prism addresses this by introducing sGraph, a symbolic representation that can stand for many different concrete programs at once by treating some parameters symbolically rather than fixing them early. The system performs a two-level search: it first generates symbolic graphs that cover families of possible programs, then uses symbolic reasoning to eliminate provably bad choices based on operator semantics, mathematical identities, and hardware limits. Equivalence between different representations is checked using e-graph rewriting, which efficiently tracks sets of equivalent expressions. Finally, promising symbolic graphs are turned into actual code through auto-tuning.

This combination allows Prism to explore the optimization space more intelligently than previous methods. On five standard workloads from large language models, it found implementations that ran up to 2.2 times faster than those of the best existing superoptimizers and 4.9 times faster than those of standard compilers, while also cutting the time spent on optimization itself by up to 3.4 times.
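To make the two-level structure concrete, here is a minimal sketch of the search skeleton in Python. It is not the authors' implementation: the `SGraph` class, the shared-memory constraint, and all candidate values are invented for illustration, and the paper's pruning reasons over operator semantics and algebraic identities as well as hardware limits.

```python
# Minimal sketch of a two-level symbolic search in the spirit of Prism.
# SGraph, the constraint, and the parameter values are illustrative only.
from dataclasses import dataclass
from itertools import product

@dataclass
class SGraph:
    """A program family: operators fixed, some execution parameters symbolic."""
    ops: tuple                 # e.g. ("matmul", "softmax")
    params: dict               # symbolic parameter -> candidate values

def survives_pruning(sg: SGraph, smem_limit: int = 48 * 1024) -> bool:
    """Level 1: symbolic reasoning rejects whole families at once.
    Toy hardware constraint: even the smallest fp32 tile must fit in
    shared memory."""
    smallest = min(sg.params["tile"])
    return smallest * smallest * 4 <= smem_limit

def instantiate(sg: SGraph):
    """Level 2: enumerate concrete programs from a surviving family."""
    names = list(sg.params)
    for values in product(*(sg.params[n] for n in names)):
        yield dict(zip(names, values))

families = [
    SGraph(("matmul", "softmax"), {"tile": (32, 64), "stages": (2, 3)}),
    SGraph(("matmul", "softmax"), {"tile": (4096,), "stages": (2,)}),  # pruned
]
for sg in families:
    if survives_pruning(sg):
        for candidate in instantiate(sg):
            pass  # auto-tune / benchmark each concrete candidate here
```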

Core claim

Prism achieves up to 2.2× speedup over the best superoptimizers and 4.9× over the best compiler-based approaches, while reducing end-to-end optimization time by up to 3.4× on five LLM workloads.

Load-bearing premise

That the symbolic reasoning over operator semantics, algebraic identities, and hardware constraints can correctly and completely prune provably suboptimal regions of the search space without excluding optimal implementations.
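To see why this premise is load-bearing, consider what "provably suboptimal" has to mean for pruning to be sound: a whole family may be discarded only if a valid lower bound on its cost already exceeds the cost of some known candidate. The sketch below uses an invented numeric cost model purely to illustrate the bound-versus-estimate distinction; Prism's actual reasoning is over semantics, identities, and hardware constraints, not a cost model like this.

```python
# Toy illustration of sound pruning. The cost model and bounds are invented.

def cost(tile: int, n: int = 4096) -> float:
    """Invented cost: a traffic term that falls with tile size plus an
    overhead term that grows with it."""
    return n * n / tile + 0.01 * tile * tile

def family_lower_bound(tiles: range, n: int = 4096) -> float:
    """A valid lower bound over the whole family: bound each term separately.
    Underestimating is safe; overestimating could prune away the optimum."""
    return n * n / max(tiles) + 0.01 * min(tiles) ** 2

best_known = cost(1024)          # some concrete candidate already benchmarked
tiny_tiles = range(2, 5)         # a symbolic family of tile sizes {2, 3, 4}
if family_lower_bound(tiny_tiles) >= best_known:
    print("family pruned: provably no better than the best known candidate")
```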

Figures

Figures reproduced from arXiv: 2604.15272 by Mengdi Wu, Oded Padon, Xiaoyu Jiang, Zhihao Jia.

Figure 1. Graph representations of a fused Softmax-Matmul operation. (a) The input computation graph. (b) A concrete μGraph with specific mappings and parallelization parameters. (c) The symbolic graph (sGraph), where mappings and dimensions are represented as symbolic variables. For a graph with a data dimension $d$ of original size $D$, the per-block, per-iteration size of $d$ is $D / \sigma(T, d)$, where $\sigma(T, d) = \prod_{p \in \mathcal{P}} m_{T,d,p}$ … view at source ↗
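The caption's formula is simple enough to compute directly. In the hypothetical encoding below (the mapping layout and names are ours, not the paper's), `mapping[(d, p)]` holds $m_{T,d,p}$, the factor by which dimension $d$ is split across parallel dimension $p$:

```python
# Per-block, per-iteration size of a data dimension, per the caption's formula.
from math import prod

def sigma(mapping: dict, d: str) -> int:
    """sigma(T, d) = product over parallel dims p of m_{T,d,p}."""
    return prod(m for (dim, _p), m in mapping.items() if dim == d)

def per_block_size(D: int, mapping: dict, d: str) -> int:
    """Per-block, per-iteration size of d: D / sigma(T, d)."""
    s = sigma(mapping, d)
    assert D % s == 0, "d must divide evenly across its parallel dimensions"
    return D // s

# A 4096-wide dimension split 8 ways over blocks (x) and 4 ways over
# iterations (i): each block sees 4096 / (8 * 4) = 128 elements per step.
print(per_block_size(4096, {("d", "x"): 8, ("d", "i"): 4}, "d"))  # 128
```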
Figure 2. Overview of the Prism pipeline. sGraph Generation (§3): exhaustive search builds sGraphs with symbolic mappings; dimension matching and expression-guided pruning eliminate invalid branches. Mapping Instantiation (§3.4): enumerates candidate concrete mapping assignments satisfying all constraints. sGraph Verification (§4): equivalence checking using rewrite axioms. Parameter Instantiation (§5): random sampl… view at source ↗
Figure 3. Tensor representation with parallelization dimensions. Symmetry breaking: different mapping assignments may yield functionally identical μGraphs when they differ only by a permutation of dimensions in $\mathcal{P}_g$. To eliminate redundant verification, Prism retains only the lexicographically smallest assignment within each equivalence class, reducing the number of candidates by up to a factor of $k!$ for $k = |\mathcal{P}_g|$ … view at source ↗
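The symmetry-breaking rule in this caption is easy to state in code. A hedged sketch under an invented encoding, in which an assignment is a tuple indexed by the parallel dimensions in $\mathcal{P}_g$, so permutation-equivalent assignments collapse to one lexicographic representative:

```python
# Keep only the lexicographically smallest member of each permutation class.
from itertools import permutations

def canonical(assignment: tuple) -> tuple:
    """Lexicographically smallest permutation-equivalent assignment."""
    return min(permutations(assignment))

candidates = [("a", "b", "c"), ("c", "a", "b"), ("b", "c", "a"), ("a", "c", "b")]
survivors = {canonical(c) for c in candidates}
print(survivors)  # {('a', 'b', 'c')}: up to k! = 6 candidates collapse to one
```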
Figure 4. Parallel operators used in sGraph verification. For each operator, we derive its output expression from its input expressions and the operator semantics. For most operators (e.g., elementwise or matmul), this is straightforward. The key operators are the InputLoader and OutputSaver in the block graph: the InputLoader applies part or repl according to the imap (partitioning the tensor along the mapped dimension, or… view at source ↗
Figure 5. An example of expression derivation. The kernel graph has a single CustomOp that applies elementwise exponentiation with imap {r ↔ x} and omap {r ↔ x}. The Input operator partitions the input variable $v_I$ along the row dimension over parallel dimension $x$, yielding part($v_I$, $r$, $x$). After applying Exp, the OutputSaver combines the result, producing the final expression comb(exp(part($v_I$, $r$, $x$)), $r$, $x$). view at source ↗
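A small sketch of the derivation the caption walks through, with expressions as nested tuples. Here `part`, `comb`, and `exp_op` are symbolic constructors invented to mirror the paper's notation; they build expression trees rather than executing tensor operations.

```python
# Deriving the caption's final expression step by step.
def part(expr, dim, pdim):   # InputLoader: partition expr along dim over pdim
    return ("part", expr, dim, pdim)

def comb(expr, dim, pdim):   # OutputSaver: recombine partitions along dim
    return ("comb", expr, dim, pdim)

def exp_op(expr):            # the CustomOp: elementwise exponentiation
    return ("exp", expr)

v_I = ("var", "v_I")
loaded = part(v_I, "r", "x")        # imap {r <-> x}
computed = exp_op(loaded)           # elementwise, so the layout is unchanged
final = comb(computed, "r", "x")    # omap {r <-> x}
print(final)  # ('comb', ('exp', ('part', ('var', 'v_I'), 'r', 'x')), 'r', 'x')
```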
Figure 6. Kernel performance and optimization time across 5 workloads. Upper: relative kernel execution time. Lower: total optimization time breakdown. Mirage's time includes graph generation and profiling; Prism's time includes symbolic graph generation and instantiation; TVM's time is the Ansor auto-tuning time (1000 trials). view at source ↗
Original abstract

This paper presents Prism, the first symbolic superoptimizer for tensor programs. The key idea is sGraph, a symbolic, hierarchical representation that compactly encodes large classes of tensor programs by symbolically representing some execution parameters. Prism organizes optimization as a two-level search: it constructs symbolic graphs that represent families of programs, and then instantiates them into concrete implementations. This formulation enables structured pruning of provably suboptimal regions of the search space using symbolic reasoning over operator semantics, algebraic identities, and hardware constraints. We develop techniques for efficient symbolic graph generation, equivalence verification via e-graph rewriting, and parameter instantiation through auto-tuning. Together, these components allow Prism to bridge the rigor of exhaustive search with the scalability required for modern ML workloads. Evaluation on five commonly used LLM workloads shows that Prism achieves up to $2.2\times$ speedup over best superoptimizers and $4.9\times$ over best compiler-based approaches, while reducing end-to-end optimization time by up to $3.4\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach relies on domain assumptions about tensor operator semantics and algebraic identities for pruning; it introduces sGraph as a new representation, with no free parameters or fitted values mentioned in the abstract.

axioms (1)
  • domain assumption: Tensor operator semantics and algebraic identities can be symbolically modeled for equivalence verification and pruning of suboptimal programs.
    Invoked in the description of symbolic graph generation and e-graph rewriting for structured pruning; a toy sketch of the e-graph machinery follows this ledger.
invented entities (1)
  • sGraph · no independent evidence
    purpose: Compact symbolic hierarchical representation that encodes large classes of tensor programs by keeping some execution parameters symbolic.
    New representation introduced as the core of the two-level search approach.
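Since e-graph rewriting carries the equivalence-verification burden, here is a toy e-graph showing the mechanism the axiom relies on: merging two terms (one rewrite-axiom application) propagates, by congruence, to every expression built on top of them. This is a drastically simplified sketch in the spirit of equality saturation, with no pattern-based e-matching; all names are ours, not Prism's.

```python
# Toy e-graph: union-find over e-classes plus a hashcons table, with a
# congruence-closure rebuild. Illustration only.

class EGraph:
    def __init__(self):
        self.parent = []   # union-find: parent[i] is the parent of e-class i
        self.table = {}    # (op, canonical child classes) -> e-class id

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, op, *children):
        key = (op, tuple(self.find(c) for c in children))
        if key not in self.table:
            self.parent.append(len(self.parent))          # fresh e-class
            self.table[key] = len(self.parent) - 1
        return self.find(self.table[key])

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)
        # Rebuild congruence: nodes whose children became equal must merge.
        while True:
            merged = False
            canon = {}
            for (op, kids), cid in self.table.items():
                key = (op, tuple(self.find(k) for k in kids))
                if key in canon and self.find(canon[key]) != self.find(cid):
                    self.parent[self.find(cid)] = self.find(canon[key])
                    merged = True
                canon.setdefault(key, cid)
            if not merged:
                self.table = canon
                return

# Prove exp((a*b)*c) == exp(a*(b*c)) from one associativity application:
eg = EGraph()
a, b, c = eg.add("a"), eg.add("b"), eg.add("c")
ab_c = eg.add("*", eg.add("*", a, b), c)
a_bc = eg.add("*", a, eg.add("*", b, c))
e1 = eg.add("exp", ab_c)   # built before the merge
e2 = eg.add("exp", a_bc)
eg.merge(ab_c, a_bc)       # one associativity-axiom application
print(eg.find(e1) == eg.find(e2))  # True: congruence merged the exp nodes
```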

pith-pipeline@v0.9.0 · 5474 in / 1459 out tokens · 54652 ms · 2026-05-10T09:00:24.262403+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1]

    XLA: Optimizing Compiler for TensorFlow.https : //www

    2017. XLA: Optimizing Compiler for TensorFlow.https : //www. tensorflow.org/xla

  2. [2]

    Transformer related optimizations.https://github.com/NVIDIA/ FasterTransformer

    2020. Transformer related optimizations.https://github.com/NVIDIA/ FasterTransformer

  3. [3]

    Flash-Decoding for long-context inference.https : //crfm

    2023. Flash-Decoding for long-context inference.https : //crfm. stanford.edu/2023/10/12/flashdecoding.html

  4. [4]

    NVIDIA H100 Tensor Core GPU.https://www.nvidia.com/en- us/data- center/h100/

    2023. NVIDIA H100 Tensor Core GPU.https://www.nvidia.com/en- us/data- center/h100/

  5. [5]

    Murray, Benoit Steiner, Paul Tucker, Vijay Va- sudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

    Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv- ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Va- sudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

  6. [6]

    In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI)

    TensorFlow: A System for Large-Scale Machine Learning.. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI)

  7. [7]

    Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan- Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe

  8. [8]

    InProceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT)

    OpenTuner: An Extensible Framework for Program Autotun- ing. InProceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT). ACM

  9. [9]

    Sorav Bansal and Alex Aiken. 2006. Automatic Generation of Peephole Superoptimizers. InProceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems(San Jose, California, USA)(ASPLOS XII)

  10. [10]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dal- las Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, St...

  11. [11]

    Available: https://doi.org/10.48550/arXiv.1802.04799

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning.CoRRabs/1802.04799 (2018).http : //arxiv.org/abs/1802. 04799

  12. [12]

    Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to Optimize Tensor Programs. InAdvances in Neural Infor- mation Processing Systems 31

  13. [13]

    Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Co- hen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning.CoRRabs/1410.0759 (2014). http://arxiv.org/abs/1410.0759

  14. [14]

    Dense Linear Algebra on GPUs.https : //developer

    cuBLAS 2016. Dense Linear Algebra on GPUs.https : //developer. nvidia.com/cublas

  15. [15]

    Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. 2023. Flash-Decoding for Long-Context Inference

  16. [16]

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. 2024. FlashDe- coding++: Faster Large Language Model Inference on GPUs. arXiv:2311.01282 [cs.LG]

  17. [17]

    Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, and Zhihao Jia. 2024. Optimal Kernel Orchestration for Tensor Programs with Korch. InProceedings of the 29th ACM International Conference on Architectural Support for Pro- gramming Languages and Operating Sy...

  18. [18]

    Ganger, Tianqi Chen, and Zhihao Jia

    Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, and Zhihao Jia. 2025. GraphPipe: Improving Performance and Scalabil- ity of DNN Training with Graph Pipeline Parallelism. InProceedings of the 30th ACM International C...

  19. [19]

    Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Za- haria, and Alex Aiken. 2019. TASO: Optimizing Deep Learning Com- putation with Automatic Generation of Graph Substitutions. InPro- ceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada)(SOSP ’19). Association for Computing Machinery, New York, NY, US...

  20. [20]

    Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. InProceedings of the 2nd Conference on Systems and Machine Learning (SysML’19)

  21. [21]

    Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA Tensor Core Programmabil- ity, Performance & Precision. In2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. doi:10.1109/ipdpsw.2018.00091

  22. [22]

    Henry Massalin. 1987. Superoptimizer: a look at the smallest program. InACM SIGARCH Computer Architecture News, Vol. 15

  23. [23]

    Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan- Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines.ACM Trans. Graph.35, 4 (2016)

  24. [24]

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Ko- zlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. 2025. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131(2025)

  25. [25]

    Jongseok Park, Kyungmin Bin, Gibum Park, Sangtae Ha, and Kyung- han Lee. 2023. ASPEN: Breaking Operator Barriers for Efficient Paral- lelization of Deep Neural Networks. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 68625– 68638.https : //p...

  26. [26]

    Tensors and Dynamic neural networks in Python with strong GPU acceleration.https://pytorch.org

    PyTorch 2017. Tensors and Dynamic neural networks in Python with strong GPU acceleration.https://pytorch.org

  27. [27]

    Eric Schkufza, Rahul Sharma, and Alex Aiken. 2013. Stochastic super- optimization. InACM SIGPLAN Notices, Vol. 48

  28. [28]

    Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Zim- ing Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling Deep Learning Memory Access via Tile-graph. In17th USENIX Symposium on Operating Systems Design and Implementa- tion (OSDI 23). USENIX Association, Boston, MA, 701–718.https : //www.usenix.org/conference/osdi23/presentation/shi

  29. [29]

    NVIDIA TensorRT: Programmable Inference Acceler- ator.https://developer.nvidia.com/tensorrt

    TensorRT 2017. NVIDIA TensorRT: Programmable Inference Acceler- ator.https://developer.nvidia.com/tensorrt

  30. [30]

    McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken

    Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Patrick S. McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 2022. Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parall...

  31. [31]

    Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia. 2021. PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. In15th USENIX Sym- posium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 37–54.https : ...

  32. [32]

    Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. Astra: A Multi-Agent System for GPU Kernel Performance Optimization. arXiv:2509.07506 [cs.DC]https://arxiv.org/abs/2509.07506

  33. [33]

    Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummenhofer. 2026. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization. arXiv:2603.12440 [cs.DC] https://arxiv.org/abs/2603.12440

  34. [34]

    Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. 2021. egg: Fast and Extensible Equality Saturation.Proc. ACM Program. Lang.5, POPL, Article 23 (Jan. 2021), 29 pages. doi:10.1145/3434304

  35. [35]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Syl- vain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush

  36. [36]

    Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.https://github.com/huggingface/transformers

  37. [37]

    Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia

  38. [38]

    InProceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

    Mirage: A Multi-Level Superoptimizer for Tensor Programs. InProceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association

  39. [39]

    Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. 2021. Equality Saturation for Ten- sor Graph Superoptimization.Proceedings of Machine Learning and Systems3 (March 2021), 255–268

  40. [40]

    Ansor: Generating high-performance tensor programs for deep learning,

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor : Generat- ing High-Performance Tensor Programs for Deep Learning.CoRR abs/2006.06762 (2020). arXiv:2006.06762https://arxiv.org/abs/2006. 06762

  41. [41]

    Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, and Zhihao Jia. 2023. EINNET: Optimizing Tensor Pro- grams with Derivation-Based Transformations. In17th USENIX Sym- posium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 739–755.https:/...