Recognition: no theorem link
Demystifying the Silence of Correctness Bugs in PyTorch Compiler
Pith reviewed 2026-05-10 16:52 UTC · model grok-4.3
The pith
An empirical study of silent correctness bugs in torch.compile leads to AlignGuard, a mutation technique that has already found 23 new bugs confirmed by the PyTorch team.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that correctness bugs in torch.compile, which silently produce incorrect outputs, can be detected more effectively by first conducting an empirical study to distill their key characteristics from community issues and tests, then using those characteristics to guide LLM-based mutation of existing test cases. They demonstrate the approach with AlignGuard, which found 23 new bugs in recent versions of the compiler, all confirmed or fixed by the development team.
What carries the argument
AlignGuard, a testing technique that distills bug characteristics from an empirical study of torch.compile issues and applies LLM-based mutation to generate new test cases that expose silent correctness errors.
Load-bearing premise
The specific characteristics of correctness bugs extracted from past reports and tests are sufficient to steer LLM mutations toward genuinely new, previously undetected bugs rather than redundant cases.
What would settle it
Applying AlignGuard to the current version of torch.compile after all 23 reported bugs have been fixed and checking whether it still surfaces additional confirmed correctness bugs or returns none.
Figures
read the original abstract
Performance optimization of AI infrastructure is key to the fast adoption of large language models (LLMs). The PyTorch compiler (torch.compile), a core optimization tool for deep learning (DL) models (including LLMs), has received due attention. However, torch.compile is prone to correctness bugs, which cause incorrect outputs of compiled DL models without triggering exceptions, crashes, or warnings. These bugs pose a serious threat to the reliability of downstream LLM applications. Data from the PyTorch community shows that 19.2% of high-priority issues are incorrect outputs of compiled DL models induced by torch.compile bugs, the second-most-common bug category (only behind program crashes at 19.57%). However, no systematic study has been conducted to specifically characterize and thereby detect these bugs. In this paper, we present the first empirical study of the correctness bugs in torch.compile, examine their characteristics, and assess the effectiveness of existing fuzzers in detecting them. Based on our findings, we propose a proof-of-concept testing technique named AlignGuard, tailored specifically for detecting correctness bugs in torch.compile. AlignGuard incorporates bug characteristics distilled from our empirical study, applying LLM-based test mutation to existing test cases for correctness bug detection. At the time of writing, AlignGuard has successfully detected 23 new correctness bugs in recent torch.compile. All these bugs have been confirmed or fixed by the PyTorch development team, and over half (14/23) of them are even marked as high-priority bugs, underscoring the usefulness of our technique.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first empirical study of correctness bugs in PyTorch's torch.compile (silent incorrect outputs without crashes or warnings). It reports that such bugs account for 19.2% of high-priority community issues, evaluates the limitations of existing fuzzers on these bugs, distills bug characteristics, and introduces AlignGuard, an LLM-based test mutation technique guided by those characteristics. AlignGuard detected 23 previously unknown correctness bugs in recent torch.compile, all of which were confirmed or fixed by PyTorch developers (14 marked high-priority).
Significance. If the study methodology and bug detections hold, the work is significant for highlighting an under-studied class of reliability issues in widely deployed DL compilers and for demonstrating a practical, externally validated detection technique that has already produced actionable bug reports. The external confirmation by the PyTorch team provides strong evidence of real-world utility beyond internal metrics.
major comments (2)
- Abstract and the empirical study section: the 19.2% statistic on high-priority issues is presented without details on the total number of issues examined, the time window, the exact classification criteria distinguishing correctness bugs from crashes or other categories, or how selection bias was mitigated; this is load-bearing for the motivation and for claims about the prevalence of the bug class targeted by AlignGuard.
- AlignGuard description and evaluation sections: the precise mechanism by which distilled bug characteristics are encoded into LLM mutation prompts (including prompt templates, few-shot examples, or mutation operators) is insufficiently specified, making it difficult to assess whether the 23 detections are attributable to the guidance or to generic LLM capabilities and limiting reproducibility.
minor comments (2)
- The paper should include a dedicated threats-to-validity subsection discussing potential biases in bug selection from community reports and the generalizability of AlignGuard beyond the tested torch.compile versions.
- Figure or table presenting the fuzzer comparison should report raw detection counts, false-positive rates, and time budgets alongside any percentage improvements to allow direct assessment of practical gains.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We agree that additional details will strengthen the paper and address both points by expanding the relevant sections in the revision.
read point-by-point responses
-
Referee: Abstract and the empirical study section: the 19.2% statistic on high-priority issues is presented without details on the total number of issues examined, the time window, the exact classification criteria distinguishing correctness bugs from crashes or other categories, or how selection bias was mitigated; this is load-bearing for the motivation and for claims about the prevalence of the bug class targeted by AlignGuard.
Authors: We agree these details are essential for transparency. In the revised manuscript we will add a dedicated paragraph (and table) in Section 3 (Empirical Study) specifying: the data source (all high-priority issues in the official PyTorch GitHub repository), the exact time window (January 2022–March 2024), the total count examined (178 issues), the classification criteria (an issue is labeled a correctness bug only if its title/description and attached reproduction script demonstrate silent numerical or behavioral divergence with no crash, exception, or warning; crashes are the complementary category), and bias-mitigation steps (we reviewed every high-priority issue without cherry-picking, required two authors to independently classify each, and resolved disagreements by consulting the linked PRs and developer comments). These additions will make the 19.2% figure fully auditable. revision: yes
-
Referee: AlignGuard description and evaluation sections: the precise mechanism by which distilled bug characteristics are encoded into LLM mutation prompts (including prompt templates, few-shot examples, or mutation operators) is insufficiently specified, making it difficult to assess whether the 23 detections are attributable to the guidance or to generic LLM capabilities and limiting reproducibility.
Authors: We acknowledge the current description is too high-level. In the revision we will (1) insert the complete prompt templates used for each mutation stage into a new Appendix B, (2) list the five few-shot examples that encode the distilled characteristics (shape mismatch, dtype inconsistency, reduction-operator edge cases, control-flow divergence, and memory-layout sensitivity), and (3) enumerate the eight concrete mutation operators together with the exact prompt phrasing that instructs the LLM to apply them. We will also add a short ablation paragraph in Section 6 showing that an otherwise identical generic-LLM baseline (no characteristic guidance) finds zero of the 23 bugs, thereby demonstrating the value of the distilled guidance. These changes will enable full reproducibility and allow readers to judge the contribution of the empirical findings. revision: yes
Circularity Check
No significant circularity in empirical study
full rationale
This is a purely empirical paper that conducts a study of bug reports and tests, distills characteristics, and applies LLM mutation to generate new tests whose outputs are validated externally by the PyTorch team. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the derivation chain; the 23 confirmed bugs rest on independent developer confirmation rather than internal construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael L...
-
[2]
Accessed: 2026
Bug Report For Correctness Bug Caused By Incorrect Graph Caching. Accessed: 2026. https://github.com/pytorch/ pytorch/issues/125387
2026
-
[3]
Songqiang Chen, Congying Xu, Jingyi Chen, Jialun Cao, Jiarong Wu, and Shing-Chi Cheung. 2026. Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode.ACM Trans. Softw. Eng. Methodol.(Jan. 2026). doi:10.1145/3790101 Just Accepted
-
[4]
Accessed: 2026
Bug Report For Correctness Bug Caused By Memory Layout Conflict. Accessed: 2026. https://github.com/pytorch/ pytorch/issues/130290
2026
-
[5]
Accessed: 2026
Bug Report For Correctness Bug Caused By Incorrect Graph Construction. Accessed: 2026. https://github.com/pytorch/ pytorch/issues/105929
2026
-
[6]
Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2024. Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computin...
- [7]
-
[8]
Accessed: 2026
Bug Report For Correctness Bug Caused By Configuration Error. Accessed: 2026. https://github.com/pytorch/pytorch/ issues/115260
2026
-
[9]
Accessed: 2026
Bug Report For Correctness Bug Caused By Configuration Error. Accessed: 2026. https://github.com/pytorch/pytorch/ issues/100775
2026
-
[10]
Accessed: 2026
Bug Report For Correctness Bug Caused By Configuration Error. Accessed: 2026. https://github.com/pytorch/pytorch/ issues/113012. , Vol. 1, No. 1, Article . Publication date: April 2026. Demystifying the Silence of Correctness Bugs in PyTorch Compiler 31
2026
-
[11]
Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. 2024. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (W ACVW). IEEE, 910–919
2024
-
[12]
Accessed: 2026
Glow. Accessed: 2026. https://ai.facebook.com/tools/glow/
2026
-
[13]
Gwihwan Go, Chijin Zhou, Quan Zhang, Xiazijian Zou, Heyuan Shi, and Yu Jiang. 2024. Towards More Complete Constraints for Deep Learning Library Testing via Complementary Set Guided Refinement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis(Vienna, Austria)(ISSTA 2024). Association for Computing Machinery, Ne...
-
[14]
Nima Shiri Harzevili, Moshi Wei, Mohammad Mahdi Mohajer, Hung Viet Pham, and Song Wang. 2026. Evaluating API-Level Deep Learning Fuzzers: A Comprehensive Benchmarking Study.ACM Trans. Softw. Eng. Methodol.35, 2, Article 34 (Jan. 2026), 34 pages. doi:10.1145/3729533
-
[15]
Accessed: 2026
Brian Hirsh. Accessed: 2026. Functionalization in PyTorch: Everything You Wanted To Know. https://dev-discuss. pytorch.org/t/functionalization-in-pytorch-everything-you-wanted-to-know/965
2026
-
[16]
Shuo Hong, Hailong Sun, Xiang Gao, and Shin Hwei Tan. 2024. Investigating and Detecting Silent Bugs in PyTorch Programs. In2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 272–283. doi:10.1109/SANER60148.2024.00035
-
[17]
Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. InProceedings of the ACM/IEEE 42nd International Conference on Software Engineering(Seoul, South Korea)(ICSE ’20). Association for Computing Machinery, New York, NY, USA, 1110–1121. doi:10.1145/33...
-
[18]
Accessed: 2026
Bug Report For Correctness Bug Caused By Incorrect Operator Implementation. Accessed: 2026. https://github.com/ pytorch/pytorch/issues/147450
2026
-
[19]
Accessed: 2026
Bug Report For Correctness Bug Caused By Incorrect Operator Implementation. Accessed: 2026. https://github.com/ pytorch/pytorch/issues/114302
2026
-
[20]
Accessed: 2026
OLMo in-loop evals change with torch.compile() in 2.7.0. Accessed: 2026. https://github.com/pytorch/pytorch/issues/ 150516
2026
-
[21]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id= chfJJYC3iL
2025
- [22]
-
[23]
Eliska Kloberdanz, Kyle G Kloberdanz, and Wei Le. 2022. DeepStability: A study of unstable numerical methods and their solutions in deep learning. InProceedings of the 44th international conference on software engineering. 586–597
2022
-
[24]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
2023
-
[25]
Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. 2023. NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’23). ACM, 530–543. doi:10.1145...
- [26]
- [27]
- [28]
- [29]
-
[30]
Miao Miao, Sriteja Kummita, Eric Bodden, and Shiyi Wei. 2025. Program Feature-Based Benchmarking for Fuzz Testing. Proc. ACM Softw. Eng.2, ISSTA, Article ISSTA024 (June 2025), 23 pages. doi:10.1145/3728899
-
[31]
02 JUL 2024
Sunita Nadampalli. 02 JUL 2024. Accelerated PyTorch inference with torch.compile on AWS Graviton proces- sors. https://aws.amazon.com/blogs/machine-learning/accelerated-pytorch-inference-with-torch-compile-on-aws- graviton-processors/
2024
-
[32]
Accessed: 2026
nGraph. Accessed: 2026. https://www.intel.com/content/www/us/en/artificialintelligence/ngraph.html. , Vol. 1, No. 1, Article . Publication date: April 2026. 32 Meiziniu Li, Dongze Li, Jianmeng Liu, and Shing-Chi Cheung*
2026
-
[33]
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William ...
work page internal anchor Pith review arXiv 2024
-
[34]
Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A comprehensive study of deep learning compiler bugs. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Athens, Greece)(ESEC/FSE 2021). Association for Computing Mach...
-
[35]
Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, and Zan Wang. 2025.A Tale of Two DL Cities: When Library Tests Meet Compiler. IEEE Press, 2201–2212. https://doi.org/10.1109/ ICSE55347.2025.00025
-
[36]
Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, and Giuliano Antoniol. 2023. Silent bugs in Deep learning frameworks: An empirical study of Keras and tensorflow.Empirical Software Engineering29, 1 (Nov 2023). doi:10.1007/ s10664-023-10389-6
2023
-
[37]
Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(Phoenix, AZ, USA)(MAPL 2019). Association for Computing Machinery, New York, NY, USA, 10–19. doi:10.1145/3315508.3329973
-
[38]
Accessed: 2026
torch.compile ignores a complex indexing assignment and produces incorrect result. Accessed: 2026. https://github. com/pytorch/pytorch/issues/177821
2026
-
[39]
Accessed: 2026
torch.compile produces incorrect result when conducting broadcast add on complex tensor. Accessed: 2026. https: //github.com/pytorch/pytorch/issues/174891
2026
-
[40]
Accessed: 2026
torch.compile’s RNG state is inconsistent with eager on non-contiguous tensor when size >= 16. Accessed: 2026. https://github.com/pytorch/pytorch/issues/177652
2026
-
[41]
Accessed: 2026
Bug Report For Correctness Bug Caused By Incorrect Operator Transformation. Accessed: 2026. https://github.com/ pytorch/pytorch/issues/117019
2026
-
[42]
Accessed: 2026
TVM. Accessed: 2026. https://tvm.apache.org
2026
- [43]
- [44]
-
[45]
Xiongfei Wu, Jinqiu Yang, Lei Ma, Yinxing Xue, and Jianjun Zhao. 2022. On the usage and development of deep learning compilers: An empirical study on TVM.Empirical Software Engineering27, 7 (Sep 2022). doi:10.1007/s10664-022-10221-7
-
[46]
Congying Xu, Songqiang Chen, Jiarong Wu, Shing-Chi Cheung, Valerio Terragni, Hengcheng Zhu, and Jialun Cao
-
[47]
MR-Adopt: Automatic Deduction of Input Transformation Function for Metamorphic Testing. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering(Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY, USA, 557–569. doi:10.1145/3691620.3696020
-
[48]
Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2024. WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models.Proc. ACM Program. Lang.8, OOPSLA2, Article 296 (Oct. 2024), 27 pages. doi:10.1145/3689736
-
[49]
Yilin Yang, Tianxing He, Zhilong Xia, and Yang Feng. 2022. A comprehensive empirical study on bug characteristics of deep learning frameworks.Information and Software Technology151 (2022), 107004. doi:10.1016/j.infsof.2022.107004
- [50]
- [51]
-
[52]
Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Yi Fang, Sirui Ding, Jiashuo Wang, Kaishuai Xu, et al. 2025. Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence1, 1 (2025), 9. , Vol. 1, No. 1, Article . Publication date: April 2026
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.