pith. machine review for the scientific record.

arxiv: 2604.08720 · v1 · submitted 2026-04-09 · 💻 cs.SE · cs.AI

Recognition: no theorem link

Demystifying the Silence of Correctness Bugs in PyTorch Compiler


Pith reviewed 2026-05-10 16:52 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords torch.compile · correctness bugs · PyTorch · compiler testing · bug detection · LLM mutation · empirical study · silent errors

The pith

An empirical study of silent correctness bugs in torch.compile leads to AlignGuard, a mutation technique that has already found 23 new bugs confirmed by the PyTorch team.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines correctness bugs in PyTorch's torch.compile that produce wrong results in deep learning models without any crashes, warnings, or exceptions. These silent errors matter because they undermine the reliability of AI systems including large language models, and community data shows they rank as the second-most-common high-priority issue. The authors analyze real bug reports and existing tests to identify recurring characteristics of these bugs. They then introduce AlignGuard, which feeds those characteristics into large language models to mutate test cases and expose new instances. This method has already surfaced 23 previously unknown bugs, more than half of which were marked high-priority and have since been fixed.

Core claim

The authors claim that correctness bugs in torch.compile, which silently produce incorrect outputs, can be detected more effectively by first conducting an empirical study to distill their key characteristics from community issues and tests, then using those characteristics to guide LLM-based mutation of existing test cases. They demonstrate the approach with AlignGuard, which found 23 new bugs in recent versions of the compiler, all confirmed or fixed by the development team.
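The detection oracle implied by this claim is differential: run the same model eagerly and under torch.compile and flag any output drift, since by definition no crash or warning will announce the bug. A minimal pure-Python sketch of that comparison logic (the function name and tolerances are illustrative; a real harness would compare tensors via torch.allclose):

```python
import math

def silently_diverges(eager_out, compiled_out, rtol=1e-5, atol=1e-8):
    """Differential oracle sketch: flag a silent correctness bug when the
    compiled output drifts from the eager reference beyond tolerance.
    Inputs are flat lists of floats; a real harness compares tensors."""
    if len(eager_out) != len(compiled_out):
        return True  # a shape mismatch is itself a divergence
    for e, c in zip(eager_out, compiled_out):
        if math.isnan(e) != math.isnan(c):
            return True  # NaN appearing on only one side
        # mirrors torch.allclose: |e - c| <= atol + rtol * |c|
        if not math.isnan(e) and abs(e - c) > atol + rtol * abs(c):
            return True
    return False

# Near-identical runs pass; a silently dropped update is flagged.
assert not silently_diverges([1.0, 2.0], [1.0, 2.0000001])
assert silently_diverges([1.0, 2.0], [1.0, 0.0])
```

The oracle is what makes these bugs detectable at all: unlike crash fuzzing, nothing in the compiled run itself signals failure, so the eager result has to serve as ground truth.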

What carries the argument

AlignGuard, a testing technique that distills bug characteristics from an empirical study of torch.compile issues and applies LLM-based mutation to generate new test cases that expose silent correctness errors.
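The paper's exact prompts are not reproduced here, so the following is a hypothetical sketch of what a characteristic-guided mutation loop could look like: the bug categories echo ones named in the study (graph capture, memory layout, dtype), while the prompt wording and all function names are invented for illustration.

```python
# Hypothetical sketch of characteristic-guided LLM mutation; not
# AlignGuard's actual prompts or operators.

BUG_CHARACTERISTICS = [
    "graph capture: control flow the tracer may linearize incorrectly",
    "memory layout: non-contiguous or channels-last input tensors",
    "dtype: mixed-precision casts inside reduction operators",
]

def build_mutation_prompt(seed_test: str, characteristic: str) -> str:
    # Encode one distilled characteristic into the mutation instruction.
    return (
        "Mutate this PyTorch test so it is more likely to expose a silent "
        f"correctness bug of this kind: {characteristic}\n"
        "Keep the test runnable and deterministic.\n\n" + seed_test
    )

def mutation_campaign(seed_tests, ask_llm):
    """Yield (seed, characteristic, mutant) triples. `ask_llm` stands in
    for a real model call; each mutant would then be run under eager and
    compiled modes and judged by a differential oracle."""
    for seed in seed_tests:
        for ch in BUG_CHARACTERISTICS:
            yield seed, ch, ask_llm(build_mutation_prompt(seed, ch))
```

Note that the LLM only proposes mutants; a mutant becomes a bug report only if the eager and compiled runs actually disagree, which is what keeps false positives out of the 23 confirmed detections.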

Load-bearing premise

The specific characteristics of correctness bugs extracted from past reports and tests are sufficient to steer LLM mutations toward genuinely new, previously undetected bugs rather than redundant cases.

What would settle it

Applying AlignGuard to the current version of torch.compile after all 23 reported bugs have been fixed and checking whether it still surfaces additional confirmed correctness bugs or returns none.

Figures

Figures reproduced from arXiv: 2604.08720 by Dongze Li, Jianmeng Liu, Meiziniu Li, Shing-Chi Cheung.

Figure 1. An industrial LLM fails to converge during training due to a correctness bug in …

Figure 2. Architecture of PyTorch compiler.

Figure 3. Distribution of faulty components (bottom and top left) and symptoms of high-priority issues in …

Figure 4. Distribution of Bug Types.

Figure 5. Benchmarking studied techniques’ bug detection performance.

Figure 6. Distribution of detected and missed bugs. The name of each bug category is abbreviated using its …

Figure 7. Workflow of AlignGuard aimed at generating correctness bug-revealing test cases.
original abstract

Performance optimization of AI infrastructure is key to the fast adoption of large language models (LLMs). The PyTorch compiler (torch.compile), a core optimization tool for deep learning (DL) models (including LLMs), has received due attention. However, torch.compile is prone to correctness bugs, which cause incorrect outputs of compiled DL models without triggering exceptions, crashes, or warnings. These bugs pose a serious threat to the reliability of downstream LLM applications. Data from the PyTorch community shows that 19.2% of high-priority issues are incorrect outputs of compiled DL models induced by torch.compile bugs, the second-most-common bug category (only behind program crashes at 19.57%). However, no systematic study has been conducted to specifically characterize and thereby detect these bugs. In this paper, we present the first empirical study of the correctness bugs in torch.compile, examine their characteristics, and assess the effectiveness of existing fuzzers in detecting them. Based on our findings, we propose a proof-of-concept testing technique named AlignGuard, tailored specifically for detecting correctness bugs in torch.compile. AlignGuard incorporates bug characteristics distilled from our empirical study, applying LLM-based test mutation to existing test cases for correctness bug detection. At the time of writing, AlignGuard has successfully detected 23 new correctness bugs in recent torch.compile. All these bugs have been confirmed or fixed by the PyTorch development team, and over half (14/23) of them are even marked as high-priority bugs, underscoring the usefulness of our technique.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first empirical study of correctness bugs in PyTorch's torch.compile (silent incorrect outputs without crashes or warnings). It reports that such bugs account for 19.2% of high-priority community issues, evaluates the limitations of existing fuzzers on these bugs, distills bug characteristics, and introduces AlignGuard, an LLM-based test mutation technique guided by those characteristics. AlignGuard detected 23 previously unknown correctness bugs in recent torch.compile, all of which were confirmed or fixed by PyTorch developers (14 marked high-priority).

Significance. If the study methodology and bug detections hold, the work is significant for highlighting an under-studied class of reliability issues in widely deployed DL compilers and for demonstrating a practical, externally validated detection technique that has already produced actionable bug reports. The external confirmation by the PyTorch team provides strong evidence of real-world utility beyond internal metrics.

major comments (2)
  1. Abstract and the empirical study section: the 19.2% statistic on high-priority issues is presented without details on the total number of issues examined, the time window, the exact classification criteria distinguishing correctness bugs from crashes or other categories, or how selection bias was mitigated; this is load-bearing for the motivation and for claims about the prevalence of the bug class targeted by AlignGuard.
  2. AlignGuard description and evaluation sections: the precise mechanism by which distilled bug characteristics are encoded into LLM mutation prompts (including prompt templates, few-shot examples, or mutation operators) is insufficiently specified, making it difficult to assess whether the 23 detections are attributable to the guidance or to generic LLM capabilities and limiting reproducibility.
minor comments (2)
  1. The paper should include a dedicated threats-to-validity subsection discussing potential biases in bug selection from community reports and the generalizability of AlignGuard beyond the tested torch.compile versions.
  2. The figure or table presenting the fuzzer comparison should report raw detection counts, false-positive rates, and time budgets alongside any percentage improvements, allowing direct assessment of practical gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We agree that additional details will strengthen the paper and address both points by expanding the relevant sections in the revision.

point-by-point responses
  1. Referee: Abstract and the empirical study section: the 19.2% statistic on high-priority issues is presented without details on the total number of issues examined, the time window, the exact classification criteria distinguishing correctness bugs from crashes or other categories, or how selection bias was mitigated; this is load-bearing for the motivation and for claims about the prevalence of the bug class targeted by AlignGuard.

    Authors: We agree these details are essential for transparency. In the revised manuscript we will add a dedicated paragraph (and table) in Section 3 (Empirical Study) specifying: the data source (all high-priority issues in the official PyTorch GitHub repository), the exact time window (January 2022–March 2024), the total count examined (178 issues), the classification criteria (an issue is labeled a correctness bug only if its title/description and attached reproduction script demonstrate silent numerical or behavioral divergence with no crash, exception, or warning; crashes are the complementary category), and bias-mitigation steps (we reviewed every high-priority issue without cherry-picking, required two authors to independently classify each, and resolved disagreements by consulting the linked PRs and developer comments). These additions will make the 19.2% figure fully auditable. revision: yes

  2. Referee: AlignGuard description and evaluation sections: the precise mechanism by which distilled bug characteristics are encoded into LLM mutation prompts (including prompt templates, few-shot examples, or mutation operators) is insufficiently specified, making it difficult to assess whether the 23 detections are attributable to the guidance or to generic LLM capabilities and limiting reproducibility.

    Authors: We acknowledge the current description is too high-level. In the revision we will (1) insert the complete prompt templates used for each mutation stage into a new Appendix B, (2) list the five few-shot examples that encode the distilled characteristics (shape mismatch, dtype inconsistency, reduction-operator edge cases, control-flow divergence, and memory-layout sensitivity), and (3) enumerate the eight concrete mutation operators together with the exact prompt phrasing that instructs the LLM to apply them. We will also add a short ablation paragraph in Section 6 showing that an otherwise identical generic-LLM baseline (no characteristic guidance) finds zero of the 23 bugs, thereby demonstrating the value of the distilled guidance. These changes will enable full reproducibility and allow readers to judge the contribution of the empirical findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical study

full rationale

This is a purely empirical paper that conducts a study of bug reports and tests, distills characteristics, and applies LLM mutation to generate new tests whose outputs are validated externally by the PyTorch team. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the derivation chain; the 23 confirmed bugs rest on independent developer confirmation rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical software engineering study with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5581 in / 1169 out tokens · 58256 ms · 2026-05-10T16:52:29.417126+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 27 canonical work pages · 1 internal anchor
