pith. sign in

arxiv: 2508.20340 · v4 · submitted 2025-08-28 · 💻 cs.SE · cs.AI· cs.PL

Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators

Pith reviewed 2026-05-18 21:26 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.PL
keywords SMT solver fuzzingLLM-assisted testingterm generator synthesiscontext-free grammarsbug detectionZ3cvc5test input generation
0
0 comments X

The pith

Once4All uses one-time LLM synthesis of term generators to fill formula skeletons and expose 43 bugs in SMT solvers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Once4All addresses the challenges of using LLMs for SMT solver testing by synthesizing generators for terms rather than generating entire formulas directly. It extracts context-free grammars from solver documentation to guide the creation of composable term generators that produce valid logical expressions. These terms are then used to fill in structural skeletons extracted from existing formulas, ensuring syntactic correctness and promoting diversity in the test inputs. This design limits LLM interactions to a single upfront phase, reducing costs while still enabling the discovery of 43 confirmed bugs in leading solvers like Z3 and cvc5, with 40 already fixed.

Core claim

Once4All automatically extracts context-free grammars for SMT theories, including solver-specific extensions, from documentation using LLMs and synthesizes composable Boolean term generators that adhere to these grammars. It then uses these generators to populate structural skeletons derived from existing formulas with iteratively produced terms. This ensures syntactic validity and semantic diversity in the generated test formulas while requiring only one-time LLM interaction investment.

What carries the argument

LLM-synthesized composable term generators built from extracted context-free grammars, which produce reusable logical terms for insertion into formula skeletons.

If this is right

  • All generated test formulas remain syntactically valid by following the extracted grammars.
  • Runtime cost drops because LLM calls occur only once for generator synthesis rather than repeatedly.
  • The approach reaches advanced solver features that earlier testing methods struggled to exercise.
  • It has already produced 43 confirmed bugs across Z3 and cvc5, with 40 fixed by the developers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same skeleton-plus-generator pattern could apply to fuzzing other logic solvers or verifiers that rely on formal formulas.
  • A growing collection of these generators might support continuous low-cost regression testing as solvers release updates.
  • Adding light coverage feedback could help choose which existing skeletons are most worth repopulating with fresh terms.

Load-bearing premise

That terms produced by the LLM-synthesized generators, when inserted into structural skeletons from existing formulas, will generate semantically diverse inputs capable of exposing previously unknown solver bugs.

What would settle it

A head-to-head run on Z3 and cvc5 showing whether Once4All triggers more unique confirmed bugs than direct LLM formula generation or standard mutation fuzzing on the same solver versions.

Figures

Figures reproduced from arXiv: 2508.20340 by Maolin Sun, Yibiao Yang, Yuming Zhou.

Figure 1
Figure 1. Figure 1: A bug-revealing formula involving the Seq theory extended in cvc5. #11924 3 APPROACH In this section, we provide an overview of our proposed approach, Sphinx, followed by a detailed explanation. 3.1 Overview This study introduces a novel approach called Sphinx, illustrated in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Sphinx. context-free grammar (CFG) for each theory. This summarization (Line 5) is guided by a pre-defined prompt template, which is shown in Figure 3a. Subsequently, these extracted CFGs form the basis for implementing the generators (Line 7). Sphinx again leverages LLMs, providing them with the CFGs and a detailed instruction as shown in Figure 3b. This step aims to produce generators capable… view at source ↗
Figure 3
Figure 3. Figure 3: Prompt template used in LLM-assisted generator [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustrative process of Sphinx’s skeleton-guided mutation [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Confirmed bugs that affect the corresponding re [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Selected bug samples in Z3 and cvc5. theories like finite fields and newly introduced higher-order the￾ories, the proportion of valid formulas was often below 30%. In contrast, theories like real arithmetic fared much better, with over 90% valid formulas. Our correction mechanism substantially im￾proved these outcomes. For most theories, the proportion of valid formulas increased to nearly 100%. Even in th… view at source ↗
read the original abstract

Satisfiability Modulo Theory (SMT) solvers are foundational to modern systems and programming languages research, providing the foundation for tasks like symbolic execution and automated verification. Because these solvers sit on the critical path, their correctness is essential, and high-quality test formulas are key to uncovering bugs. However, while prior testing techniques performed well on earlier solver versions, they struggle to keep pace with rapidly evolving features. Recent approaches based on Large Language Models (LLMs) show promise in exploring advanced solver capabilities, but two obstacles remain: nearly half of the generated formulas are syntactically invalid, and iterative interactions with LLMs introduce substantial computational overhead. In this study, we present Once4All, a novel LLM-assisted fuzzing framework that addresses both issues by shifting from direct formula generation to the synthesis of generators for reusable terms (i.e., logical expressions). Specifically, Once4All uses LLMs to (1) automatically extract context-free grammars (CFGs) for SMT theories, including solver-specific extensions, from documentation, and (2) synthesize composable Boolean term generators that adhere to these grammars. During fuzzing, Once4All populates structural skeletons derived from existing formulas with the terms iteratively produced by the LLM-synthesized generators. This design ensures syntactic validity while promoting semantic diversity. Notably, Once4All requires only one-time LLM interaction investment, dramatically reducing runtime cost. We evaluated Once4All on two leading SMT solvers: Z3 and cvc5. Our experiments show that Once4All has identified 43 confirmed bugs, 40 of which have already been fixed by developers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Once4All, an LLM-assisted fuzzing framework for SMT solvers. LLMs are used to extract context-free grammars (CFGs) from solver documentation (including extensions) and to synthesize composable term generators that adhere to these grammars. During fuzzing, structural skeletons extracted from existing formulas are populated with terms iteratively produced by the generators. This design guarantees syntactic validity, aims to increase semantic diversity, and requires only a single round of LLM interaction. Evaluation on Z3 and cvc5 reports 43 confirmed bugs, 40 of which have already been fixed by the developers.

Significance. The empirical outcome of 43 confirmed bugs (40 fixed) in two widely used SMT solvers demonstrates practical value for solver testing and verification. The one-time LLM investment model directly addresses the overhead problem of prior iterative LLM approaches. If the LLM-synthesized generators can be shown to deliver semantic diversity beyond what random CFG sampling achieves on the same skeletons, the technique would represent a meaningful advance in grammar-guided fuzzing for evolving solver features.

major comments (1)
  1. Evaluation section: the central claim that LLM-synthesized generators produce semantically diverse terms capable of exposing previously unknown solver behaviors rests on the 43 confirmed bugs. However, the evaluation provides no ablation study, coverage statistics, or term-distribution metrics that isolate the LLM synthesis step from a baseline of random term generation under the identical CFG and skeleton regime. Without this comparison, the reported bugs cannot be confidently attributed to the claimed semantic-diversity mechanism rather than skeleton reuse alone.
minor comments (2)
  1. Abstract: the statement that 'nearly half of the generated formulas are syntactically invalid' in prior LLM approaches should include a citation to the specific prior work being referenced.
  2. Throughout: ensure that the first occurrence of each acronym (SMT, CFG, LLM) is accompanied by its full expansion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: Evaluation section: the central claim that LLM-synthesized generators produce semantically diverse terms capable of exposing previously unknown solver behaviors rests on the 43 confirmed bugs. However, the evaluation provides no ablation study, coverage statistics, or term-distribution metrics that isolate the LLM synthesis step from a baseline of random term generation under the identical CFG and skeleton regime. Without this comparison, the reported bugs cannot be confidently attributed to the claimed semantic-diversity mechanism rather than skeleton reuse alone.

    Authors: We agree that the current evaluation does not include a direct ablation isolating the contribution of the LLM-synthesized generators from random term generation under the same CFG and skeleton regime. The 43 confirmed bugs (40 fixed) demonstrate the practical effectiveness of the overall Once4All pipeline, including its one-time LLM investment and syntactic-validity guarantee. However, to more rigorously attribute results to semantic diversity introduced by the LLM step rather than skeleton reuse alone, we will add an ablation study in the revised manuscript. This will report term-distribution metrics, coverage statistics, and bug-finding results comparing the LLM-synthesized generators against a random baseline that samples from the identical extracted CFGs and populates the same skeletons. We expect these additional results to clarify the incremental benefit of the LLM synthesis. revision: yes

Circularity Check

0 steps flagged

Empirical bug-finding claims rest on external developer confirmations with no internal derivation chain

full rationale

The paper presents a practical fuzzing framework that extracts CFGs via LLM, synthesizes term generators, and populates existing formula skeletons. Its central results consist of 43 externally confirmed bugs (40 fixed by independent developers). No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The evaluation is observational and relies on real-world bug reports rather than any self-referential construction or renaming of inputs as outputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that LLMs can reliably extract accurate CFGs from documentation and synthesize generators that produce useful terms; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption LLMs can extract context-free grammars for SMT theories including solver-specific extensions directly from documentation.
    Invoked when the method shifts from direct formula generation to grammar extraction and generator synthesis.

pith-pipeline@v0.9.0 · 5831 in / 1300 out tokens · 35375 ms · 2026-05-18T21:26:54.781492+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Sphinx uses LLMs to (1) automatically extract context-free grammars (CFGs) for SMT theories... and (2) synthesize composable Boolean term generators that adhere to these grammars. During fuzzing, Sphinx populates structural skeletons derived from existing formulas with the terms iteratively produced by the LLM-synthesized generators.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    The self-correction mechanism... tasking a newly constructed generator with producing a small set of sample terms... If any of the sample terms are found to be invalid, Sphinx initiates a refinement loop

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 2 internal anchors

  1. [1]

    Clark Barrett, Pascal Fontaine, and Cesare Tinelli. 2022. The satisfiability modulo theories library (SMT-LIB). Retrieved 2022-08-18 from http://smtlib.cs.uiowa.edu/

  2. [2]

    Clark Barrett, Pascal Fontaine, and Cesare Tinelli. 2025. The SMT-LIB Standard: Version 2.7. Technical Report. Department of Computer Science, The University of Iowa. Available at www.SMT-LIB.org

  3. [3]

    Clark Barrett and Cesare Tinelli. 2018. Satisfiability modulo theories. InHandbook of model checking. Springer, 305–343

  4. [4]

    Robert Brummayer and Armin Biere. 2009. Fuzzing and Delta-Debugging SMT Solvers. In Proceedings of the 7th International Workshop on Satisfiability Modulo Theories (Montreal, Canada) (SMT ’09). Association for Computing Machinery, New York, NY, USA, 1–5. https://doi.org/10.1145/1670412.1670413

  5. [5]

    Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. Klee: unassisted and automatic generation of high-coverage tests for complex systems programs.. In OSDI, Vol. 8. 209–224

  6. [6]

    Junjie Chen, Wenxiang Hu, Dan Hao, Yingfei Xiong, Hongyu Zhang, Lu Zhang, and Bing Xie. 2016. An empirical comparison of compiler testing techniques. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016 , Laura K. Dillon, Willem Visser, and Laurie A. Williams (Eds.). ACM, 180–190. https:/...

  7. [7]

    The International SMT Competition. 2020. SMT-COMP. Retrieved 2020-05-15 from https://smt-comp.github.io/2019/index.html

  8. [8]

    cvc5. 2025. Non-standard or extended theories in cvc5 . Retrieved 2024- 12-1 from https://cvc5.github.io/docs-ci/docs-main/theories/theories.html#non- standard-or-extended-theories

  9. [9]

    Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340

  10. [10]

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Ling- ming Zhang. 2022. Fuzzing Deep-Learning Libraries via Large Language Mod- els. CoRR abs/2212.14834 (2022). https://doi.org/10.48550/arXiv.2212.14834 arXiv:2212.14834

  11. [11]

    Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023. Large Language Models are Edge-Case Fuzzers: Testing Deep Learning Libraries via FuzzGPT.CoRR abs/2304.02014 (2023). https: //doi.org/10.48550/arXiv.2304.02014 arXiv:2304.02014

  12. [12]

    Ahmed El-Hassany, Petar Tsankov, Laurent Vanbever, and Martin T. Vechev

  13. [13]

    In 15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018, Renton, W A, USA, April 9-11, 2018, Sujata Banerjee and Srinivasan Seshan (Eds.)

    NetComplete: Practical Network-Wide Configuration Synthesis with Au- tocompletion. In 15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018, Renton, W A, USA, April 9-11, 2018, Sujata Banerjee and Srinivasan Seshan (Eds.). USENIX Association, 579–594. https://www.usenix. org/conference/nsdi18/presentation/el-hassany

  14. [14]

    Rongxin Han, Jingyu Wang, Qi Qi, Haifeng Sun, Chaowei Xu, Zhaoyang Wan, Zirui Zhuang, Yichuan Yu, and Jianxin Liao. 2024. NetRen: Service Migration- Driven Network Renascence with Synthesizing Updated Configuration. In Pro- ceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3,...

  15. [15]

    Travis Hance, Andrea Lattuada, Chris Hawblitzel, Jon Howell, Rob Johnson, and Bryan Parno. 2020. Storage Systems are Distributed Systems (So Verify Them That Way!). In 14th USENIX Symposium on Operating Systems Design and Imple- mentation, OSDI 2020, Virtual Event, November 4-6, 2020 . USENIX Association, 99–115. https://www.usenix.org/conference/osdi20/p...

  16. [16]

    Lorch, Arjun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill

    Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Arjun Narayan, Bryan Parno, Danfeng Zhang, and Brian Zill. 2014. Ironclad Apps: End-to-End Security via Automated Full-System Verification. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’14, Broomfield, CO, USA, October 6-8, 2014, Jason Flinn and Hank Levy (Eds.). USENIX Associa...

  17. [17]

    Heqing Huang. 2023. CVE-2020-19725. Retrieved 2023-11-5 from https://nvd.nist. gov/vuln/detail/CVE-2020-19725#

  18. [18]

    Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019, Tim Brecht and Carey Williamson (Eds.). AC...

  19. [19]

    Eunsuk Kang, Stéphane Lafortune, and Stavros Tripakis. 2019. Automated Syn- thesis of Secure Platform Mappings. In Computer Aided Verification - 31st Interna- tional Conference, CA V 2019, New York City, NY, USA, July 15-18, 2019, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11561) , Isil Dillig and Serdar Tasiran (Eds.). Springer, 219–237...

  20. [20]

    Jongwook Kim, Sunbeom So, and Hakjoo Oh. 2023. Diver: Oracle-Guided SMT Solver Testing with Unrestricted Random Mutations. In 45th IEEE/ACM Interna- tional Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2224–2236. https://doi.org/10.1109/ICSE48619.2023.00187

  21. [21]

    Gereon Kremer, Aina Niemetz, and Mathias Preiner. 2021. ddSMT 2.0: Better Delta Debugging for the SMT-LIBv2 Language and Friends. In Computer Aided Verification - 33rd International Conference, CA V 2021, Virtual Event, July 20-23, 2021, Proceedings, Part II (Lecture Notes in Computer Science, Vol. 12760), Alexandra Silva and K. Rustan M. Leino (Eds.). Sp...

  22. [22]

    Cong Li, Yanyan Jiang, Chang Xu, and Zhendong Su. 2023. Validating JIT Compil- ers via Compilation Space Exploration. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023 , Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace (Eds.). ACM, 66–79. https://doi.org...

  23. [23]

    Yi Li, Aws Albarghouthi, Zachary Kincaid, Arie Gurfinkel, and Marsha Chechik

  24. [24]

    Symbolic optimization with SMT solvers. InThe 41st Annual ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, POPL ’14, San Diego, CA, USA, January 20-21, 2014 , Suresh Jagannathan and Peter Sewell (Eds.). ACM, 607–618. https://doi.org/10.1145/2535838.2535857

  25. [25]

    Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. 2023. NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, Ma...

  26. [26]

    Lopes, Nikolaj S

    Nuno P. Lopes, Nikolaj S. Bjørner, Patrice Godefroid, Karthick Jayaraman, and George Varghese. 2015. Checking Beliefs in Dynamic Networks. In 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 15, Oakland, CA, USA, May 4-6, 2015 . USENIX Association, 499–512. https://www.usenix.org/ conference/nsdi15/technical-sessions/presentation/lopes

  27. [27]

    Muhammad Numair Mansur, Maria Christakis, Valentin Wüstholz, and Fuyuan Zhang. 2020. Detecting critical bugs in SMT solvers using blackbox mutational fuzzing. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 701–712. https://doi.org/10.1145/3368089.3409763

  28. [28]

    MicroSoft. 2025. Z3 Guide. Retrieved 2024-12-1 from https://microsoft.github.io/ z3guide/

  29. [29]

    Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, Dylan Johnson, James Bornholt, Emina Torlak, and Xi Wang. 2017. Hyperkernel: Push-Button Ver- ification of an OS Kernel. In Proceedings of the 26th Symposium on Operat- ing Systems Principles, Shanghai, China, October 28-31, 2017 . ACM, 252–269. https://doi.org/10.1145/3132747.3132748

  30. [30]

    Aina Niemetz, Mathias Preiner, and Clark W. Barrett. 2022. Murxla: A Modular and Highly Extensible API Fuzzer for SMT Solvers. InComputer Aided Verification - 34th International Conference, CA V 2022, Haifa, Israel, August 7-10, 2022, Proceed- ings, Part II (Lecture Notes in Computer Science, Vol. 13372) , Sharon Shoham and Yakir Vizel (Eds.). Springer, 9...

  31. [31]

    OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023). https: //doi.org/10.48550/ARXIV.2303.08774 arXiv:2303.08774

  32. [32]

    Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2024. The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, ASPLOS 2024, Hilton La Jolla Torrey Pines, La Jolla, CA, USA, 27 ...

  33. [33]

    Jiwon Park, Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2021. Gener- ative type-aware mutation for testing SMT solvers. Proc. ACM Program. Lang. 5, OOPSLA (2021), 1–19. https://doi.org/10.1145/3485529

  34. [34]

    John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun Yang. 2012. Test-case reduction for C compiler bugs. InACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’12, Beijing, China - June 11 - 16, 2012 , Jan Vitek, Haibo Lin, and Frank Tip (Eds.). ACM, 335–346. https://doi.org/10.1145/2254064.2254104

  35. [35]

    SMT-LIB Initiative. 2025. News of SMT-LIB. https://smt-lib.org/news.shtml 11

  36. [36]

    Maolin Sun, Yibiao Yang, Yang Wang, Ming Wen, Haoxiang Jia, and Yuming Zhou

  37. [37]

    Gleiph Ghiotto, Leonardo Murta, Márcio Barros, and André van der Hoek

    SMT Solver Validation Empowered by Large Pre-Trained Language Models. In 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023 . IEEE, 1288–1300. https://doi.org/ 10.1109/ASE56229.2023.00180

  38. [38]

    Maolin Sun, Yibiao Yang, Ming Wen, Yongcong Wang, Yuming Zhou, and Hai Jin. 2023. Validating SMT Solvers via Skeleton Enumeration Empowered by Historical Bug-Triggering Inputs. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023 . IEEE, 69–81. https://doi.org/10.1109/ICSE48619.2023.00018

  39. [39]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucu- rull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony...

  40. [40]

    Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, and Xiangke Liao. 2022. Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022 . ACM, 287–298. https://doi.org/10.1145/3510003.3510062

  41. [41]

    Haoyu Wang, Junjie Chen, Chuyue Xie, Shuang Liu, Zan Wang, Qingchao Shen, and Yingquan Zhao. 2023. MLIRSmith: Random Program Generation for Fuzzing MLIR Compiler Infrastructure. In 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023 . IEEE, 1555–1566. https://doi.org/10.1109/ASE56229.2023.00120

  42. [42]

    Dominik Winterer and Zhendong Su. 2024. Validating SMT Solvers for Correct- ness and Performance via Grammar-Based Enumeration. Proc. ACM Program. Lang. 8, OOPSLA2 (2024), 2378–2401. https://doi.org/10.1145/3689795

  43. [43]

    Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. On the unusual effectiveness of type-aware operator mutations for testing SMT solvers. Proc. ACM Program. Lang. 4, OOPSLA (2020), 193:1–193:25. https://doi.org/10.1145/ 3428261

  44. [44]

    Dominik Winterer, Chengyu Zhang, and Zhendong Su. 2020. Validating SMT solvers via semantic fusion. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020 , Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 718–730. https://doi.org/10.1145/3385412.3385985

  45. [45]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024 . ACM, 126:1–126:13. https://doi. org/10.1145/3597503.3639121

  46. [46]

    Jones, Xiangyu Zhang, and Baowen Xu

    Xinmeng Xia, Yang Feng, Qingkai Shi, James A. Jones, Xiangyu Zhang, and Baowen Xu. 2024. Enumerating Valid Non-Alpha-Equivalent Programs for Interpreter Testing. ACM Trans. Softw. Eng. Methodol. (feb 2024). https: //doi.org/10.1145/3647994 Just Accepted

  47. [47]

    Peisen Yao, Heqing Huang, Wensheng Tang, Qingkai Shi, Rongxin Wu, and Charles Zhang. 2021. Skeletal approximation enumeration for SMT solver testing. In ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, Diomidis Spinellis, Georgios Gousios, Ma...

  48. [48]

    Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang, Xi- aoyang Sun, Lizhong Bian, Haibo Wang, and Zheng Wang. 2021. Automated conformance testing for JavaScript engines via deep compiler fuzzing. InPLDI ’21: 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Virtual Event, Canada, June 20-25, 202...

  49. [49]

    Qirun Zhang, Chengnian Sun, and Zhendong Su. 2017. Skeletal program enu- meration for rigorous compiler testing. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017 , Albert Cohen and Martin T. Vechev (Eds.). ACM, 347–361. https://doi.org/10.1145/3062341.3062379 12