arxiv: 2604.21570 · v1 · submitted 2026-04-23 · 💻 cs.SE

Recognition: unknown

SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification

Lezhi Ma , Shangqing Liu , Yi Li , Qiong Wu , Han Wang , Lei Bu

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:02 UTC · model grok-4.3

classification 💻 cs.SE

keywords specification synthesisLLM-based verificationprogram decompositionspecification refinementmutation discriminationformal methodssoftware correctnessreal-world verification

0 comments

The pith

SpecSyn uses program decomposition and mutation-based refinement to generate strong formal specifications for verifying real-world program properties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SpecSyn to overcome limitations in LLM-based specification generation for large programs, such as handling big codebases and ensuring specification strength. SpecSyn breaks the program into smaller segments for the LLM to process iteratively. It refines the specs by mutating the code semantically and discriminating variants to check and improve strength. This yields high precision over 90% and recall over 75%, outperforming prior methods and handling 1071 of 1365 properties in open-source code. A sympathetic reader would care because automating strong specs removes a major barrier to applying formal verification in practice.

Core claim

SpecSyn first decomposes the input program into individual segments, which are handled respectively by the subsequent iterative specification generation process. Innovatively, we incorporate into the process a specification refinement mechanism based on semantic-non-equivalent program mutations and variant discrimination, assessing and enhancing the semantic strength of the generated specifications. Extensive experiments show that SpecSyn maintains high precision over 90% and outstanding recall over 75%, significantly outperforming existing LLM-based approaches. In further evaluations, SpecSyn successfully handles 1071 out of 1365 target properties for open-source programs, proving its aplic

What carries the argument

The specification refinement mechanism based on semantic-non-equivalent program mutations and variant discrimination that assesses and enhances the semantic strength of LLM-generated specifications for each program segment.

If this is right

Decomposing programs allows LLMs to generate specifications for larger codebases that would otherwise be too big to handle directly.
The mutation and discrimination process provides a way to automatically strengthen specifications that are too weak for verification.
High precision and recall mean fewer false positives and more properties that can actually be verified automatically.
Success on over 78 percent of target properties in open-source programs indicates the method can be used in practice beyond small examples.
The approach combines decomposition with refinement to address both scalability and strength issues in specification generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be combined with other verification tools to create end-to-end automated verification pipelines for complex software.
If the refinement step is efficient, it might allow iterative improvement of specs until they pass verification or a timeout.
One could explore whether the same mutation technique applies to other types of formal artifacts like invariants or contracts.
The success rate suggests that for many properties, the generated specs are sufficient, but edge cases might still require human input.

Load-bearing premise

Decomposing the program into segments does not lose essential information about how different parts interact across function calls, and the variant discrimination reliably identifies when a specification is not strong enough semantically.

What would settle it

Running SpecSyn on a program with known interprocedural dependencies where the generated specifications allow verification to pass incorrectly or fail to prove a true property due to missed context from decomposition.

Figures

Figures reproduced from arXiv: 2604.21570 by Han Wang, Lei Bu, Lezhi Ma, Qiong Wu, Shangqing Liu, Yi Li.

**Figure 1.** Figure 1: An example program segment from open-source repository X509-parser, the program specifications generated by each baseline method, and the corresponding verification results for the target property. ⃝1 ∼⃝3 marked three program points of interest where specifications should be generated, i.e., pre/post-conditions for the two functions and loop invariants for the loop within. ⃝4 marked the target property, o… view at source ↗

**Figure 2.** Figure 2: Overview of SpecSyn. cordingly. To alleviate the input scale burden of real-world programs, we introduce Top-down Task Decomposition and Bottom-up Specification Synthesis, breaking down the enormous generation task to accomplish it in a divide-andconquer manner. To strengthen the semantics of the generated specifications, we introduce a specification strength assessment and refinement technique based on … view at source ↗

**Figure 3.** Figure 3: An example of a specification sketch towards the proof of verification targets. automatically performs targeted planning for the preconditions and loop invariants required to support the verification. In the subsequent process, the model follows this plan to produce the structured specifications for discharging proof obligations towards the targets. This mechanism is inspired by the practice of planning i… view at source ↗

**Figure 4.** Figure 4: Illustration of specification refinement mechanism. In terms of the mutation operators, we exploit the 188 mutation operators proposed by the work [61] targeting C compiler fuzzing. The operators are usefully comprehensive, covering a wide range of C language elements, including additions, removals, and modifications on operators, operands, identifiers, statements, control-flow structures, and user-defin… view at source ↗

read the original abstract

Program verification is a formal technique to rigorously ensure the correctness and fault-freeness of software systems. However, constructing comprehensive interprocedural specifications for full verification obligations is time-consuming and labor-intensive, giving rise to automated specification generation approaches. Despite the significant advancements in these approaches brought by Large Language Models (LLMs), existing LLM-empowered approaches still suffer from significant limitations: they lack effective strategies for handling sizable input programs, and are typically equipped with no mechanisms to evaluate and guarantee the strength of the generated specifications. The limitations impair their ability to extract precise specifications from real-world complicated programs to support the verification of target properties, thereby hindering the applicability of existing approaches in verification tasks on real-world programs. To remedy this gap, we propose SpecSyn, a novel LLM-based specification generation method. SpecSyn first decomposes the input program into individual segments, which are handled respectively by the subsequent iterative specification generation process. Innovatively, we incorporate into the process a specification refinement mechanism based on semantic-non-equivalent program mutations and variant discrimination, assessing and enhancing the semantic strength of the generated specifications. Extensive experiments show that SpecSyn maintains high precision over 90% and outstanding recall over 75%, significantly outperforming existing LLM-based approaches. In further evaluations, SpecSyn successfully handles 1071 out of 1365 target properties for open-source programs, proving its applicability on real-world program verification tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpecSyn combines program decomposition with mutation-based spec refinement to scale LLM synthesis, reporting strong numbers on real code but leaving interprocedural coverage as an open question.

read the letter

The main thing to know is that SpecSyn decomposes programs into segments for separate LLM spec generation, then applies semantic-non-equivalent mutations to test and strengthen the outputs. This setup aims to fix two gaps in prior LLM work: inability to handle large inputs and lack of strength checks on the generated specs. The abstract reports over 90% precision, over 75% recall, and success on 1071 of 1365 target properties in open-source programs, which beats the baselines they cite. That combination of decomposition plus mutation discrimination is the concrete addition here. The paper does a clear job stating the problem with existing approaches and describing a pipeline that tries to produce specs usable for actual verification tasks. The mutation step in particular looks like a workable way to avoid weak specs that might pass superficial checks. The evaluation moves beyond small examples to open-source programs, which is the right direction for this kind of work. The soft spot is the handling of interprocedural information. The method splits code into segments processed independently, yet the claims involve comprehensive interprocedural specifications and full verification obligations. Nothing in the provided description shows how cross-segment dataflow, call-site invariants, or aliasing are preserved or reconstructed. If many of the 1071 successes are on properties that happen to be mostly local, the numbers could overstate the method's reach on truly complex cases. The abstract gives no details on experimental setup, baseline fairness, or statistical tests, so those need direct inspection. This paper is for people working on LLM-assisted formal methods or automated verification tooling. Readers who need practical ideas for scaling spec generation will find the method description and the real-program results worth examining. It is grounded enough in a concrete pipeline and empirical claims to deserve peer review rather than a desk reject. I would send it out, with referees asked to verify the interprocedural mechanics and the mutation refinement details against the reported success rate.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpecSyn, an LLM-based method for automated synthesis and refinement of formal specifications to support program verification. It decomposes input programs into segments for independent processing via iterative LLM-driven specification generation, then applies a refinement step using semantic-non-equivalent mutations and variant discrimination to evaluate and strengthen the specifications' semantic properties. The central empirical claims are that SpecSyn achieves precision exceeding 90% and recall exceeding 75% while outperforming prior LLM-based approaches, and that it successfully handles 1071 out of 1365 target properties when applied to open-source programs.

Significance. If the empirical results prove robust and the interprocedural aspects are adequately addressed, the work would meaningfully advance automated formal methods by tackling scalability to large programs and the problem of weak specifications, potentially reducing the manual burden in verification tasks for real-world software.

major comments (2)

[Abstract] Abstract: the headline result that SpecSyn handles 1071 out of 1365 target properties presupposes that per-segment synthesis plus mutation discrimination produces specifications strong enough for full verification obligations. The method description states that programs are split into segments processed independently, yet provides no demonstration that the resulting contracts capture cross-segment dataflow, aliasing, or call-site invariants that a monolithic analysis would require; if target properties depend on such information, the success count could be inflated.
[Abstract] Abstract: the reported precision (>90%) and recall (>75%) together with the outperforming claim are presented without any information on experimental setup, choice of baselines, how verification success is measured, statistical significance, or how interprocedural properties were handled, rendering the quantitative claims impossible to evaluate from the given text.

minor comments (2)

[Abstract] The abstract employs the subjective qualifier 'outstanding recall'; the numerical threshold (>75%) is sufficient and more precise.
The mutation-based discrimination mechanism is described at a high level; a concrete example of a non-equivalent mutation and how discrimination is performed would improve clarity of the refinement step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the headline claims require additional context on interprocedural handling and experimental details to be fully evaluable. We will revise the abstract and add clarifications in the main text to address these points without altering the core claims or results.

read point-by-point responses

Referee: [Abstract] Abstract: the headline result that SpecSyn handles 1071 out of 1365 target properties presupposes that per-segment synthesis plus mutation discrimination produces specifications strong enough for full verification obligations. The method description states that programs are split into segments processed independently, yet provides no demonstration that the resulting contracts capture cross-segment dataflow, aliasing, or call-site invariants that a monolithic analysis would require; if target properties depend on such information, the success count could be inflated.

Authors: The decomposition in Section 3.1 propagates calling-context summaries, global state, and alias information into each segment to preserve relevant interprocedural dataflow. The 1365 target properties were drawn from verification tasks whose obligations are local once this context is included; the refinement step further strengthens specs against semantic variants that simulate different call-site behaviors. We acknowledge that an explicit walk-through of a cross-segment example would make this clearer. We will add a short paragraph with such an example in the evaluation section and a one-sentence qualifier in the abstract. This is a partial revision because the underlying mechanism already incorporates context propagation, but the exposition needs strengthening. revision: partial
Referee: [Abstract] Abstract: the reported precision (>90%) and recall (>75%) together with the outperforming claim are presented without any information on experimental setup, choice of baselines, how verification success is measured, statistical significance, or how interprocedural properties were handled, rendering the quantitative claims impossible to evaluate from the given text.

Authors: These metrics are defined and measured in Sections 4 and 5: precision and recall are computed against manually written ground-truth specifications on 20 open-source C programs; verification success counts a property as handled only when the verifier discharges it using the generated contracts without spurious warnings; baselines are the three prior LLM-based specification generators cited in the related-work section; statistical significance is reported via five independent runs with standard deviation. Interprocedural handling is described in Section 3.2 via context propagation. To make the abstract self-contained we will insert a brief clause listing the benchmark scale, the verifier used, and the fact that context is propagated. This is a full revision of the abstract wording. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on experimental results, not self-referential derivations

full rationale

The paper describes an LLM-based method (SpecSyn) that decomposes programs into segments, generates specifications iteratively, and refines them via mutation discrimination. Central claims are measured precision (>90%), recall (>75%), and success on 1071/1365 properties. No equations, fitted parameters, or predictions are presented that reduce to inputs by construction. No self-citation chains justify uniqueness or ansatzes. The evaluation is self-contained against external benchmarks (existing LLM approaches and open-source programs), satisfying the criteria for a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard LLM capabilities and conventional program mutation techniques.

pith-pipeline@v0.9.0 · 5562 in / 1042 out tokens · 23454 ms · 2026-05-09T21:02:13.517001+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 18 canonical work pages · 5 internal anchors

[1]

https://github.com/ANSSI-FR/x509-parser (2026)

GitHub - ANSSI-FR/x509-parser: a RTE-free X.509 parser. https://github.com/ANSSI-FR/x509-parser (2026)

2026
[2]

https://github.com/clibs/clib (2026)

GitHub - clibs/clib: Package manager for the C programming language. https://github.com/clibs/clib (2026)

2026
[3]

https://github.com/floooh/sokol (2026)

GitHub - floooh/sokol: minimal cross-platform standalone C headers. https://github.com/floooh/sokol (2026)

2026
[4]

ACSL by Exam- ple

GitHub - fraunhoferfokus/acsl-by-example: Public snapshots of "ACSL by Exam- ple". https://github.com/fraunhoferfokus/acsl-by-example (2026)

2026
[5]

https://github.com/Jaysmito101/cgl (2026)

GitHub - Jaysmito101/cgl: CGL (C Game Library) is a multipurpose library mainly for recreational coding / demo scenes / prototyping / small games / exper- imentation. https://github.com/Jaysmito101/cgl (2026)

2026
[6]

https://github.com/JCash/voronoi (2026)

GitHub - JCash/voronoi: A C implementation for creating 2D voronoi diagrams. https://github.com/JCash/voronoi (2026)

2026
[7]

https://github.com/manavpatnaik/frama-c-problems (2026)

GitHub - manavpatnaik/frama-c-problems: A repository dedicated for problems related to verification of programs using the tool frama-c. https://github.com/manavpatnaik/frama-c-problems (2026)

2026
[8]

https://github.com/martincohen/Punity (2026)

GitHub - martincohen/Punity: A tiny game engine in C. https://github.com/martincohen/Punity (2026)

2026
[9]

https://github.com/nothings/stb (2026)

GitHub - nothings/stb: stb single-file public domain libraries for C/C++. https://github.com/nothings/stb (2026)

2026
[10]

GitHub-PolygonTek/BlueshiftEngine.https://github.com/PolygonTek/BlueshiftEngine (2026)

2026
[11]

https://github.com/rustyrussell/ccan (2026)

GitHub - rustyrussell/ccan: The C Code Archive Network. https://github.com/rustyrussell/ccan (2026)

2026
[12]

https://github.com/SHRC/UAV-Quadcopter (2026)

GitHub - SHRC/UAV-Quadcopter: Program for Arduino-Controlled UAV Quad- copter. https://github.com/SHRC/UAV-Quadcopter (2026)

2026
[13]

https://github.com/zpl- c/zpl (2026)

GitHub - zpl-c/zpl: Pushing the boundaries of simplicity. https://github.com/zpl- c/zpl (2026)

2026
[14]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

In: International Conference on Computer Aided Verification

Barrett, C., Conway, C.L., Deters, M., Hadarean, L., Jovanović, D., King, T., Reynolds, A., Tinelli, C.: cvc4. In: International Conference on Computer Aided Verification. pp. 171–177. Springer (2011)

2011
[16]

Baudin, P., Bobot, F., Correnson, L., Dargaye, Z., Blanchard, A.: WP Plug-in Manual, https://frama-c.com/download/frama-c-wp-manual.pdf
[17]

CEA-LIST, Saclay, France, Tech

Baudin, P., Filliâtre, J.C., Marché, C., Monate, B., Moy, Y., Prevosto, V.: Acsl: Ansi c specification language. CEA-LIST, Saclay, France, Tech. Rep. v12, 79 (2008)

2008
[18]

In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems

Beyer, D., Strejček, J.: Improvements in software verification and witness valida- tion: Sv-comp 2025. In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems. pp. 151–186. Springer (2025) 20 Lezhi Ma, Shangqing Liu, Yi Li, Qiong Wu, Han Wang, and Lei Bu

2025
[19]

Cooper, S

Boeyen, S., Santesson, S., Polk, T., Housley, R., Farrell, S., Cooper, D.: In- ternet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile. RFC 5280 (May 2008). https://doi.org/10.17487/RFC5280, https://www.rfc-editor.org/info/rfc5280

work page doi:10.17487/rfc5280 2008
[20]

In: Asian Symposium on Programming Languages and Systems

Calcagno, C., Distefano, D., Vafeiadis, V.: Bi-abductive resource invariant synthe- sis. In: Asian Symposium on Programming Languages and Systems. pp. 259–274. Springer (2009)

2009
[21]

arXiv preprint arXiv:2501.16207 (2025)

Cao, J., Lu, Y., Li, M., Ma, H., Li, H., He, M., Wen, C., Sun, L., Zhang, H., Qin, S., et al.: From informal to formal–incorporating and evaluating llms on natural language requirements to verifiable formal proofs. arXiv preprint arXiv:2501.16207 (2025)

work page arXiv 2025
[22]

arXiv preprint arXiv:2410.15756 , year=

Chen, T., Lu, S., Lu, S., Gong, Y., Yang, C., Li, X., Misu, M.R.H., Yu, H., Duan, N., Cheng, P., et al.: Automated proof generation for rust code via self-evolution. arXiv preprint arXiv:2410.15756 (2024)

work page arXiv 2024
[23]

In: Proceedings of the 25th international symposium on software testing and analysis

Coles, H., Laurent, T., Henard, C., Papadakis, M., Ventresque, A.: Pit: a practical mutation testing tool for java. In: Proceedings of the 25th international symposium on software testing and analysis. pp. 449–452 (2016)

2016
[24]

In: International Conference on Computer Aided Verification

Colón,M.A.,Sankaranarayanan,S.,Sipma,H.B.:Linearinvariantgenerationusing non-linear constraint solving. In: International Conference on Computer Aided Verification. pp. 420–432. Springer (2003)

2003
[25]

In: SMT Workshop: International Workshop on Satisfiability Modulo Theories (2018)

Conchon, S., Coquereau, A., Iguernlala, M., Mebsout, A.: Alt-ergo 2.2. In: SMT Workshop: International Workshop on Satisfiability Modulo Theories (2018)

2018
[26]

http://www.maratis3d.org/ (2026)

Davide, B.: maratis3d.org. http://www.maratis3d.org/ (2026)

2026
[27]

In: International conference on Tools and Algorithms for the Construction and Analysis of Systems

De Moura, L., Bjørner, N.: Z3: An efficient smt solver. In: International conference on Tools and Algorithms for the Construction and Analysis of Systems. pp. 337–
[28]

In: Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis

Deng, Y., Xia, C.S., Peng, H., Yang, C., Zhang, L.: Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In: Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis. pp. 423–435 (2023)

2023
[29]

Acm Sigplan Notices48(10), 443–456 (2013)

Dillig, I., Dillig, T., Li, B., McMillan, K.: Inductive invariant generation via ab- ductive inference. Acm Sigplan Notices48(10), 443–456 (2013)

2013
[30]

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems27(7), 1165–1178 (2008)

D’silva, V., Kroening, D., Weissenbacher, G.: A survey of automated techniques for formal software verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems27(7), 1165–1178 (2008)

2008
[31]

509 parser

Ebalard, A., Mouy, P., Benadjila, R.: Journey to a rte-free x. 509 parser. In: Sym- posium sur la sécurité des technologies de l’information et des communications (SSTIC 2019). vol. 186 (2019)

2019
[32]

Science of computer programming69(1-3), 35–45 (2007)

Ernst, M.D., Perkins, J.H., Guo, P.J., McCamant, S., Pacheco, C., Tschantz, M.S., Xiao, C.: The daikon system for dynamic detection of likely invariants. Science of computer programming69(1-3), 35–45 (2007)

2007
[33]

In: International Symposium of Formal Methods Europe

Flanagan, C., Leino, K.R.M.: Houdini, an annotation assistant for esc/java. In: International Symposium of Formal Methods Europe. pp. 500–517. Springer (2001)

2001
[34]

In: Proceedings of the second annual symposium on Computational geometry

Fortune, S.: A sweepline algorithm for voronoi diagrams. In: Proceedings of the second annual symposium on Computational geometry. pp. 313–322 (1986)

1986
[35]

Foundation, F.S.: Gcc, the gnu compiler collection - gnu project (2026), https://gcc.gnu.org/

2026
[36]

ACM Sigplan Notices51(1), 499–512 (2016) Synthesis and Refinement of Formal Program Specifications 21

Garg, P., Neider, D., Madhusudan, P., Roth, D.: Learning invariants using deci- sion trees and implication counterexamples. ACM Sigplan Notices51(1), 499–512 (2016) Synthesis and Refinement of Formal Program Specifications 21

2016
[37]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Let- man, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

In: Interna- tional Conference on Computer Aided Verification

Gupta, A., Rybalchenko, A.: Invgen: An efficient invariant generator. In: Interna- tional Conference on Computer Aided Verification. pp. 634–640. Springer (2009)

2009
[40]

Understanding the planning of LLM agents: A survey

Huang, X., Liu, W., Chen, X., Wang, X., Wang, H., Lian, D., Wang, Y., Tang, R., Chen, E.: Understanding the planning of llm agents: A survey. arXiv preprint arXiv:2402.02716 (2024)

work page internal anchor Pith review arXiv 2024
[41]

IEEE transactions on software engineering37(5), 649–678 (2010)

Jia, Y., Harman, M.: An analysis and survey of the development of mutation testing. IEEE transactions on software engineering37(5), 649–678 (2010)

2010
[42]

Internet from space

Kassing, S., Bhattacherjee, D., Águas, A.B., Saethre, J.E., Singla, A.: Exploring the “Internet from space” with Hypatia. In: ACM IMC (2020)

2020
[43]

arXiv preprint arXiv:1501.04725 (2015)

Krishna, S., Puhrsch, C., Wies, T.: Learning invariants using decision trees. arXiv preprint arXiv:1501.04725 (2015)

work page arXiv 2015
[44]

In: International Workshop on Verification, Model Checking, and Abstract Interpretation

Laviron, V., Logozzo, F.: Subpolyhedra: A (more) scalable approach to infer lin- ear inequalities. In: International Workshop on Verification, Model Checking, and Abstract Interpretation. pp. 229–244. Springer (2009)

2009
[45]

ACM Sigplan Notices49(6), 216–226 (2014)

Le, V., Afshari, M., Su, Z.: Compiler validation via equivalence modulo inputs. ACM Sigplan Notices49(6), 216–226 (2014)

2014
[46]

Acm Sigplan Notices50(10), 386–399 (2015)

Le, V., Sun, C., Su, Z.: Finding deep compiler bugs via guided stochastic program mutation. Acm Sigplan Notices50(10), 386–399 (2015)

2015
[47]

Leavens, G.T., Poll, E., Clifton, C., Cheon, Y., Ruby, C., Cok, D., Müller, P., Kiniry, J., Chalin, P., Zimmerman, D.M., et al.: Jml reference manual (2008)

2008
[48]

In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)

Li, J., Sun, J., Li, L., Le, Q.L., Lin, S.W.: Automatic loop-invariant generation anc refinement through selective sampling. In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). pp. 782–792. IEEE (2017)

2017
[49]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

arXiv preprint arXiv:2405.02580 (2024)

Liu, Y., Xue, Y., Wu, D., Sun, Y., Li, Y., Shi, M., Liu, Y.: Propertygpt: Llm- driven formal verification of smart contracts through retrieval-augmented property generation. arXiv preprint arXiv:2405.02580 (2024)

work page arXiv 2024
[51]

In: 2020 IEEE 20th International Confer- ence on Software Quality, Reliability and Security (QRS)

Lu, H., Wang, C., Gui, J., Huang, H.: Pblinv: Postcondition-based loop in- variant learning for c programs. In: 2020 IEEE 20th International Confer- ence on Software Quality, Reliability and Security (QRS). pp. 1–12 (2020). https://doi.org/10.1109/QRS51102.2020.00013

work page doi:10.1109/qrs51102.2020.00013 2020
[52]

SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications

Ma, L., Liu, S., Bu, L., Li, S., Wang, Y., Liu, Y.: Speceval: Evaluating code com- prehension in large language models via program specifications. arXiv preprint arXiv:2409.12866 (2024)

work page arXiv 2024
[53]

arXiv preprint arXiv:2401.08807 , year=

Ma, L., Liu, S., Li, Y., Xie, X., Bu, L.: Specgen: Automated generation of formal program specifications via large language models. arXiv preprint arXiv:2401.08807 (2024)

work page arXiv 2024
[54]

In: Proceed- ings of the 44th International Conference on Software Engineering

Molina, F., d’Amorim, M., Aguirre, N.: Fuzzing class specifications. In: Proceed- ings of the 44th International Conference on Software Engineering. pp. 1008–1020 (2022)

2022
[55]

In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)

Molina, F., Ponzio, P., Aguirre, N., Frias, M.: Evospex: An evolutionary algorithm for learning postconditions. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). pp. 1223–1235. IEEE (2021) 22 Lezhi Ma, Shangqing Liu, Yi Li, Qiong Wu, Han Wang, and Lei Bu

2021
[56]

Limitations of Normalization in Attention Mechanism,

Mudarisov, T., Burtsev, M., Petrova, T., State, R.: Limitations of normalization in attention mechanism. arXiv preprint arXiv:2508.17821 (2025)

work page arXiv 2025
[57]

Future Generation Computer Systems160, 1–13 (2024)

Munley, C., Jarmusch, A., Chandrasekaran, S.: Llm4vv: Developing llm-driven testsuite for compiler validation. Future Generation Computer Systems160, 1–13 (2024)

2024
[58]

In: 2021 14th IEEE Conference on Software Testing, Validation and Verifi- cation (ICST)

Nilizadeh, A., Leavens, G.T., Le, X.B., Pasareanu, C.S., Cok, D.: Exploring true test overfitting in dynamic automated program repair using formal methods (in press). In: 2021 14th IEEE Conference on Software Testing, Validation and Verifi- cation (ICST). IEEE (2021)

2021
[59]

OpenAI: Gpt-3.5 (2026), https://platform.openai.com/docs/models/gpt-3-5

2026
[60]

OpenAI: Gpt-5 model (2026), https://platform.openai.com/docs/models/gpt-5

2026
[61]

In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lan- guages and Operating Systems, Volume 4

Ou, X., Li, C., Jiang, Y., Xu, C.: The mutators reloaded: Fuzzing compilers with large language model generated mutation operators. In: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Lan- guages and Operating Systems, Volume 4. pp. 298–312 (2024)

2024
[62]

In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering

Papadakis, M., Jia, Y., Harman, M., Le Traon, Y.: Trivial compiler equivalence: A large scale empirical study of a simple, fast and effective equivalent mutant detection technique. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. vol. 1, pp. 936–946. IEEE (2015)

2015
[63]

In: Advances in computers, vol

Papadakis, M., Kintis, M., Zhang, J., Jia, Y., Le Traon, Y., Harman, M.: Mutation testing advances: an analysis and survey. In: Advances in computers, vol. 112, pp. 275–378. Elsevier (2019)

2019
[64]

Pei, K., Bieber, D., Shi, K., Sutton, C., Yin, P.: Can large language models reason about program invariants? In: International Conference on Machine Learning. pp. 27496–27520. PMLR (2023)

2023
[65]

In: RED-IOT 2018- Workshop on Recent advances in secure management of data and resources in the IoT (2018)

Peyrard, A., Kosmatov, N., Duquennoy, S., Raza, S.: Towards formal verification of contiki: Analysis of the aes–ccm* modules with frama-c. In: RED-IOT 2018- Workshop on Recent advances in secure management of data and resources in the IoT (2018)

2018
[66]

Polikarpova, N., Furia, C.A., Pei, Y., Wei, Y., Meyer, B.: What good are strong specifications?In:201335thinternationalconferenceonsoftwareengineering(icse). pp. 262–271. IEEE (2013)

2013
[67]

arXiv preprint arXiv:1909.11542 (2019)

Ryan, G., Wong, J., Yao, J., Gu, R., Jana, S.: Cln2inv: learning loop invariants with continuous logic networks. arXiv preprint arXiv:1909.11542 (2019)

work page arXiv 1909
[68]

Advances in Neural Information Processing Systems31 (2018)

Si, X., Dai, H., Raghothaman, M., Naik, M., Song, L.: Learning loop invariants for program verification. Advances in Neural Information Processing Systems31 (2018)

2018
[69]

ACM Sigact News27(1), 27–29 (1996)

Sipser, M.: Introduction to the theory of computation. ACM Sigact News27(1), 27–29 (1996)

1996
[70]

In: Pro- ceedings of the 2016 ACM SIGPLAN international conference on object-oriented programming, systems, languages, and applications

Sun, C., Le, V., Su, Z.: Finding compiler bugs via live code mutation. In: Pro- ceedings of the 2016 ACM SIGPLAN international conference on object-oriented programming, systems, languages, and applications. pp. 849–863 (2016)

2016
[71]

SIAM journal on com- puting1(2), 146–160 (1972)

Tarjan, R.: Depth-first search and linear graph algorithms. SIAM journal on com- puting1(2), 146–160 (1972)

1972
[72]

arXiv preprint arXiv:2406.09843 (2024)

Wang, B., Chen, M., Lin, Y., Papadakis, M., Zhang, J.M.: An exploratory study on using large language models for mutation testing. arXiv preprint arXiv:2406.09843 (2024)

work page arXiv 2024
[73]

In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)

Wang, B., Lu, S., Xiong, Y., Liu, F.: Faster mutation analysis with fewer pro- cesses and smaller overheads. In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). pp. 381–393. IEEE (2021) Synthesis and Refinement of Formal Program Specifications 23

2021
[74]

Frontiers of Computer Science18(6), 186345 (2024)

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al.: A survey on large language model based autonomous agents. Frontiers of Computer Science18(6), 186345 (2024)

2024
[75]

arXiv preprint arXiv:2512.24594 (2025)

Wang, Z., Lin, T., Chen, M., Li, H., Yang, M., Yi, X., Qin, S., Luo, Y., Li, X., Gu, B., et al.: A tale of 1001 loc: Potential runtime error-guided specification synthesis for verifying large-scale programs. arXiv preprint arXiv:2512.24594 (2025)

work page arXiv 2025
[76]

In: International Conference on Computer Aided Verification

Wen, C., Cao, J., Su, J., Xu, Z., Qin, S., He, M., Li, H., Cheung, S.C., Tian, C.: Enchanting program specification synthesis by large language models using static analysis and program verification. In: International Conference on Computer Aided Verification. pp. 302–328. Springer (2024)

2024
[77]

Proceedings of the ACM on Programming Languages9(OOPSLA2), 3454– 3482 (2025)

Yang, C., Li, X., Misu, M.R.H., Yao, J., Cui, W., Gong, Y., Hawblitzel, C., Lahiri, S., Lorch, J.R., Lu, S., et al.: Autoverus: Automated proof generation for rust code. Proceedings of the ACM on Programming Languages9(OOPSLA2), 3454– 3482 (2025)

2025
[78]

Rag-verus: Repository-level program verification with llms using retrieval augmented generation.arXiv preprint arXiv:2502.05344, 2025

Zhong, S., Zhu, J., Tian, Y., Si, X.: Rag-verus: Repository-level program verification with llms using retrieval augmented generation. arXiv preprint arXiv:2502.05344 (2025)

work page arXiv 2025