pith. sign in

arxiv: 2605.17914 · v1 · pith:3NOGSDW2new · submitted 2026-05-18 · 💻 cs.PL

Guiding LLM-based Loop Invariant Synthesis via Feedback on Local Reasoning Errors

Pith reviewed 2026-05-20 00:50 UTC · model grok-4.3

classification 💻 cs.PL
keywords loop invariant synthesislarge language modelsreasoning feedbackformal verificationprogram analysisC programsnon-linear invariants
0
0 comments X

The pith

Verifying an LLM's natural language reasoning steps for logical flaws enables effective loop invariant refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that improves LLM performance on loop invariant synthesis by examining the model's own explanations rather than just its final guesses. It prompts the LLM to justify in natural language why a candidate invariant satisfies the program conditions, then converts those justifications into first-order logic implications that can be checked automatically. Any invalid implication reveals the exact error in the reasoning chain. This error is packaged into targeted feedback that the LLM uses to produce a better invariant on the next try. The method reached 93.1 percent success on 460 C programs and remained effective on a separate set of 50 programs that require non-linear invariants.

Core claim

By formally verifying the LLM's own step-by-step natural language reasoning for failed verification conditions and feeding back exact logical flaws, the framework enables effective refinement of loop invariants, solving 445 out of 460 programs (93.1% success) on the main benchmark and showing robustness on non-linear cases.

What carries the argument

The translation of the LLM's natural language reasoning steps into first-order logic implications, which are automatically checked to identify invalid ones that pinpoint the precise flaw in the original thinking.

If this is right

  • Targeted feedback on specific logical flaws allows refinement without repeated random guesses.
  • The approach maintains high success rates even on programs whose invariants involve non-linear arithmetic.
  • Each iteration becomes more directed because the feedback names the exact mistaken implication in the prior attempt.
  • The same verification-of-reasoning pattern can be reused whenever an LLM must produce objects that are later checked by a formal tool.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-error feedback technique could be applied to LLM-generated proofs or test cases in other formal methods settings.
  • Hybrid systems that combine this method with classical invariant generators might reduce the number of LLM calls needed.
  • The results suggest that LLMs can be made more reliable for verification tasks by treating their explanations as first-class objects to be checked.

Load-bearing premise

The translation of the LLM's natural language reasoning steps into first-order logic implications is accurate enough that an invalid implication reliably identifies the true logical flaw rather than introducing translation artifacts.

What would settle it

Running the system on the same benchmarks after replacing the translation step with a version known to distort the original reasoning and observing whether the success rate falls sharply would test the claim.

Figures

Figures reproduced from arXiv: 2605.17914 by Junhao Liu, Peng Di, Tianchi Li, Xin Zhang, Zhenyu Yan.

Figure 2
Figure 2. Figure 2: Loop invariants proposed by the LLM. /*@ loop invariant i1: a >= -j + 1 && a <= j - 1; loop invariant i2: j >= 1; loop invariant i3: j <= m + 1; loop invariant i4: m > 0; */ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: Example program. /*@ loop invariant i1: a >= -(j - 1) && a <= (j - 1); loop invariant i2: j >= 1; loop invariant i3: m > 0; */ [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 4
Figure 4. Figure 4: A local reasoning step of the natural language proof given by the LLM. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The corresponding formalized proof of the local reasoning step. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt for the Synthesizer LLM to generate initial invariants. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The structure of the natural language proof. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt for the Synthesizer LLM to generate structured natural language proof. [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The Prompt Used for Formalization (Part 1 / 2) [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The Prompt Used for Formalization (Part 2 / 2) [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The Prompt Used for Feedback. clauses, which are then logically combined via conjunction or disjunction. Like LaM4Inv, it relies on counterexample-based feedback to guide the LLM to generate new clauses. Our approach diverges from these baselines in two fundamental ways. First, while both baselines rely on coarse-grained, counterexample-based feedback, our approach provides structured and detailed feedbac… view at source ↗
Figure 13
Figure 13. Figure 13: Correct loop invariants for the program. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Loop invariants proposed by the LLM. 7.1 Failure Analysis For the benchmarks that our method failed to solve in the evaluation of Section 6, we conducted a manual inspection to better understand the limitations of our approach and get clues for the future work. We found that a small subset of failures resulted from tool-specific limitations, such as Frama-C’s handling of specific C features (e.g., unsigne… view at source ↗
Figure 16
Figure 16. Figure 16: Correct loop invariants for the program. [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Loop invariants proposed by the LLM. negative. This highlights a limitation of our method: while our method effectively repairs local reasoning errors, it cannot currently prompt the LLM to re-think the global reasoning of the problem. However, as LLMs become more and more powerful, our method will be more helpful to leverage their abilities, as shown in our evaluation. 7.1.2 Ineffective Refinement. The s… view at source ↗
Figure 18
Figure 18. Figure 18: Example program with nested loops. /*@ loop invariant a == p * x + r * y; loop invariant b == q * x + s * y; */ [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Loop invariants for the outer loop proposed [PITH_FULL_IMAGE:figures/full_fig_p025_19.png] view at source ↗
read the original abstract

We propose a novel framework that provides constructive feedback to an LLM in the "guess-and-check" paradigm by formally verifying its own thinking process and detecting local reasoning errors. We apply this framework to the loop invariant synthesis problem. We prompt the model to produce a step-by-step natural language proof justifying its thinking process for the failed verification condition of its generated loop invariants. Then, we use an LLM to translate the reasoning steps into first-order logic implications, which can be checked automatically. An invalid implication pinpoints the exact logical flaw in the LLM's thinking process, which we then use to construct targeted feedback for refinement. We have implemented our approach in a tool called LORIS and evaluated it on a main benchmark suite of 460 C programs and an additional benchmark suite of 50 C programs each of which involves non-linear properties. On the main benchmark suite, LORIS solved 445 of the programs, and achieved an overall success rate of $93.1\%$. LORIS also demonstrates robustness on the challenging non-linear benchmark suite.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LORIS, a framework for guiding LLM-based loop invariant synthesis by detecting local reasoning errors through formal verification of the LLM's natural language thinking process. The approach involves prompting the LLM for step-by-step natural language justifications for failed verification conditions, translating these into first-order logic implications using another LLM, checking their validity, and providing targeted feedback based on invalid implications to refine the invariants. It evaluates this on 460 C programs achieving 93.1% success and on 50 non-linear programs showing robustness.

Significance. If the translation from natural language reasoning to FOL implications is faithful, the work offers a promising direction for improving LLM reliability in formal methods by enabling targeted self-correction of logical flaws during invariant generation. The reported 93.1% success rate on the main benchmark of 460 programs and robustness on the non-linear suite represent a strong empirical result that could influence automated program verification tools.

major comments (2)
  1. [Framework (Section 3)] The central mechanism translates LLM-generated natural language reasoning steps into FOL implications for validity checking, but the manuscript provides no quantitative validation (e.g., fidelity metrics or human audits) of this translation step. This is load-bearing for the claim that invalid implications reliably identify true logical flaws rather than translation artifacts.
  2. [Evaluation (Section 4)] The evaluation reports solving 445 of 460 programs (93.1% success) on the main benchmark and robustness on the non-linear suite, but includes no error bars, statistical significance tests, or details on the number of LLM attempts or queries permitted per program. This weakens assessment of whether the results are robust or sensitive to experimental setup.
minor comments (1)
  1. [Implementation details] Clarify the exact prompt templates used for the natural language reasoning step and the translation step to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment below in detail and describe the specific revisions we will make to the next version of the paper.

read point-by-point responses
  1. Referee: [Framework (Section 3)] The central mechanism translates LLM-generated natural language reasoning steps into FOL implications for validity checking, but the manuscript provides no quantitative validation (e.g., fidelity metrics or human audits) of this translation step. This is load-bearing for the claim that invalid implications reliably identify true logical flaws rather than translation artifacts.

    Authors: We agree that the absence of quantitative validation for the natural-language-to-FOL translation step is a limitation in the current manuscript. While the overall success rate on the benchmark provides indirect support for the approach, it does not directly measure translation fidelity. In the revised version we will add a dedicated subsection (likely 3.4) that reports the results of a human audit performed on a random sample of 100 translation instances. The audit will include fidelity metrics such as the percentage of translations judged accurate by two independent reviewers, along with inter-annotator agreement. We will also discuss cases where translation errors occurred and how they were handled. revision: yes

  2. Referee: [Evaluation (Section 4)] The evaluation reports solving 445 of 460 programs (93.1% success) on the main benchmark and robustness on the non-linear suite, but includes no error bars, statistical significance tests, or details on the number of LLM attempts or queries permitted per program. This weakens assessment of whether the results are robust or sensitive to experimental setup.

    Authors: We acknowledge that the current evaluation section lacks statistical rigor and implementation details that would allow readers to assess robustness. In the revision we will expand Section 4 to include: (1) error bars computed over five independent runs with different random seeds for the LLM calls, (2) results of paired statistical significance tests (e.g., McNemar’s test) against the strongest baseline, and (3) an explicit description of the experimental budget, including the maximum number of LLM queries and refinement attempts permitted per program. These additions will be presented in both the main text and a new appendix table. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical success measured on external benchmarks

full rationale

The paper's central result is an empirical success rate (445/460 programs, 93.1%) obtained by implementing the LORIS tool and running it on a standard benchmark suite of C programs plus a separate non-linear suite. The method involves prompting an LLM for natural-language reasoning, translating steps to FOL implications via another LLM call, checking validity, and generating feedback. This workflow is evaluated directly against external verification benchmarks rather than deriving any quantity from fitted parameters, self-definitions, or a chain of self-citations. No equations or uniqueness theorems are invoked that reduce the reported performance to the method's own inputs by construction. The outcome remains independently falsifiable on the cited benchmark suites.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that LLMs can produce usable natural-language proofs and that an auxiliary LLM can faithfully translate those proofs into FOL without introducing new errors. No free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption An LLM can generate a step-by-step natural language proof that justifies its thinking process for a failed verification condition.
    Stated in the abstract as the first step of the framework.
  • domain assumption Translation of natural-language reasoning steps into first-order logic implications preserves the original logical content sufficiently for error detection.
    Required for the automatic checking step to correctly identify flaws.

pith-pipeline@v0.9.0 · 5718 in / 1439 out tokens · 42619 ms · 2026-05-20T00:50:05.998183+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

  1. [1]

    Timos Antopoulos, Andrew Ruef, and Michael Hicks. 2016. A Counterexample-guided Approach to Finding Numerical Invariants. (2016)

  2. [2]

    Haniel Barbosa, Clark Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, Alex Ozdemir, Mathias Preiner, Andrew Reynolds, Ying Sheng, Cesare Tinelli, and Yoni Zohar. 2022. cvc5: A Versatile and Industrial-Strength SMT Solver. InTools and Algorithms for the Construction and An...

  3. [3]

    Bruno Barras, Samuel Boutin, Cristina Cornes, Judicaël Courant, Jean-Christophe Filliâtre, Eduardo Giménez, Hugo Her- belin, Gérard Huet, César Muñoz, Chetan Murthy, Catherine Parent-vigouroux, Christine Paulin-Mohring, Amokrane Saïbi, and Benjamin Werner. 1997. The Coq Proof Assistant Reference Manual : Version 6.1. (06 1997)

  4. [4]

    Bruno Blanchet, Patrick Cousot, Radhia Cousot, Jérome Feret, Laurent Mauborgne, Antoine Miné, David Monniaux, and Xavier Rival. 2003. A static analyzer for large safety-critical software.SIGPLAN Not.38, 5 (May 2003), 196–207. doi:10.1145/780822.781153

  5. [5]

    Weining Cao, Guangyuan Wu, Tangzhi Xu, Yuan Yao, Hengfeng Wei, Taolue Chen, and Xiaoxing Ma. 2025. Clause2Inv: A Generate-Combine-Check Framework for Loop Invariant Inference.Proc. ACM Softw. Eng.2, ISSTA, Article ISSTA045 (June 2025), 22 pages. doi:10.1145/3728920

  6. [6]

    Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, and Zhouchen Lin. 2025. Empowering LLMs with Logical Reasoning: A Comprehensive Survey. arXiv:2502.15652 [cs.AI] https://arxiv.org/abs/2502.15652 ACM Trans. Program. Lang. Syst., Vol. 48, No. 2, Article 8. Publication date: May 2026. Guiding LLM-based Loop Invariant Synthesis via Feedb...

  7. [7]

    Patrick Cousot and Radhia Cousot. 1977. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. InProceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages(Los Angeles, California)(POPL ’77). Association for Computing Machinery, New York, NY, USA, 238–...

  8. [8]

    Pascal Cuoq, Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2012. Frama-C. InSoftware Engineering and Formal Methods, George Eleftherakis, Mike Hinchey, and Mike Holcombe (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 233–247

  9. [9]

    Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: an efficient SMT solver. InProceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems(Budapest, Hungary)(TACAS’08/ETAPS’08). Springer-Verlag, Berlin, Heidelberg, 337–340

  10. [10]

    Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean Theorem Prover (System Description). InAutomated Deduction - CADE-25, Amy P. Felty and Aart Middeldorp (Eds.). Springer International Publishing, Cham, 378–388

  11. [11]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  12. [12]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  13. [13]

    Ernst, Jake Cockrell, William G

    Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. 1999. Dynamically discovering likely program invariants to support program evolution. InProceedings of the 21st International Conference on Software Engineering(Los Angeles, California, USA)(ICSE ’99). Association for Computing Machinery, New York, NY, USA, 213–224. doi:10.1145/302405.302467

  14. [14]

    Ernst, Adam Czeisler, William G

    Michael D. Ernst, Adam Czeisler, William G. Griswold, and David Notkin. 2000. Quickly detecting relevant program invariants. InProceedings of the 22nd International Conference on Software Engineering(Limerick, Ireland)(ICSE ’00). Association for Computing Machinery, New York, NY, USA, 449–458. doi:10.1145/337180.337240

  15. [15]

    Madhusudan, and Daniel Neider

    Pranav Garg, Christof Löding, P. Madhusudan, and Daniel Neider. 2014. ICE: A Robust Framework for Learning Invari- ants. InComputer Aided Verification, Armin Biere and Roderick Bloem (Eds.). Springer International Publishing, Cham, 69–87

  16. [16]

    C. A. R. Hoare. 1969. An axiomatic basis for computer programming.Commun. ACM12, 10 (Oct. 1969), 576–580. doi:10.1145/363235.363259

  17. [17]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Trans. Inf. Syst.43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/ 3703155

  18. [18]

    Ranjit Jhala and K. L. McMillan. 2006. A practical and complete approach to predicate refinement. InProceedings of the 12th International Conference on Tools and Algorithms for the Construction and Analysis of Systems(Vienna, Austria) (TACAS’06). Springer-Verlag, Berlin, Heidelberg, 459–473. doi:10.1007/11691372_33

  19. [19]

    Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. 2023. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=SMa9EAovKMC

  20. [20]

    Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma

    Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma. 2023. Finding Inductive Loop Invariants using Large Language Models. arXiv:2311.07948 [cs.PL] https://arxiv.org/abs/2311.07948

  21. [21]

    Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. 2024. When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs.Transactions of the Association for Computational Linguistics 12 (2024), 1417–1440. doi:10.1162/tacl_a_00713

  22. [22]

    Chang Liu, Xiwei Wu, Yuan Feng, Qinxiang Cao, and Junchi Yan. 2024. Towards general loop invariant generation: a benchmark of programs with memory manipulation. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 4101, 26 pages

  23. [23]

    Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof Automation with Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering(Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New York, NY, USA, 1509–1520. doi:10.1145/3691620.3695521

  24. [24]

    Kumar Madhukar, Björn Wachter, Daniel Kroening, Matt Lewis, and Mandayam Srivas. 2015. Accelerating invariant generation. InProceedings of the 15th Conference on Formal Methods in Computer-Aided Design(Austin, Texas)(FMCAD ’15). FMCAD Inc, Austin, Texas, 105–111

  25. [25]

    McMillan

    Kenneth L. McMillan. 2010. Lazy annotation for program testing and verification. InProceedings of the 22nd International Conference on Computer Aided Verification(Edinburgh, UK)(CA V’10). Springer-Verlag, Berlin, Heidelberg, 104–118. doi:10.1007/978-3-642-14295-6_10

  26. [26]

    Tobias Nipkow, Markus Wenzel, and Lawrence C. Paulson. 2002.Isabelle/HOL: a proof assistant for higher-order logic. Springer-Verlag, Berlin, Heidelberg

  27. [27]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

  28. [28]

    Saswat Padhi, Rahul Sharma, and Todd Millstein. 2016. Data-driven precondition inference with learned features. SIGPLAN Not.51, 6 (June 2016), 42–56. doi:10.1145/2980983.2908099

  29. [29]

    Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. 2023. Can large language models reason about program invariants?. InProceedings of the 40th International Conference on Machine Learning(Honolulu, Hawaii, USA)(ICML’23). JMLR.org, Article 1144, 25 pages

  30. [31]

    Gabriel Ryan, Justin Wong, Jianan Yao, Ronghui Gu, and Suman Jana. 2019. CLN2INV: Learning Loop Invariants with Continuous Logic Networks. arXiv:1909.11542 [cs.LG] https://arxiv.org/abs/1909.11542

  31. [32]

    Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, and Le Song. 2018. Learning Loop Invariants for Program Verification. InAdvances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada(2018), Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kriste...

  32. [33]

    Xujie Si, Aaditya Naik, Hanjun Dai, Mayur Naik, and Le Song. 2020. Code2Inv: A Deep Learning Framework for Program Verification. InComputer Aided Verification - 32nd International Conference, CA V 2020, Los Angeles, CA, USA, July 21-24, 2020, Proceedings, Part II(2020)(Lecture Notes in Computer Science, Vol. 12225). Springer, 151–164. doi:10.1007/978-3-03...

  33. [34]

    Xujie Si, Aaditya Naik, Hanjun Dai, Mayur Naik, and Le Song. 2020. Code2Inv: A Deep Learning Framework for Program Verification. InComputer Aided Verification, Shuvendu K. Lahiri and Chao Wang (Eds.). Springer International Publishing, Cham, 151–164

  34. [35]

    Fabio Somenzi and Aaron R. Bradley. 2011. IC3: where monolithic and incremental meet. InProceedings of the International Conference on Formal Methods in Computer-Aided Design(Austin, Texas)(FMCAD ’11). FMCAD Inc, Austin, Texas, 3–8

  35. [36]

    2025.Competition on software verification

    SVCOMP. 2025.Competition on software verification. https://sv-comp.sosy-lab.org

  36. [37]

    Quang-Trung Ta, Ton Chanh Le, Siau-Cheng Khoo, and Wei-Ngan Chin. 2019. Automated mutual induction proof in separation logic.Form. Asp. Comput.31, 2 (April 2019), 207–230. doi:10.1007/s00165-018-0471-5

  37. [38]

    Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian

  38. [39]

    An extension of HybridSynchAADL and its application to collaborating au- tonomous UA Vs

    Enchanting Program Specification Synthesis by Large Language Models Using Static Analysis and Program Verification. InComputer Aided Verification: 36th International Conference, CA V 2024, Montreal, QC, Canada, July 24–27, ACM Trans. Program. Lang. Syst., Vol. 48, No. 2, Article 8. Publication date: May 2026. 8:32 Tianchi Li, Zhenyu Yan, Junhao Liu, Peng ...

  39. [40]

    Guangyuan Wu, Weining Cao, Yuan Yao, Hengfeng Wei, Taolue Chen, and Xiaoxing Ma. 2024. LLM Meets Bounded Model Checking: Neuro-symbolic Loop Invariant Inference. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering(Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY, USA, 406–417. doi:1...

  40. [41]

    Haoze Wu, Clark Barrett, and Nina Narodytska. 2024. Lemur: Integrating Large Language Models in Automated Program Verification. InThe Twelfth International Conference on Learning Representations. https://openreview.net/ forum?id=Q3YaCghZNt

  41. [42]

    Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy

    Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy

  42. [43]

    Autoformalization with Large Language Models. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). http: //papers.nips.cc/paper_files/pape...

  43. [44]

    Jianan Yao, Gabriel Ryan, Justin Wong, Suman Jana, and Ronghui Gu. 2020. Learning nonlinear loop invariants with gated continuous logic networks. InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation(London, UK)(PLDI 2020). Association for Computing Machinery, New York, NY, USA, 106–120. doi:10.1145/3385412.3385986

  44. [45]

    Shiwen Yu, Ting Wang, and Ji Wang. 2023. Loop Invariant Inference through SMT Solving Enhanced Reinforcement Learning. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis(Seattle, WA, USA)(ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 175–187. doi:10.1145/3597926.3598047

  45. [46]

    Jin Peng Zhou, Charles E Staats, Wenda Li, Christian Szegedy, Kilian Q Weinberger, and Yuhuai Wu. 2024. Don’t Trust: Verify – Grounding LLM Quantitative Reasoning with Autoformalization. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=V5tdi14ple

  46. [47]

    He Zhu, Stephen Magill, and Suresh Jagannathan. 2018. A data-driven CHC solver.SIGPLAN Not.53, 4 (June 2018), 707–721. doi:10.1145/3296979.3192416 ACM Trans. Program. Lang. Syst., Vol. 48, No. 2, Article 8. Publication date: May 2026