pith. sign in

arxiv: 2605.27051 · v1 · pith:QIQDYQSGnew · submitted 2026-05-26 · 💻 cs.SE · cs.AI

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

Pith reviewed 2026-06-29 15:37 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords formal verificationfunction contractsLLM synthesisCEGAR-CEGIScompositional verificationbounded model checkingC programsloop invariants
0
0 comments X

The pith

ConVer decomposes C program verification top-down by synthesizing LLM contracts and refining them via CEGAR-CEGIS to reduce state-space explosion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ConVer to address the state-space explosion problem in bounded model checking of C programs. It generates function contracts from a top-level assertion using an LLM, then performs alternating system and function level checks that refine the contracts on failure. This top-down approach is tested across four benchmark suites of increasing difficulty. A sympathetic reader cares because the method aims to verify programs that exceed the practical limits of unrolling every construct in traditional BMC tools.

Core claim

ConVer decomposes verification top-down: it uses a large language model to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. On the Frama-C benchmark of 45 simple C programs, it achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration.

What carries the argument

Top-down compositional verification that synthesizes and refines LLM-generated function contracts inside a CEGAR-CEGIS loop with SMART ICE learning.

If this is right

  • 82-96% success on Frama-C with most cases converging after one iteration.
  • 33-50% success on the X.509 parser suite and 82-88% on LF2C-Simple.
  • 55-64% success on VerifyThis recursive programs using the Pre-Abstraction strategy.
  • 67% success on LF-Hard benchmarks after transpilation to C via ESBMC-LF.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contract synthesis and refinement pattern could be tested on programs written in languages other than C if an equivalent model checker is available.
  • If contract stability holds for industrial codebases, the approach would lower the manual annotation burden that currently limits adoption of formal methods.
  • Different LLM backends yielding similar success rates suggests the method is not tightly coupled to any single model family.
  • Programs with deeper recursion or more complex data structures than VerifyThis may require additional abstraction layers beyond the Pre-Abstraction strategy.

Load-bearing premise

LLM-generated contracts stay stable enough that the refinement loop can converge on programs whose structure matches the evaluated benchmarks.

What would settle it

A collection of C programs whose function dependencies differ markedly from the Frama-C and LF suites on which the LLM contracts repeatedly fail to stabilize after three or more refinement iterations.

Figures

Figures reproduced from arXiv: 2605.27051 by Konstantin Korovin, Lucas C. Cordeiro, Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous.

Figure 1
Figure 1. Figure 1: The CONVER verification pipeline. The LLM derives function contracts top-down from the system property; the CEGAR loop alternates system-level and function-level ESBMC checks, refining contracts via ICE learning until both levels pass. Counterexample Guided Inductive Synthesis (CEGIS) loop. 4) End-to-end evaluation. An empirical study across four benchmarks: the Frama-C benchmark (45 programs, 82– 96% succ… view at source ↗
Figure 2
Figure 2. Figure 2: Outcome breakdown on Frama-C benchmark (45 programs) for three [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Verification success rate (%) across all four benchmarks and three LLM backends. Each cell shows the convergence percentage under the best [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: CEGAR iteration distribution on Frama-C benchmark (SmartICE) for three LLM backends. Nearly all converged programs require only a single [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: SmartICE vs. No-ICE on the Frama-C benchmark for three LLM [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Outcome breakdown on LF-Hard benchmarks (24 programs, 600 s [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6~programs) and LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks using ESBMC-LF to C; we denote those LF-Hard. We show that ConVer successfully verifies 67% of LF-Hard benchmarks overall.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ConVer, a top-down compositional verification tool for C programs. It uses an LLM to synthesize function contracts from a top-level assertion, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts via SMART ICE learning when checks fail. The approach is evaluated on four benchmark suites of increasing difficulty (Frama-C with 45 simple programs, X.509 with 6 programs, LF2C-Simple with 17 programs, VerifyThis with 11 recursive/loop-intensive programs, plus LF-Hard derived via the new ESBMC-LF transpiler), reporting 82-96% success on Frama-C (with 93-95% single-iteration convergence), 33-50% on X.509, 82-88% on LF2C-Simple, 55-64% on VerifyThis (Pre-Abstraction), and 67% on LF-Hard.

Significance. If the empirical results hold under more rigorous experimental controls, ConVer could advance scalable formal verification by reducing manual contract specification through LLM assistance and compositional decomposition. The multi-suite evaluation, including introduction of ESBMC-LF for LF-to-C transpilation and testing on recursive programs, provides a useful baseline for LLM-assisted verification tools. The high single-iteration convergence on simple cases is a concrete strength worth highlighting.

major comments (2)
  1. [Evaluation] Evaluation section (results on Frama-C, X.509, VerifyThis): success percentages (82-96%, 33-50%, 55-64%) are reported without an experimental protocol, number of LLM runs per program, statistical tests, variance measures, or systematic failure-case analysis. This directly affects the ability to judge whether the numbers support the central claim of scalable verification via stable LLM contracts.
  2. [Evaluation] Evaluation on VerifyThis and X.509 suites: lower success rates are noted but without reporting iteration counts, contract refinement success/failure breakdowns, or metrics on contract stability for recursive/loop-intensive or parser programs. This leaves the CEGAR-CEGIS convergence assumption untested precisely where program structure diverges from the Frama-C simple cases.
minor comments (2)
  1. [Abstract] The abstract and evaluation could more clearly distinguish between the three LLM backends and their individual results rather than reporting ranges.
  2. [Approach] Notation for SMART ICE and the CEGAR-CEGIS loop could be introduced with a small diagram or pseudocode for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting gaps in the evaluation section. We address each major comment below and commit to revisions that add the requested experimental details to better support our claims about LLM-assisted contract synthesis and CEGAR-CEGIS convergence.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (results on Frama-C, X.509, VerifyThis): success percentages (82-96%, 33-50%, 55-64%) are reported without an experimental protocol, number of LLM runs per program, statistical tests, variance measures, or systematic failure-case analysis. This directly affects the ability to judge whether the numbers support the central claim of scalable verification via stable LLM contracts.

    Authors: We agree that the current manuscript does not provide a full experimental protocol, the number of LLM runs per program, statistical tests, variance measures, or a systematic failure-case analysis. This limits readers' ability to evaluate contract stability. In the revised manuscript we will add a dedicated 'Experimental Protocol' subsection that specifies the number of independent LLM runs performed for each program and configuration, reports observed variance across runs, includes basic statistical measures where appropriate, and provides a categorized breakdown of failure cases (e.g., synthesis failure vs. refinement divergence). These additions will directly address the concern about supporting the central claim. revision: yes

  2. Referee: [Evaluation] Evaluation on VerifyThis and X.509 suites: lower success rates are noted but without reporting iteration counts, contract refinement success/failure breakdowns, or metrics on contract stability for recursive/loop-intensive or parser programs. This leaves the CEGAR-CEGIS convergence assumption untested precisely where program structure diverges from the Frama-C simple cases.

    Authors: We acknowledge that iteration counts, refinement success/failure breakdowns, and contract stability metrics are not reported for the X.509 and VerifyThis suites. In the revision we will extend the results tables and accompanying text for these suites to include per-program and aggregate iteration counts, the fraction of cases requiring refinement steps, and indicators of contract stability (e.g., whether contracts stabilized after the first synthesis or needed multiple updates). This will allow direct assessment of the CEGAR-CEGIS loop on recursive and parser programs. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on external benchmarks

full rationale

The paper describes a compositional verification tool (ConVer) that combines LLM contract synthesis with a CEGAR-CEGIS loop and reports success rates directly on four external benchmark suites (Frama-C, X.509, LF2C-Simple, VerifyThis, LF-Hard). No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the abstract or described content. All performance figures (82-96% on Frama-C, lower rates on other suites) are presented as measured outcomes rather than derived by construction from the tool's own inputs or prior self-citations. The central claims therefore remain independent of any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivation is present; the approach rests on the empirical effectiveness of the LLM and the refinement loop rather than on stated axioms or free parameters.

pith-pipeline@v0.9.1-grok · 5859 in / 1009 out tokens · 34120 ms · 2026-06-29T15:37:27.290431+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Esbmc v7. 7: Efficient concurrent software verification with scheduling, incremental smt and partial order reduction: (competition contribution),

    T. Wu, X. Li, E. Manino, R. S. Menezes, M. R. Gadelha, S. Xiong, N. Tihanyi, P. Petoumenos, and L. C. Cordeiro, “Esbmc v7. 7: Efficient concurrent software verification with scheduling, incremental smt and partial order reduction: (competition contribution),” inInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems. ...

  2. [2]

    Code-level model checking in the software development workflow at amazon web services,

    N. Chong, B. Cook, J. Eidelman, K. Kallas, K. Khazem, F. R. Monteiro, D. Schwartz-Narbonne, S. Tasiran, M. Tautschnig, and M. R. Tuttle, “Code-level model checking in the software development workflow at amazon web services,”Software: Practice and Experience, vol. 51, no. 4, pp. 772–797, 2021

  3. [3]

    Model checking and the state explosion problem,

    E. M. Clarke, W. Klieber, M. Nov ´aˇcek, and P. Zuliani, “Model checking and the state explosion problem,” inLASER Summer School on Software Engineering. Springer, 2011, pp. 1–30

  4. [4]

    Counterexample- guided abstraction refinement,

    E. Clarke, O. Grumberg, S. Jha, Y . Lu, and H. Veith, “Counterexample- guided abstraction refinement,” inInternational Conference on Com- puter Aided Verification. Springer, 2000, pp. 154–169

  5. [5]

    Ice: A robust framework for learning invariants,

    P. Garg, C. L ¨oding, P. Madhusudan, and D. Neider, “Ice: A robust framework for learning invariants,” inInternational Conference on Computer Aided Verification. Springer, 2014, pp. 69–87

  6. [6]

    Symbolic model checking without bdds,

    A. Biere, A. Cimatti, E. Clarke, and Y . Zhu, “Symbolic model checking without bdds,” inInternational conference on tools and algorithms for the construction and analysis of systems. Springer, 1999, pp. 193–207

  7. [7]

    Bounded model checking

    A. Biere, A. Cimatti, E. M. Clarke, O. Strichman, and Y . Zhu, “Bounded model checking.”Handbook of satisfiability, vol. 185, no. 99, pp. 457– 481, 2009

  8. [8]

    Bounded Model Checking of Multi-threaded Software using SMT solvers

    L. Cordeiro and B. Fischer, “Bounded model checking of multi-threaded software using smt solvers,”arXiv preprint arXiv:1003.3830, 2010

  9. [9]

    Esbmc 5.0: an industrial-strength c model checker,

    M. R. Gadelha, F. R. Monteiro, J. Morse, L. C. Cordeiro, B. Fischer, and D. A. Nicole, “Esbmc 5.0: an industrial-strength c model checker,” inProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 888–891

  10. [10]

    Esbmc v7. 4: Harnessing the power of intervals: (competition contribution),

    R. S. Menezes, M. Aldughaim, B. Farias, X. Li, E. Manino, F. Shmarov, K. Song, F. Brauße, M. R. Gadelha, N. Tihanyiet al., “Esbmc v7. 4: Harnessing the power of intervals: (competition contribution),” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2024, pp. 376–380

  11. [11]

    Large language model (llm) for software security: Code analysis, malware analysis, reverse engineering,

    H. Jelodar, S. Bai, P. Hamedi, H. Mohammadian, R. Razavi-Far, and A. Ghorbani, “Large language model (llm) for software security: Code analysis, malware analysis, reverse engineering,”Journal of Information Security and Applications, vol. 98, p. 104390, 2026

  12. [12]

    Automated Program Repair in the Era of Large Pre-trained Language Models,

    C. S. Xia, Y . Wei, and L. Zhang, “Automated Program Repair in the Era of Large Pre-trained Language Models,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). Melbourne, Australia: IEEE, May 2023, pp. 1482–1494. [Online]. Available: https://ieeexplore.ieee.org/document/10172803/

  13. [13]

    Evaluating Language Models for Efficient Code Generation,

    J. Liu, S. Xie, J. Wang, Y . Wei, Y . Ding, and L. Zhang, “Evaluating Language Models for Efficient Code Generation,” Aug. 2024, arXiv:2408.06450. [Online]. Available: http://arxiv.org/abs/2408.06450

  14. [14]

    A new era in software security: Towards self-healing software via large language models and formal verification,

    N. Tihanyi, Y . Charalambous, R. Jain, M. A. Ferrag, and L. C. Cordeiro, “A new era in software security: Towards self-healing software via large language models and formal verification,” in2025 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 2025, pp. 136–147

  15. [15]

    The use of large language models for program repair,

    F. Zubair, M. Al-Hitmi, and C. Catal, “The use of large language models for program repair,”Computer Standards & Interfaces, vol. 93, p. 103951, Apr. 2025. [Online]. Available: https://www.sciencedirect. com/science/article/pii/S092054892400120X

  16. [16]

    Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs,

    M. Haque, P. Babkin, F. Farmahinifarahani, and M. Veloso, “Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs,” May 2025, arXiv:2505.04441 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2505.04441

  17. [17]

    Polyver: A compositional approach for polyglot system modeling and verification,

    P.-W. Chen, S. Lin, A. Godbole, R. Singh, E. Polgreen, E. Lee, and S. Seshia, “Polyver: A compositional approach for polyglot system modeling and verification,” 2025. [Online]. Available: https: //repositum.tuwien.at/handle/20.500.12708/219538

  18. [18]

    A tale of 1001 loc: Potential runtime error- guided specification synthesis for verifying large-scale programs,

    Z. Wang, T. Lin, M. Chen, H. Li, M. Yang, X. Yi, S. Qin, Y . Luo, X. Li, B. Guet al., “A tale of 1001 loc: Potential runtime error- guided specification synthesis for verifying large-scale programs,”arXiv preprint arXiv:2512.24594, 2025

  19. [19]

    Uclid5: integrating modeling, veri- fication, synthesis and learning,

    S. A. Seshia and P. Subramanyan, “Uclid5: integrating modeling, veri- fication, synthesis and learning,” in2018 16th ACM/IEEE International Conference on Formal Methods and Models for System Design (MEM- OCODE). IEEE, 2018, pp. 1–10

  20. [20]

    Array-carrying symbolic execution for function contract generation,

    W. Lu, J. Ke, H. Fu, Z. Sun, Y . Zhou, G. Li, and H. Li, “Array-carrying symbolic execution for function contract generation,” inProceedings of the 26th International Symposium on Formal Methods (FM 2026), ser. Lecture Notes in Computer Science. Springer, 2026

  21. [21]

    Yesterday, my program worked. today, it does not. why?

    A. Zeller, “Yesterday, my program worked. today, it does not. why?” ACM SIGSOFT Software engineering notes, vol. 24, no. 6, pp. 253–267, 1999

  22. [22]

    Simplifying and isolating failure-inducing input,

    A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,”IEEE Transactions on software engineering, vol. 28, no. 2, pp. 183–200, 2002

  23. [23]

    ESBMC 7.4: Harnessing the power of intervals,

    R. S ´a Menezes, M. Aldughaim, B. Farias, X. Li, E. Manino, F. Shmarov, K. Song, F. Brauße, M. R. Gadelha, N. Tihanyi, K. Korovin, and L. C. Cordeiro, “ESBMC 7.4: Harnessing the power of intervals,” Systems & Software Verification Laboratory, Software Documentation, 2024, version 7.4. Available at: https: //github.com/esbmc/esbmc. Accessed: 2026-03-20. [O...

  24. [24]

    Llm-generated invariants for bounded model checking without loop unrolling,

    M. A. Pirzada, G. Reger, A. Bhayat, and L. C. Cordeiro, “Llm-generated invariants for bounded model checking without loop unrolling,” inPro- ceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1395–1407

  25. [25]

    Assume-guarantee abstraction refinement for probabilistic systems,

    A. Komuravelli, C. S. P ˘as˘areanu, and E. M. Clarke, “Assume-guarantee abstraction refinement for probabilistic systems,” inInternational con- ference on computer aided verification. Springer, 2012, pp. 310–326

  26. [26]

    Sketching concurrent data structures,

    A. Solar-Lezama, C. G. Jones, and R. Bodik, “Sketching concurrent data structures,” inProceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’08. New York, NY , USA: Association for Computing Machinery, 2008, p. 136–148. [Online]. Available: https://doi.org/10.1145/1375581.1375599

  27. [27]

    A complexity measure,

    T. J. McCabe, “A complexity measure,”IEEE Transactions on software Engineering, no. 4, pp. 308–320, 1976

  28. [28]

    Towards building verifiable cps using lingua franca,

    S. Lin, Y . A. Manerkar, M. Lohstroh, E. Polgreen, S.-J. Yu, C. Jerad, E. A. Lee, and S. A. Seshia, “Towards building verifiable cps using lingua franca,”ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–24, 2023

  29. [29]

    Introducing Claude 3.5 Sonnet,

    Anthropic, “Introducing Claude 3.5 Sonnet,” 2024, accessed: 2026-03-27. [Online]. Available: https://www.anthropic.com/news/ claude-3-5-sonnet

  30. [30]

    State of the art in software verification and witness validation: Sv-comp 2024,

    D. Beyer, “State of the art in software verification and witness validation: Sv-comp 2024,” inInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2024, pp. 299–329