Certified Program Synthesis with a Multi-Modal Verifier
Pith reviewed 2026-05-10 08:14 UTC · model grok-4.3
The pith
LeetProof uses a multi-modal verifier embedded in Lean to certify a higher share of programs generated from natural language than single-mode baselines achieve at the same budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By structuring the certified synthesis workflow around a multi-modal verifier that combines dynamic validation, automated proofs, and interactive proof scripting in a single foundational framework, LeetProof achieves a significantly higher rate of fully certified solutions than a single-mode baseline at the same budget, consistently across two frontier LLM backends. Its specification validation step also uncovers defects in existing reference benchmarks.
What carries the argument
Velvet, the multi-modal verifier embedded in Lean, which performs randomized property-based testing for specification validation, decomposes synthesis tasks using verification conditions, and delegates residual proofs to frontier AI provers.
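The specification-validation idea can be illustrated with Plausible, the Lean property-testing library cited in the reference graph below. The snippet is a minimal sketch, not Velvet's actual interface: weakSpec is a hypothetical, deliberately under-constrained specification for "return the maximum of a list", and the plausible tactic searches for a counterexample showing that the spec does not imply the intended behaviour, flagging the defect before any code is synthesised.

```lean
import Plausible

-- Hypothetical under-constrained spec for "return the maximum of a list":
-- it only requires the result to be a member of the list, not an upper bound.
def weakSpec (xs : List Nat) (r : Nat) : Prop := r ∈ xs

-- Probe the spec: does satisfying it imply the intended behaviour?
-- `plausible` tests random inputs and reports a counterexample
-- (e.g. xs = [0, 1], r = 0), so this check fails, exposing the weakness.
example (xs : List Nat) (r : Nat) (h : weakSpec xs r) :
    ∀ x ∈ xs, x ≤ r := by
  plausible
```

An analogous probe in the other direction, showing a candidate spec unsatisfiable on random inputs, would flag an over-strong specification that no implementation can meet.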
If this is right
- Specifications synthesized from natural language can be tested for weakness or over-strength before any code is generated.
- Synthesis tasks can be decomposed into sub-problems aligned with verification conditions.
- Residual proof obligations can be routed to AI provers specialized for the Lean ecosystem.
- The performance gain holds across different frontier LLM backends at fixed computational budget.
- Existing certified-synthesis benchmarks contain identifiable defects that can be detected systematically.
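To make the decomposition bullet concrete, here is a small illustrative Lean proof (not taken from the paper) in which a single correctness goal for a hypothetical clamp function splits into two verification-condition-like sub-obligations, each discharged by a different mechanism: a library lemma and an automated tactic.

```lean
-- Hypothetical example: correctness of `clamp` splits into two
-- sub-obligations, mirroring decomposition by verification conditions.
def clamp (lo hi x : Nat) : Nat := max lo (min hi x)

theorem clamp_spec (lo hi x : Nat) (h : lo ≤ hi) :
    lo ≤ clamp lo hi x ∧ clamp lo hi x ≤ hi := by
  refine ⟨?vc1, ?vc2⟩
  case vc1 => exact Nat.le_max_left lo (min hi x)  -- closed by a library lemma
  case vc2 => unfold clamp; omega                  -- closed by automation
```

In the pipeline described by the paper, a residual obligation that neither a lemma nor automation closes would instead be routed to an AI prover.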
Where Pith is reading between the lines
- The same staged validation-plus-decomposition pattern could be ported to other foundational proof assistants to test whether multi-modality is broadly useful.
- Software teams might adopt certified synthesis earlier if the pipeline reduces the manual effort needed to reach machine-checkable proofs.
- Cleaning the reference benchmarks of the uncovered defects and re-running both pipelines would isolate how much of the reported gain is due to better handling versus benchmark noise.
- Extending the dynamic-testing stage to larger input spaces could further tighten the quality of generated specifications.
Load-bearing premise
A multi-modal verifier can be realized in Lean to combine dynamic validation, automated proofs, and interactive scripting without prohibitive overhead or loss of expressivity, and the benchmarks remain representative after defects are uncovered.
What would settle it
A head-to-head run of LeetProof against the single-mode baseline on the same natural-language tasks: if the percentage of fully machine-checkable certified solutions is not higher for the multi-modal pipeline, the core claim fails.
Original abstract
Certified program synthesis (aka vericoding) is the process of automatically generating a program, its formal specification, and a machine-checkable proof of their alignment from a natural-language description. Two challenges make vericoding difficult. First, specifications synthesised from natural language are often either too weak to be meaningful or too strong to be implementable, yet existing approaches lack systematic means to detect such defects. Second, the landscape of program verifiers is fragmented: each tool supports a particular reasoning mode -- auto-active (e.g., Dafny, Verus) or interactive (e.g., Coq, Lean) -- with its own trade-off between automation and expressivity. This forces every synthesis methodology to be tailored to a single verification paradigm, limiting the class of tasks it can handle effectively. We overcome both challenges by structuring the certified synthesis workflow around a multi-modal verifier -- a single tool combining dynamic validation, automated proofs, and interactive proof scripting in one foundational framework. We realise this idea in LeetProof, an agentic pipeline built on Velvet, a multi-modal verifier embedded in Lean. Multi-modality enables LeetProof to validate generated specifications via randomised property-based testing before any code is synthesised, decompose the synthesis task into sub-problems guided by verification conditions, and delegate residual proof obligations to frontier AI provers specialised for Lean. We evaluate LeetProof on benchmarks derived from prior work on certified synthesis. Our specification validation uncovers defects in existing reference benchmarks, and LeetProof's staged pipeline achieves a significantly higher rate of fully certified solutions than a single-mode baseline at the same budget -- consistently across two frontier LLM backends.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LeetProof, an agentic pipeline for certified program synthesis (vericoding) built on Velvet, a multi-modal verifier embedded in Lean. Velvet integrates dynamic validation via property-based testing, automated proofs, and interactive scripting in one framework. LeetProof uses this to validate synthesized specifications before code generation, decompose tasks guided by verification conditions, and delegate residual obligations to AI provers. Evaluation on benchmarks derived from prior work reports uncovering defects in reference benchmarks and a significantly higher rate of fully certified solutions than a single-mode baseline at fixed budget, consistent across two frontier LLM backends.
Significance. If the empirical results hold with adequate quantitative support, the work is significant for certified synthesis: it directly tackles the fragmentation of verification tools by unifying modes in a foundational system (Lean), enabling earlier defect detection in specifications and more flexible handling of proof obligations. The uncovering of benchmark defects provides a concrete service to the community. The staged pipeline design and cross-backend consistency are strengths that could influence future agentic synthesis systems. Machine-checked elements in Lean add reliability to the claims.
major comments (2)
- [Evaluation] Evaluation section: The central claim of significantly higher fully certified solution rates (and consistency across backends) is load-bearing but unsupported by any reported success percentages, task counts, error bars, or statistical details in the manuscript; without these, the magnitude and reliability of the improvement over the single-mode baseline cannot be assessed.
- [Sections 3-4] Velvet realization (Sections 3-4): The claim that a multi-modal verifier can be effectively realized in Lean to combine dynamic validation, automated proofs, and interactive scripting without prohibitive overhead or loss of expressivity is central to the approach but lacks any architecture details, integration mechanism, or overhead measurements, undermining evaluation of the weakest assumption.
minor comments (2)
- [Abstract] Abstract: Include at least one key quantitative result (e.g., success rate delta) to ground the 'significantly higher' claim for readers.
- [Introduction] Introduction: Define 'vericoding' on first use and add a citation to the prior work from which benchmarks are derived.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The two major comments identify important gaps in the presentation of our empirical results and the technical realization of Velvet. We will revise the manuscript to address both points directly, as detailed below. These changes will strengthen the paper without altering its core claims or contributions.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: The central claim of significantly higher fully certified solution rates (and consistency across backends) is load-bearing but unsupported by any reported success percentages, task counts, error bars, or statistical details in the manuscript; without these, the magnitude and reliability of the improvement over the single-mode baseline cannot be assessed.
Authors: We agree that the evaluation section requires explicit quantitative support for the reported improvement. The current manuscript states that LeetProof achieves a significantly higher rate of fully certified solutions than the single-mode baseline at fixed budget and that this holds consistently across two frontier LLM backends, but does not include the supporting numbers. In the revised version we will add a dedicated evaluation subsection (or expanded table) that reports: (i) the exact benchmark size and task counts, (ii) success percentages for fully certified solutions under LeetProof versus the baseline, (iii) error bars or variance across runs, and (iv) any statistical tests performed. This will allow readers to evaluate both the magnitude and reliability of the gains. revision: yes
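Item (iii) can be made concrete with a standard percentile bootstrap over per-task outcomes. The sketch below uses invented numbers purely for illustration (they are not results from the paper); it shows one conventional way to attach an interval to a success-rate difference between two pipelines.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in success rates
    between two pipelines, given per-task 0/1 outcome lists."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]  # resample tasks with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative numbers only (not from the paper): 50 tasks per pipeline,
# 38/50 vs 27/50 fully certified solutions.
leetproof = [1] * 38 + [0] * 12
baseline  = [1] * 27 + [0] * 23
lo, hi = bootstrap_diff_ci(leetproof, baseline)
print(f"95% CI for rate difference: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the improvement is unlikely to be resampling noise at this benchmark size; reporting it alongside raw counts would address the referee's concern.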
-
Referee: [Sections 3-4] Velvet realization (Sections 3-4): The claim that a multi-modal verifier can be effectively realized in Lean to combine dynamic validation, automated proofs, and interactive scripting without prohibitive overhead or loss of expressivity is central to the approach but lacks any architecture details, integration mechanism, or overhead measurements, undermining evaluation of the weakest assumption.
Authors: We concur that Sections 3 and 4 currently provide insufficient detail on Velvet's implementation. The manuscript describes Velvet as a multi-modal verifier embedded in Lean that unifies randomized property-based testing, automated proof tactics, and interactive scripting, but does not supply architecture diagrams, the precise integration points between modes, or concrete overhead measurements. In revision we will expand these sections with: (1) a high-level architecture overview and Lean module structure showing how the three modes coexist without restricting expressivity, (2) the mechanisms used to switch between or combine modes within a single verification condition, and (3) empirical overhead data (runtime and memory) collected on representative tasks. These additions will substantiate the feasibility claim. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical agentic pipeline (LeetProof) built on a multi-modal verifier (Velvet in Lean) and reports experimental results: higher rates of fully certified solutions versus a single-mode baseline at fixed budget, plus discovery of benchmark defects. No mathematical derivation, equations, or first-principles predictions are claimed; the central result is a direct empirical comparison on externally derived benchmarks. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the outcome to the paper's own inputs by construction appear in the provided text. The evaluation is externally grounded and falsifiable via replication on the same benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Lean provides a sound foundation for interactive and automated proofs of program correctness.
- domain assumption: Randomized property-based testing can reliably detect defects in specifications synthesized from natural language.
invented entities (2)
- Velvet: no independent evidence
- LeetProof: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladimir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and L...
doi:10.48550/arxiv.2510.01346, 2025
-
[2]
Pranjal Aggarwal, Bryan Parno, and Sean Welleck. 2025. AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement. In ICML (Proceedings of Machine Learning Research). PMLR / OpenReview.net. https://proceedings.mlr.press/v267/aggarwal25a.html
2025
-
[3]
Andrew W. Appel, Robert Dockins, Aquinas Hobor, Lennart Beringer, Josiah Dodds, Gordon Stewart, Sandrine Blazy, and Xavier Leroy. 2014. Program Logics for Certified Compilers. Cambridge University Press.
2014
-
[4]
Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Soonho Kong, and Sean B. Holden. 2025. ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis. CoRR abs/2512.10173 (2025). https://doi.org/10.48550/ARXIV.2512.10173 arXiv:2512.10173
2025
-
[5]
Debangshu Banerjee, Olivier Bouissou, and Stefan Zetzsche. 2026. DafnyPro: LLM-Assisted Automated Verification for Dafny Programs. CoRR abs/2601.05385 (2026). https://doi.org/10.48550/ARXIV.2601.05385
-
[6]
Haniel Barbosa, Clark W. Barrett, Martin Brain, Gereon Kremer, Hanna Lachnitt, Makai Mann, Abdalrhman Mohamed, Mudathir Mohamed, Aina Niemetz, Andres Nötzli, Alex Ozdemir, Mathias Preiner, Andrew Reynolds, Ying Sheng, Cesare Tinelli, and Yoni Zohar. 2022. cvc5: A Versatile and Industrial-Strength SMT Solver. In TACAS (LNCS, Vol. 13243). Springer, 415–4...
-
[7]
Lukas Bulwahn. 2012. The New Quickcheck for Isabelle - Random, Exhaustive and Symbolic Testing under One Roof. In CPP (LNCS). Springer, 92–108. https://doi.org/10.1007/978-3-642-35308-6_10
-
[8]
Sergiu Bursuc, Theodore Ehrenborg, Shaowei Lin, Lacramioara Astefanoaei, Ionel Emilian Chiosa, Jure Kukovec, Alok Singh, Oliver Butterley, Adem Bizid, Quinn Dougherty, Miranda Zhao, Max Tan, and Max Tegmark. 2025. A Benchmark for Vericoding: Formally Verified Program Synthesis. CoRR abs/2509.22908 (2025). https://doi.org/10.48550/ARXIV.2509.22908
-
[9]
Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu K. Lahiri, and Nikhil Swamy. 2025. Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming. In ICSE. IEEE, 1755–1767. https://doi.org/10.1109/ICSE55347.2025.00002
-
[10]
Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu K Lahiri, Tao Xie, and Lidong Zhou. 2025. Automated Proof Generation for Rust Code via Self-Evolution. CoRR abs/2410.15756 (2025). https://doi.org/10.48550/ARXIV.2410.15756 ICLR 2025
- [11]
-
[12]
Leonardo de Moura and Sebastian Ullrich. 2021. The Lean 4 Theorem Prover and Programming Language. In CADE (LNCS, Vol. 12699) . Springer, 625–635. https://doi.org/10.1007/978-3-030-79876-5_37
-
[13]
Leonardo Mendonça de Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In TACAS (LNCS, Vol. 4963). Springer, 337–340. https://doi.org/10.1007/978-3-540-78800-3_24
2008
-
[14]
Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean Theorem Prover (System Description). In CADE (LNCS, Vol. 9195). Springer, 378–388. https://doi.org/10.1007/978-3-319-21401-6_26
-
[15]
Benjamin Delaware, Clément Pit-Claudel, Jason Gross, and Adam Chlipala. 2015. Fiat: Deductive Synthesis of Abstract Data Types in a Proof Assistant. In POPL. ACM, 689–700. https://doi.org/10.1145/2676726.2677006
-
[16]
Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein, Vitaly Kurin, George Pîrlea, Qiyuan Zhao, Peter Müller, and Ilya Sergey. 2026. LeetProof: Artefact. Zenodo. https://doi.org/10.5281/zenodo.19624966 Includes implementation, benchmarks, and evaluation harness
-
[17]
Cormac Flanagan and K. Rustan M. Leino. 2001. Houdini, an Annotation Assistant for ESC/Java. In FME (LNCS, Vol. 2021). Springer, 500–517. https://doi.org/10.1007/3-540-45251-6_29
-
[19]
Foundational Multi-Modal Program Verifiers. Proc. ACM Program. Lang. 10, POPL (2026), 1–28. https://doi.org/10.1145/3776719
-
[20]
Vladimir Gladshtein, George Pîrlea, Qiyuan Zhao, Vitaly Kurin, and Ilya Sergey
-
[21]
Velvet: A Multi-Modal Verifier for Effectful Programs. In Dafny Workshop. https://verse-lab.org/papers/velvet-dafny26.pdf Code available at https://github.com/verse-lab/velvet
-
[22]
Harrison Goldstein, Joseph W. Cutler, Daniel Dickstein, Benjamin C. Pierce, and Andrew Head. 2024. Property-Based Testing in Practice. In ICSE. ACM, 187:1–187:13. https://doi.org/10.1145/3597503.3639581
-
[23]
Harrison Goldstein, Hila Peleg, Cassia Torczon, Daniel Sainati, Leonidas Lampropoulos, and Benjamin C. Pierce. 2025. The Search for Constrained Random Generators. CoRR abs/2511.12253 (2025). https://doi.org/10.48550/ARXIV.2511.12253 arXiv:2511.12253
2025
-
[24]
C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (1969), 576–580. https://doi.org/10.1145/363235.363259
-
[25]
Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, Paul Lezeau, Salvatore Mercuri, Calle Sönne, Bhavik Mehta, Alex D...
-
[26]
John Hughes. 2011. Specification based testing with QuickCheck: tutorial talk. In FMCAD. FMCAD Inc., 17. http://dl.acm.org/citation.cfm?id=2157659
2011
-
[27]
Saketh Ram Kasibatla, Arpan Agarwal, Yuriy Brun, Sorin Lerner, Talia Ringer, and Emily First. 2024. Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification. CoRR abs/2410.19940 (2024). arXiv:2410.19940 http://arxiv.org/abs/2410.19940
2024
-
[28]
Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2015. Frama-C: A software analysis perspective. Formal Aspects Comput. 27, 3, 573–609. https://doi.org/10.1007/S00165-014-0326-7
-
[29]
Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages. In FMCAD. IEEE, 142–147. https://doi.org/10.34727/2024/isbn.978-3-85448-065-5_19
-
[30]
Andrea Lattuada, Travis Hance, Jay Bosamiya, Matthias Brun, Chanhee Cho, Hayley LeBlanc, Pranav Srinivasan, Reto Achermann, Tej Chajed, Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Oded Padon, and Bryan Parno. 2024. Verus: A Practical Foundation for Systems Verification. In SOSP. ACM, 438–454. https://doi.org/10.1145/3694715.3695952
-
[31]
leanprover-community. 2025. Plausible: A property testing framework for Lean
2025
-
[32]
https://github.com/leanprover-community/plausible
-
[33]
leanprover-community. 2026. Lean Language Reference: The grind tactic. https://lean-lang.org/doc/reference/latest/The--grind--tactic/
2026
-
[34]
leanprover-community. 2026. Lean Language Reference: Type Classes. https://lean-lang.org/doc/reference/latest/Type-Classes/
2026
-
[35]
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In LPAR (LNCS, Vol. 6355). Springer, 348–370. https://doi.org/10.1007/978-3-642-17511-4_20
2010
- [36]
-
[37]
Jannis Limperg and Asta Halkjær From. 2023. Aesop: White-Box Best-First Proof Search for Lean. In CPP. ACM, 253–266. https://doi.org/10.1145/3573105.3575671
-
[38]
Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. 2025. DafnyBench: A Benchmark for Formal Software Verification. Trans. Mach. Learn. Res. 2025 (2025). https://openreview.net/forum?id=yBgTVWccIx
2025
- [39]
-
[40]
Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2025. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. In ICSE. IEEE, 16–28. https://doi.org/10.1109/ICSE55347.2025.00129
-
[41]
The mathlib Community. 2020. The Lean mathematical library. In CPP. ACM, 367–381. https://doi.org/10.1145/3372885.3373824 https://github.com/leanprover-community/mathlib4
-
[42]
Kenneth L. McMillan and Oded Padon. 2020. Ivy: A Multi-modal Verification Tool for Distributed Algorithms. In CAV (LNCS, Vol. 12225). Springer, 190–202. https://doi.org/10.1007/978-3-030-53291-8_12
-
[43]
Brando Miranda, Zhanke Zhou, Allen Nie, Elyas Obbad, Leni Aniva, Kai Fronsdal, Weston Kirk, Dilara Soylu, Andrea Yu, Ying Li, and Sanmi Koyejo. 2025. VeriBench: End-to-End Formal Verification Benchmark for AI Code Generation in Lean 4. (2025). https://openreview.net/forum?id=rWkGFmnSNl
2025
-
[44]
Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, and James Noble. 2024. Towards AI-Assisted Synthesis of Verified Dafny Methods. Proc. ACM Softw. Eng. 1, FSE (2024), 812–835. https://doi.org/10.1145/3643763
-
[45]
Abdalrhman Mohamed, Tomaz Mascarenhas, Muhammad Harun Ali Khan, Haniel Barbosa, Andrew Reynolds, Yicheng Qian, Cesare Tinelli, and Clark W. Barrett. 2025. lean-smt: An SMT Tactic for Discharging Proof Goals in Lean. In CAV (LNCS, Vol. 15933). Springer, 197–212. https://doi.org/10.1007/978-3-031-98682-6_11
-
[46]
Eric Mugnier, Emmanuel Anaya Gonzalez, Ranjit Jhala, Nadia Polikarpova, and Yuanyuan Zhou. 2025. Laurel: Unblocking Automated Verification with Large Language Models. Proc. ACM Program. Lang. 9, OOPSLA1 (2025). https://doi.org/10.1145/3720499
-
[47]
Prasita Mukherjee, Minghai Lu, and Benjamin Delaware. 2025. LLM-Assisted Synthesis of High-Assurance C Programs. In ASE. IEEE, 3108. https://doi.org/10.1109/ASE63991.2025.00255
-
[48]
Peter Müller, Malte Schwerhoff, and Alexander J. Summers. 2016. Viper: A Verification Infrastructure for Permission-Based Reasoning. In VMCAI (LNCS, Vol. 9583). Springer, 41–62. https://doi.org/10.1007/978-3-662-49122-5_2
-
[49]
Peter W. O’Hearn, John C. Reynolds, and Hongseok Yang. 2001. Local Reasoning about Programs that Alter Data Structures. In CSL (LNCS, Vol. 2142). Springer, 1–19. https://doi.org/10.1007/3-540-44802-0_1
-
[50]
Oded Padon, Kenneth L. McMillan, Aurojit Panda, Mooly Sagiv, and Sharon Shoham. 2016. Ivy: safety verification by interactive generalization. In PLDI. ACM, 614–630. https://doi.org/10.1145/2908080.2908118
-
[51]
Zoe Paraskevopoulou, Cătălin Hriţcu, Maxime Dénès, Leonidas Lampropoulos, and Benjamin C. Pierce. 2015. Foundational Property-Based Testing. In ITP (LNCS, Vol. 9236). Springer, 325–343. https://doi.org/10.1007/978-3-319-22102-1_22
-
[52]
Arthur Paulino, Damiano Testa, Edward Ayers, Evgenia Karunus, Henrik Bövinga, Jannis Limperg, Siddhartha Gadgil, and Siddharth Bhat. 2024. Metaprogramming in Lean 4. Available at https://leanprover-community.github.io/lean4-metaprogramming-book/
2024
-
[53]
George Pîrlea, Vladimir Gladshtein, Elad Kinsbruner, Qiyuan Zhao, and Ilya Sergey. 2025. Veil: A Framework for Automated and Interactive Verification of Transition Systems. In CAV (LNCS, Vol. 15933). Springer, 26–41. https://doi.org/10.1007/978-3-031-98682-6_2
-
[54]
Clément Pit-Claudel, Jade Philipoom, Dustin Jamner, Andres Erbsen, and Adam Chlipala. 2022. Relational Compilation for Performance-Critical Applications. In PLDI. ACM, 918–932. https://doi.org/10.1145/3519939.3523706
-
[55]
Nadia Polikarpova and Ilya Sergey. 2019. Structuring the Synthesis of Heap-Manipulating Programs. PACMPL 3, POPL (2019), 72:1–72:30. https://doi.org/10.1145/3290385
-
[56]
Yicheng Qian, Joshua Clune, Clark W. Barrett, and Jeremy Avigad. 2025. Lean-Auto: An Interface Between Lean 4 and Automated Theorem Provers. In CAV (LNCS, Vol. 15933). Springer, 175–196. https://doi.org/10.1007/978-3-031-98682-6_10
-
[57]
John C. Reynolds. 2002. Separation Logic: A Logic for Shared Mutable Data Structures. In LICS. IEEE Computer Society, 55–74. https://doi.org/10.1109/LICS.2002.1029817
-
[58]
Rocq Development Team. 2025. The Rocq Prover. https://rocq-prover.org. Version 9.0.0, released March 12, 2025
2025
-
[59]
Grigore Rosu. 2017. K: A Semantic Framework for Programming Languages and Formal Analysis Tools. In Dependable Software Systems Engineering. NATO Science for Peace and Security Series - D: Information and Communication Security, Vol. 50. IOS Press, 186–206. https://doi.org/10.3233/978-1-61499-810-5-186
-
[60]
Peiyang Song, Kaiyu Yang, and Anima Anandkumar. 2025. Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean. (2025), 144–169. https://proceedings.mlr.press/v288/song25a.html
2025
-
[61]
Chuyue Sun, Ying Sheng, Oded Padon, and Clark Barrett. 2024. Clover: Closed-Loop Verifiable Code Generation. CoRR abs/2310.17807 (2024). https://doi.org/10.48550/ARXIV.2310.17807 SAIV 2024
-
[62]
Nikhil Swamy, Cătălin Hriţcu, Chantal Keller, Aseem Rastogi, Antoine Delignat-Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet, Pierre-Yves Strub, Markulf Kohlweiss, Jean-Karim Zinzindohoue, and Santiago Zanella-Béguelin. 2016. Dependent types and multi-monadic effects in F*. In POPL. ACM, 256–270. https://doi.org/10.1145/2837614.2837655
-
[63]
Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. 2025. CLEVER: A Curated Benchmark for Formally Verified Code Generation. CoRR abs/2505.13938 (2025). https://doi.org/10.48550/ARXIV.2505.13938 NeurIPS 2025 Datasets and Benchmarks
-
[64]
Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, and Emily First. 2024. Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification. arXiv preprint arXiv:2412.14063 (2024). https://arxiv.org/abs/2412.14063
-
[65]
Haoxin Tu, Huan Zhao, Yahui Song, Mehtab Zafar, Ruijie Meng, and Abhik Roychoudhury. 2025. Agentic Program Verification. CoRR abs/2511.17330 (2025). https://doi.org/10.48550/ARXIV.2511.17330 arXiv:2511.17330
2025
-
[66]
Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. 2025. Hilbert: Recursively Building Formal Proofs with Informal Reasoning. CoRR abs/2509.22819 (2025). https://doi.org/10.48550/ARXIV.2509.22819 arXiv:2509.22819
-
[67]
Yasunari Watanabe, Kiran Gopinathan, George Pîrlea, Nadia Polikarpova, and Ilya Sergey. 2021. Certifying the Synthesis of Heap-Manipulating Programs. Proc. ACM Program. Lang. 5, ICFP (2021), 1–29. https://doi.org/10.1145/3473589
-
[68]
Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu. 2025. AutoVerus: Automated Proof Generation for Rust Code. Proc. ACM Program. Lang. 9, OOPSLA2 (2025). https://doi.org/10.1145/3763174
-
[69]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In NeurIPS. https://openreview.net/forum?id=mXpq6ut8J3
2024
-
[70]
Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. 2025. VERINA: Benchmarking Verifiable Code Generation. CoRR abs/2505.23135 (2025). https://doi.org/10.48550/ARXIV.2505.23135 ICLR 2026
-
[71]
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In ISSTA. ACM, 1592–1604. https://doi.org/10.1145/3650212.3680384
discussion (0)