pith. sign in

arxiv: 2605.30914 · v1 · pith:54WC7TJ5new · submitted 2026-05-29 · 💻 cs.LG · cs.SE

Automating Formal Verification with Reinforcement Learning and Recursive Inference

Pith reviewed 2026-06-28 23:50 UTC · model grok-4.3

classification 💻 cs.LG cs.SE
keywords reinforcement learningformal verificationlarge language modelsDafnyLeanproof generationverifier feedbackverified code
0
0 comments X

The pith

Reinforcement learning from verifier rewards and recursive inference raise verified pass rates on Dafny and Lean tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how environments that check program correctness with compilers and verifiers can train language models to generate formally verified code and proofs. It applies reinforcement learning from verifiable rewards in Dafny by assembling candidates into programs and scoring them on verification outcomes, first on an initial dataset then after filtering to limit specification hacking. In Lean it builds an inference scaffold that decomposes tasks into subgoals and applies a proof reviser using verifier diagnostics. These steps produce measured gains in verified pass rates and solve some previously unsolved tasks. A reader would care because formal verification promises machine-checked correctness but remains difficult for current models due to scarce data and strict specification requirements.

Core claim

The paper claims that multi-turn RLVR with GRPO on a refined Dafny benchmark lifts verified pass rate from 9.7% to 31.1%, while a fixed-base-model verifier-guided scaffold with whole-task decomposition and proof reviser raises Lean pass rate from 46.2% under direct repair to 69.2% and solves 7 of 42 previously unsolved tasks on the VERINA dataset.

What carries the argument

Verifier-guided reinforcement learning from compiler and verifier outcomes together with a recursive inference scaffold that decomposes proofs and applies a reviser on feedback.

If this is right

  • Open-source models reach higher verified reward and pass rates when trained with multi-turn RLVR after dataset refinement.
  • Whole-task decomposition plus proof reviser enables solving tasks that direct generation or repair leave unsolved.
  • Repository-scale Lean benchmarks remain difficult and require stronger progress evaluation and tool-use policies.
  • Specification hacking can be reduced by explicit filtering of weak tasks before RLVR training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-signal loop could be applied to other proof assistants if they expose comparable compiler and verifier feedback.
  • New benchmarks that avoid post-hoc filtering would test whether the reported gains generalize beyond the refined task distribution.
  • Combining the RLVR training loop with the inference scaffold in one system might compound the separate gains observed here.

Load-bearing premise

Filtering underspecified and vulnerable tasks from the dataset still produces a benchmark that measures intended verification capability rather than selection effects.

What would settle it

Re-evaluating the same models on the original unfiltered APPS-derived Dafny dataset or on a fresh set of verification tasks drawn without the same filtering criteria would show whether the pass-rate gains hold or shrink.

Figures

Figures reproduced from arXiv: 2605.30914 by Max Tan.

Figure 3.1
Figure 3.1. Figure 3.1: Concrete worksheet anatomy for an example Dalek Bench [PITH_FULL_IMAGE:figures/full_fig_p045_3_1.png] view at source ↗
Figure 3.2
Figure 3.2. Figure 3.2: Distribution of helper goals per task in the helper-scaffolded Dalek Bench. The [PITH_FULL_IMAGE:figures/full_fig_p046_3_2.png] view at source ↗
Figure 4.1
Figure 4.1. Figure 4.1: Single-turn Dafny RLVR training and validation pipeline. Training uses grouped rollouts, sampling K = 8 completions per prompt before extracting Dafny code and evaluating each completion with build / verify. Verifier outcomes are mapped to a staged reward and used for the DAPO update. Validation reuses the same extraction and verifier-based scoring pipeline with K = 1, reporting staged reward on the held… view at source ↗
Figure 4.2
Figure 4.2. Figure 4.2: Reward trajectory for the initial single-turn RLVR run. The training curve [PITH_FULL_IMAGE:figures/full_fig_p059_4_2.png] view at source ↗
Figure 4.3
Figure 4.3. Figure 4.3: Logged validation diagnostics for the initial single-turn RLVR run. [PITH_FULL_IMAGE:figures/full_fig_p061_4_3.png] view at source ↗
Figure 4.4
Figure 4.4. Figure 4.4: Reward trajectories for the initial APPS-derived run and the quality-filtered hard [PITH_FULL_IMAGE:figures/full_fig_p066_4_4.png] view at source ↗
Figure 4.5
Figure 4.5. Figure 4.5: Multi-turn verifier-feedback environment for Dafny RLVR. The policy samples a JSON replacement at attempt t, after which the environment reconstructs the Dafny file and evaluates it with dafny build / dafny verify. If verification succeeds, the trajectory terminates with discounted reward γ t−1 ; otherwise, verifier diagnostics define a local repair state for attempt t + 1, until either success or the K … view at source ↗
Figure 4.6
Figure 4.6. Figure 4.6: Training diagnostics for the multi-turn Dafny RLVR run on the quality-filtered [PITH_FULL_IMAGE:figures/full_fig_p072_4_6.png] view at source ↗
Figure 4.7
Figure 4.7. Figure 4.7: Average number of turns per episode during the multi-turn Dafny RLVR run. [PITH_FULL_IMAGE:figures/full_fig_p073_4_7.png] view at source ↗
Figure 4.8
Figure 4.8. Figure 4.8: Prompt and response token lengths during the multi-turn Dafny RLVR run. [PITH_FULL_IMAGE:figures/full_fig_p074_4_8.png] view at source ↗
Figure 5.1
Figure 5.1. Figure 5.1: Overview of the verifier-in-the-loop scaffold for Lean verified coding. The task [PITH_FULL_IMAGE:figures/full_fig_p091_5_1.png] view at source ↗
Figure 5.2
Figure 5.2. Figure 5.2: Recursive proof decomposition for Lean verified coding. Given a root theorem or lemma obligation, the scaffold first attempts a direct proof. If the direct attempt succeeds, the resulting proof can be accepted as a verified proof. If it fails, the system constructs a proof sketch containing named intermediate claims, which are promoted into standalone subgoals. Each subgoal is then solved independently: … view at source ↗
Figure 5.3
Figure 5.3. Figure 5.3: Progress-evaluation control flow. Deterministic gates first enforce hard constraints [PITH_FULL_IMAGE:figures/full_fig_p097_5_3.png] view at source ↗
Figure 5.4
Figure 5.4. Figure 5.4: Budget-efficiency curves on the 26-task Vericoding pilot set. The left panel plots [PITH_FULL_IMAGE:figures/full_fig_p109_5_4.png] view at source ↗
Figure 5.5
Figure 5.5. Figure 5.5: Captured interactive tool-call usage by benchmark family and tool group. The top [PITH_FULL_IMAGE:figures/full_fig_p110_5_5.png] view at source ↗
Figure 5.6
Figure 5.6. Figure 5.6: Failure-mode diagnostics for the Vericoding pilot and VERINA runs. Bars group [PITH_FULL_IMAGE:figures/full_fig_p111_5_6.png] view at source ↗
read the original abstract

Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. First, we train open-source models in Dafny with RLVR using Group Relative Policy Optimization (GRPO) and related variants, assembling generated candidates into complete programs and scoring them with compiler and verifier outcomes. Initial experiments on an APPS-derived Dafny dataset increased verified reward from 2.2% to 58.1%, but revealed specification hacking, where models exploit weak formal specifications instead of implementing the intended solutions. After filtering underspecified and vulnerable tasks, multi-turn RLVR on the refined benchmark improves the verified pass rate from 9.7% to 31.1%. Second, we develop a verifier-guided inference scaffold in Lean that treats proof generation as structured search over decomposed subgoals, verifier feedback, diagnostics, and repair. With a fixed base model, the full scaffold with proof reviser improves pass rate on an initial VeriCoding pilot set from 46.2% under direct repair to 69.2%. On the larger VERINA dataset, whole-task decomposition plus proof reviser solves 7 of 42 previously unsolved tasks. We also introduce Dalek-Bench, a repository-scale Lean benchmark derived from the Rust $\texttt{curve25519-dalek}$ verification project; preliminary results remain weak, indicating that stronger progress evaluation and task-specific tool-use policies are still needed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript investigates using reinforcement learning from verifiable rewards (RLVR) with Group Relative Policy Optimization (GRPO) to train LLMs on Dafny program generation from an APPS-derived dataset, initially raising verified reward from 2.2% to 58.1% before addressing specification hacking via task filtering; on the refined benchmark, multi-turn RLVR lifts verified pass rate from 9.7% to 31.1%. It further develops a verifier-guided inference scaffold for Lean that decomposes proofs, incorporates feedback and repair, raising pass rate from 46.2% to 69.2% on a VeriCoding pilot with a fixed base model and solving 7 of 42 previously unsolved tasks on VERINA, while introducing the Dalek-Bench repository-scale Lean benchmark.

Significance. If the reported gains prove robust, the work would supply concrete empirical evidence that RLVR and structured verifier-guided search can measurably advance LLM performance on machine-checkable formal verification tasks despite scarce training data. The specific percentage improvements and new benchmark provide falsifiable progress markers in an applied ML-for-formal-methods setting.

major comments (1)
  1. [Abstract (Dafny RLVR paragraph)] Abstract (Dafny RLVR paragraph): the central claim that multi-turn RLVR improves verified pass rate from 9.7% to 31.1% on the refined benchmark is load-bearing for the paper's empirical contribution, yet the filtering of underspecified and vulnerable tasks lacks any quantitative comparison of difficulty metrics, pass-rate statistics, or task distributions between the original and retained sets; without this, the observed delta cannot be confidently attributed to RLVR rather than curation effects.
minor comments (2)
  1. [Abstract] Abstract: the manuscript reports concrete numerical improvements without accompanying error bars, full experimental protocols, or statements on data/code availability, which would be required for reproducibility assessment even if the filtering concern is resolved.
  2. [Abstract] Abstract: the Lean scaffold description would benefit from explicit definition of 'whole-task decomposition' and 'proof reviser' components before reporting the 69.2% and 7/42 figures.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive comment. We address the major concern below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract (Dafny RLVR paragraph)] Abstract (Dafny RLVR paragraph): the central claim that multi-turn RLVR improves verified pass rate from 9.7% to 31.1% on the refined benchmark is load-bearing for the paper's empirical contribution, yet the filtering of underspecified and vulnerable tasks lacks any quantitative comparison of difficulty metrics, pass-rate statistics, or task distributions between the original and retained sets; without this, the observed delta cannot be confidently attributed to RLVR rather than curation effects.

    Authors: We agree that additional quantitative details on the filtering step would help readers isolate the contribution of RLVR. The reported 9.7% to 31.1% delta is measured on the identical refined benchmark (i.e., the baseline and post-RLVR results use the same task set), so the improvement cannot be explained by differences in the task distribution before versus after training. Nevertheless, to address the concern about curation effects on the overall empirical narrative, the revised manuscript will add: (1) the number and percentage of tasks filtered, (2) comparative statistics on proxies for difficulty such as specification length and assertion count between original and retained sets, and (3) base-model pass-rate distributions on both sets. These additions will clarify that filtering targeted specification-hacking vulnerabilities without systematically reducing task difficulty. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training and benchmark results on external verifiers

full rationale

The paper reports measured improvements in verified pass rates from RLVR training (GRPO) and inference scaffolds on APPS-derived Dafny and VERINA/Lean benchmarks. All claims are empirical outcomes scored by external compiler/verifier feedback; no equations, derivations, or 'predictions' are presented that reduce by construction to fitted parameters, self-defined quantities, or self-citation chains. Filtering of tasks is a preprocessing choice whose effect is evaluated on the retained set, not a definitional loop. No load-bearing uniqueness theorems or ansatzes are invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that verifier oracles supply a usable training signal and that the chosen benchmarks after filtering remain representative of verification difficulty.

axioms (2)
  • domain assumption Verifier feedback provides a reliable and non-hackable reward signal for policy optimization
    Underpins the RLVR training loop described in the first contribution.
  • domain assumption Decomposed subgoals plus diagnostic repair improve proof search over direct generation
    Underpins the Lean scaffold results.

pith-pipeline@v0.9.1-grok · 5820 in / 1375 out tokens · 26132 ms · 2026-06-28T23:50:31.080446+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

128 extracted references · 91 canonical work pages · 26 internal anchors

  1. [1]

    en.url: https://www.anthropic

    Detecting and countering misuse of AI: August 2025. en.url: https://www.anthropic. com/news/detecting-countering-misuse-aug-2025?utm_source=chatgpt.com (visited on 04/30/2026)

  2. [2]

    SentinelOne.How SentinelOne’s AI EDR Autonomously Discovered and Stopped Anthropic’s Claude from Executing a Zero Day Supply Chain Attack, Globally. en. Mar. 2026.url: https://www.sentinelone.com/blog/how-sentinelones-ai-edr-autonomously- discovered-and-stopped-anthropics-claude-from-executing-a-zero-day-supply-chain- attack-globally/ (visited on 05/05/2026)

  3. [3]

    en.url: https://codewall.ai/blog/how-we- hacked-mckinseys-ai-platform (visited on 05/05/2026)

    How We Hacked McKinsey’s AI Platform. en.url: https://codewall.ai/blog/how-we- hacked-mckinseys-ai-platform (visited on 05/05/2026)

  4. [4]

    en.url: https: //codewall.ai/blog/ai-vs-ai-how-our-ai-agent-hacked-a-20m-funded-ai-recruiter (visited on 05/05/2026)

    AI vs AI: How Our AI Agent Hacked a $20M-Funded AI Recruiter. en.url: https: //codewall.ai/blog/ai-vs-ai-how-our-ai-agent-hacked-a-20m-funded-ai-recruiter (visited on 05/05/2026)

  5. [5]

    How We Hacked Bain’s Competitive Intelligence Platform.en.url:https://codewall.ai/ blog/how-we-hacked-bains-competitive-intelligence-platform (visited on 05/05/2026)

  6. [6]

    en.url: https://www.ibm.com/reports/data-breach (visited on 05/05/2026)

    Cost of a data breach 2025 | IBM. en.url: https://www.ibm.com/reports/data-breach (visited on 05/05/2026)

  7. [7]

    O’Flaherty.CrowdStrike Reveals What Happened, Why—And What’s Changed

    K. O’Flaherty.CrowdStrike Reveals What Happened, Why—And What’s Changed. en. Section: Cybersecurity.url: https://www.forbes.com/sites/kateoflahertyuk/2024/ 125 08/07/crowdstrike-reveals-what-happened-why-and-whats-changed/ (visited on 12/18/2025)

  8. [8]

    Janardhan.More details about the October 4 outage

    S. Janardhan.More details about the October 4 outage. en-US. Oct. 2021.url: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ (visited on 05/05/2026)

  9. [9]

    Amazon’s cloud unit hit by outage involving AI tools in December

    “Amazon’s cloud unit hit by outage involving AI tools in December.” en.Reuters, Feb. 2026.url: https://www.reuters.com/business/retail-consumer/amazons-cloud- unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20/ (visited on 05/05/2026)

  10. [10]

    An axiomatic basis for computer programming

    C. A. R. Hoare. “An axiomatic basis for computer programming.”Commun. ACM, 12(10), Oct. 1969, pp. 576–580.issn: 0001-0782.doi: 10.1145/363235.363259.url: https://dl.acm.org/doi/10.1145/363235.363259 (visited on 05/05/2026)

  11. [11]

    en.url: https://rocq-prover.org/about (visited on 05/20/2026)

    About The Rocq Prover. en.url: https://rocq-prover.org/about (visited on 05/20/2026)

  12. [12]

    The Lean 4 Theorem Prover and Programming Language (System Description)

    L. de Moura and S. Ullrich. “The Lean 4 Theorem Prover and Programming Language (System Description).” en

  13. [13]

    Dafny: Statically Verifying Functional Correctness

    R. Gauci.Dafny: Statically Verifying Functional Correctness. arXiv:1412.4395 [cs]. Dec. 2014.doi: 10.48550/arXiv.1412.4395.url: http://arxiv.org/abs/1412.4395 (visited on 12/18/2025)

  14. [14]

    Dependent types and multi-monadic effects in F*

    N. Swamy et al. “Dependent types and multi-monadic effects in F*.” en. In:Pro- ceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. St. Petersburg FL USA: ACM, Jan. 2016, pp. 256–270.isbn: 978-1-4503-3549-2.doi: 10.1145/2837614.2837655.url: https://dl.acm.org/doi/10. 1145/2837614.2837655 (visited on 05/05/2026). 126

  15. [15]

    Lattuada, T

    A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y. Zhou, J. Howell, B. Parno, and C. Hawblitzel.Verus: Verifying Rust Programs using Linear Ghost Types (extended version). arXiv:2303.05491 [cs]. Mar. 2023.doi: 10.48550/arXiv.2303.05491.url: http://arxiv.org/abs/2303.05491 (visited on 12/18/2025)

  16. [16]

    J. W. Cutler et al.Cedar: A New Language for Expressive, Fast, Safe, and Analyzable Authorization (Extended Version). arXiv:2403.04651 [cs]. Mar. 2024.doi: 10.48550/ arXiv.2403.04651.url: http://arxiv.org/abs/2403.04651 (visited on 05/05/2026)

  17. [17]

    Anvil: Verifying Liveness of Cluster Management Controllers

    X. Sun et al. “Anvil: Verifying Liveness of Cluster Management Controllers.” en. In: 2024, pp. 649–666.isbn: 978-1-939133-40-3.url: https://www.usenix.org/conference/ osdi24/presentation/sun-xudong (visited on 05/05/2026)

  18. [18]

    VeriSMo: A Verified Security Module for Confidential VMs

    Z. Zhou, Anjali, W. Chen, S. Gong, C. Hawblitzel, and W. Cui. “VeriSMo: A Verified Security Module for Confidential VMs.” en. In: 2024, pp. 599–614.isbn: 978-1-939133- 40-3.url: https://www.usenix.org/conference/osdi24/presentation/zhou (visited on 05/05/2026)

  19. [19]

    The lean mathematical library

    T. mathlib Community. “The lean mathematical library.” In:Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. POPL ’20. ACM, Jan. 2020, pp. 367–381.doi: 10.1145/3372885.3373824.url: http://dx.doi.org/10.1145/3372885.3373824

  20. [20]

    Lean formalization of the main theorem of liquid vector spaces, completed July 2022 in response to Peter Scholze’s challenge

    Lean Prover Community.Liquid Tensor Experiment. Lean formalization of the main theorem of liquid vector spaces, completed July 2022 in response to Peter Scholze’s challenge. GitHub, 2022.url: https://github.com/leanprover-community/lean-liquid (visited on 05/04/2026)

  21. [21]

    Tao and PFR formalization contributors.The Polynomial Freiman–Ruzsa Conjec- ture: A digitisation of the proof in Lean 4

    T. Tao and PFR formalization contributors.The Polynomial Freiman–Ruzsa Conjec- ture: A digitisation of the proof in Lean 4. 2023.url: https://teorth.github.io/pfr/ (visited on 05/04/2026). 127

  22. [22]

    Analysis I

    A Lean companion to “Analysis I” | What’s new.url: https://terrytao.wordpress. com/2025/05/31/a-lean-companion-to-analysis-i/ (visited on 05/05/2026)

  23. [23]

    xenaproject.Formalization of Erdős problems. en. Dec. 2025.url: https://xenaproject. wordpress.com/2025/12/05/formalization-of-erdos-problems/ (visited on 05/05/2026)

  24. [24]

    xenaproject.Beyond the Liquid Tensor Experiment. en. Sept. 2022.url: https: //xenaproject.wordpress.com/2022/09/12/beyond-the-liquid-tensor-experiment/ (visited on 05/05/2026)

  25. [25]

    seL4: formal verification of an OS kernel

    G. Klein et al. “seL4: formal verification of an OS kernel.” In:Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. SOSP09. ACM, Oct. 2009, pp. 207–220.doi: 10.1145/1629575.1629596.url: http://dx.doi.org/10.1145/1629575. 1629596

  26. [26]

    Anthropic, 2025.url: https://www.anthropic.com/product/claude-code (visited on 05/04/2026)

    Anthropic.Claude Code: Anthropic’s agentic coding system. Anthropic, 2025.url: https://www.anthropic.com/product/claude-code (visited on 05/04/2026)

  27. [27]

    Introducing OpenAI o3 and o4-mini | OpenAI.url: https://openai.com/index/ introducing-o3-and-o4-mini/ (visited on 05/05/2026)

  28. [28]

    en.url: https://www

    Project Glasswing: Securing critical software for the AI era. en.url: https://www. anthropic.com/glasswing (visited on 05/05/2026)

  29. [29]

    Our evaluation of Claude Mythos Preview’s cyber capabilities | AISI Work.url: https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber- capabilities (visited on 05/05/2026)

  30. [30]

    Olympiad-level formal mathematical reasoning with reinforcement learning

    T. Hubert et al. “Olympiad-level formal mathematical reasoning with reinforcement learning.”Nature,651(8106), Nov. 2025, pp. 607–613.issn: 1476-4687.doi: 10.1038/ s41586-025-09833-y.url: http://dx.doi.org/10.1038/s41586-025-09833-y

  31. [31]

    Aristotle: IMO-level Automated Theorem Proving

    T. Achim et al. “Aristotle: IMO-level Automated Theorem Proving.”arXiv.org, 2025. arXiv: 2510.01346.url: https://arxiv.org/abs/2510.01346. 128

  32. [32]

    H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data. arXiv:2405.14333 [cs]. May 2024.doi: 10.48550/arXiv.2405.14333.url: http://arxiv.org/abs/2405.14333 (visited on 12/17/2025)

  33. [33]

    Goedel-Prover : A frontier model for open-source automated theorem proving

    Y. Lin et al.Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving. arXiv:2502.07640 [cs]. Apr. 2025.doi: 10.48550/arXiv.2502.07640.url: http://arxiv.org/abs/2502.07640 (visited on 12/17/2025)

  34. [34]

    Seed-Prover : Deep and broad reasoning for automated theorem proving

    L. Chen et al. “Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving.”arXiv.org, 2025. arXiv: 2507.23726.url: https://arxiv.org/abs/2507.23726

  35. [35]

    auto-extract failed; link: https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/

    AlphaProof — DeepMind, 2024 (blog + Nature 2025). auto-extract failed; link: https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/

  36. [36]

    Z. Z. Ren et al.DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition. arXiv:2504.21801 [cs]. July 2025. doi: 10.48550/arXiv.2504.21801.url: http://arxiv.org/abs/2504.21801 (visited on 12/17/2025)

  37. [37]

    Bursuc et al.A benchmark for vericoding: formally verified program synthesis

    S. Bursuc et al.A benchmark for vericoding: formally verified program synthesis. 2025. doi: 10.48550/ARXIV.2509.22908.url: https://arxiv.org/abs/2509.22908

  38. [38]

    Verus proofs over a fork ofcurve25519-dalek

    Beneficial AI Foundation.dalek-lite: A pure-Rust implementation of group operations on Ristretto and Curve25519. Verus proofs over a fork ofcurve25519-dalek. GitHub, 2025.url: https://github.com/Beneficial-AI-Foundation/dalek-lite (visited on 05/04/2026)

  39. [39]

    Varambally, T

    S. Varambally, T. Voice, Y. Sun, Z. Chen, R. Yu, and K. Ye.Hilbert: Recursively Building Formal Proofs with Informal Reasoning. 2025.doi: 10.48550/ARXIV.2509. 22819.url: https://arxiv.org/abs/2509.22819. 129

  40. [40]

    Dafny: An Automatic Program Verifier for Functional Correctness

    K. R. M. Leino. “Dafny: An Automatic Program Verifier for Functional Correctness.” In:Logic for Programming, Artificial Intelligence, and Reasoning. Springer Berlin Heidelberg, 2010, pp. 348–370.isbn: 9783642175114.doi: 10.1007/978-3-642-17511- 4_20.url: http://dx.doi.org/10.1007/978-3-642-17511-4_20

  41. [41]

    en.url: https://dafny-lang.github.io/dafny/latest/DafnyRef/ DafnyRef.html (visited on 05/20/2026)

    Dafny Documentation. en.url: https://dafny-lang.github.io/dafny/latest/DafnyRef/ DafnyRef.html (visited on 05/20/2026)

  42. [42]

    A Piecewise Rotation of the Circle, IPR Maps and Their Connection with Translation Surfaces

    L. de Moura, S. Kong, J. Avigad, F. van Doorn, and J. von Raumer. “The Lean Theorem Prover (System Description).” In:Automated Deduction - CADE-25. Springer International Publishing, 2015, pp. 378–388.isbn: 9783319214016.doi: 10.1007/978- 3-319-21401-6_26.url: http://dx.doi.org/10.1007/978-3-319-21401-6_26

  43. [43]

    Springer Berlin Heidelberg, 2002.isbn: 9783540459491

    Isabelle/HOL. Springer Berlin Heidelberg, 2002.isbn: 9783540459491.doi: 10.1007/3- 540-45949-9.url: http://dx.doi.org/10.1007/3-540-45949-9

  44. [44]

    TacticToe: Learn- ing to Prove with Tactics

    T. Gauthier, C. Kaliszyk, J. Urban, R. Kumar, and M. Norrish. “TacticToe: Learn- ing to Prove with Tactics.”Journal of Automated Reasoning,65(2), Feb. 2021. arXiv:1804.00596 [cs.AI], pp. 257–286.issn: 0168-7433, 1573-0670.doi: 10.1007/ s10817-020-09580-x.url: http://arxiv.org/abs/1804.00596 (visited on 05/28/2026)

  45. [45]

    Learning to Prove Theorems via Interacting with Proof Assistants

    K. Yang and J. Deng.Learning to Prove Theorems via Interacting with Proof Assistants. arXiv:1905.09381 [cs.LO]. May 2019.doi: 10.48550/arXiv.1905.09381.url: http: //arxiv.org/abs/1905.09381 (visited on 05/28/2026)

  46. [46]

    The Tactician (extended version): A Seam- less, Interactive Tactic Learner and Prover for Coq

    L. Blaauwbroek, J. Urban, and H. Geuvers. “The Tactician (extended version): A Seam- less, Interactive Tactic Learner and Prover for Coq.” In: vol. 12236. arXiv:2008.00120 [cs.AI]. 2020, pp. 271–277.doi: 10.1007/978-3-030-53518-6_17.url: http: //arxiv.org/abs/2008.00120 (visited on 05/28/2026)

  47. [47]

    Generative Language Modeling for Automated Theorem Proving

    S. Polu and I. Sutskever. “Generative Language Modeling for Automated Theorem Proving.”arXiv.org, 2020. arXiv: 2009.03393.url: https://arxiv.org/abs/2009.03393. 130

  48. [48]

    Proof Artifact Co-training for Theorem Proving with Language Models

    J. M. Han, J. M. Rute, Y. Wu, E. W. Ayers, and S. Polu. “Proof Artifact Co-training for Theorem Proving with Language Models.”International Conference on Learning Representations, 2021. arXiv: 2102.06203.url: https://arxiv.org/abs/2102.06203

  49. [49]

    Formal mathematics statement curriculum learning

    S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever. “Formal Mathematics Statement Curriculum Learning.”International Conference on Learning Representations, 2022. arXiv: 2202.01344.url: https://arxiv.org/abs/2202.01344

  50. [50]

    Lample, M.-A

    G. Lample, M. Lachaux, T. Lavril, X. Martinet, A. Hayat, G. Ebner, A. Rodriguez, and T. Lacroix. “HyperTree Proof Search for Neural Theorem Proving.”Neural Information Processing Systems, 2022. arXiv: 2205.11491.url: https://arxiv.org/abs/2205.11491

  51. [51]

    K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, and A. Anandkumar.LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. 2023.doi: 10.48550/ARXIV.2306.15626.url: https://arxiv.org/abs/2306. 15626

  52. [52]

    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

    K. Zheng, J. M. Han, and S. Polu. “MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics.”International Conference on Learning Representations,

  53. [53]

    arXiv: 2109.00110.url: https://arxiv.org/abs/2109.00110

  54. [54]

    Llemma: An Open Language Model For Mathematics

    Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. “Llemma: An Open Language Model For Mathematics.” International Conference on Learning Representations, 2023. arXiv: 2310.10631.url: https://arxiv.org/abs/2310.10631

  55. [55]

    Lean-STaR: Learning to Interleave Thinking and Proving

    H. Lin, Z. Sun, Y. Yang, and S. Welleck. “Lean-STaR: Learning to Interleave Thinking and Proving.”International Conference on Learning Representations, 2024. arXiv: 2407.10040.url: https://arxiv.org/abs/2407.10040

  56. [56]

    Xin et al.DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Rein- forcement Learning and Monte-Carlo Tree Search

    H. Xin et al.DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Rein- forcement Learning and Monte-Carlo Tree Search. arXiv:2408.08152 [cs]. Aug. 2024. 131 doi: 10.48550/arXiv.2408.08152.url: http://arxiv.org/abs/2408.08152 (visited on 05/05/2026)

  57. [58]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    H. Wang et al. “Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning.”arXiv.org, 2025. arXiv: 2504.11354.url: https://arxiv. org/abs/2504.11354

  58. [59]

    miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

    A. Ospanov, F. Farnia, and R. Yousefzadeh. “miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward.”arXiv.org, 2025. arXiv: 2511.03108.url: https://arxiv.org/abs/2511.03108

  59. [60]

    RLMEval: Evaluating Research-Level Neural Theorem Proving

    A. Poiroux, A. Bosselut, and V. Kuncak. “RLMEval: Evaluating Research-Level Neural Theorem Proving.”Conference on Empirical Methods in Natural Language Processing,

  60. [61]

    arXiv: 2510.25427.url: https://arxiv.org/abs/2510.25427

  61. [62]

    VERINA: Benchmarking Verifiable Code Generation

    Z. Ye, Z. Yan, J. He, T. Kasriel, K. Yang, and D. Song. “VERINA: Benchmarking Verifiable Code Generation.”arXiv.org, 2025. arXiv: 2505.23135.url: https://arxiv. org/abs/2505.23135

  62. [63]

    auto-extract failed; link:

    CLEVER + VERINA — 2025 (Lean end-to-end vericoding benchmarks). auto-extract failed; link:

  63. [64]

    Loughridge, Q

    C. Loughridge, Q. Sun, S. Ahrenbach, F. Cassano, C. Sun, Y. Sheng, A. Mudide, M. R. H. Misu, N. Amin, and M. Tegmark.DafnyBench: A Benchmark for Formal Software Verification. 2024.doi: 10.48550/ARXIV.2406.08467.url: https://arxiv. org/abs/2406.08467

  64. [65]

    Towards AI-Assisted Synthesis of Verified Dafny Methods

    M. R. H. Misu, C. V. Lopes, I. Ma, and J. Noble. “Towards AI-Assisted Synthesis of Verified Dafny Methods.”Proceedings of the ACM on Software Engineering,1(FSE), 132 2024, pp. 812–835.issn: 2994-970X.doi: 10.1145/3643763.url: http://dx.doi.org/10. 1145/3643763

  65. [66]

    Laurel: Unblocking Automated Verification with Large Language Models

    E. Mugnier, E. A. Gonzalez, N. Polikarpova, R. Jhala, and Z. Yuanyuan. “Laurel: Unblocking Automated Verification with Large Language Models.”Proceedings of the ACM on Programming Languages,9(OOPSLA1), Apr. 2025, pp. 1519–1545.issn: 2475-1421.doi: 10.1145/3720499.url: http://dx.doi.org/10.1145/3720499

  66. [67]

    Clover: Closed-Loop Verifiable Code Generation

    C. Sun, Y. Sheng, O. Padon, and C. W. Barrett. “Clover: Closed-Loop Verifiable Code Generation.”SAIV, 2023. arXiv: 2310.17807.url: https://arxiv.org/abs/2310.17807

  67. [68]

    Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

    S. Chakraborty, G. Ebner, S. Bhat, S. Fakhoury, S. Fatima, S. K. Lahiri, and N. Swamy. “Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming.” International Conference on Software Engineering, 2024. arXiv: 2405.01787.url: https://arxiv.org/abs/2405.01787

  68. [69]

    Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity

    D. Zhang, J. Wang, and T. Sun. “Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity.” In:Findings of the Association for Computational Linguistics: ACL 2025. Ed. by W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar. Vienna, Austria: Association for Computational Linguistics, July 2025, pp. 23101–23118.isbn: 979-8-89176...

  69. [70]

    What’s in a Proof? Analyzing Expert Proof-Writing Processes in F* and Verus

    R. Jain, S. Barke, G. Ebner, M. R. H. Misu, S. Lu, and S. Fakhoury. “What’s in a Proof? Analyzing Expert Proof-Writing Processes in F* and Verus.”arXiv.org, 2025. arXiv: 2508.02733.url: https://arxiv.org/abs/2508.02733

  70. [71]

    In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE).pp.675–675.IEEEComputerSociety,LosAlamitos,CA,USA(May2025)

    K. Thompson, N. Saavedra, P. Carrott, K. Fisher, A. Sanchez-Stern, Y. Brun, J. F. Ferreira, S. Lerner, and E. First. “Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification.” In:2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, Apr. 2025, pp. 347–359.doi: 10.1109/ icse55347.2025.00161.url: http://dx...

  71. [72]

    AutoVerus: Automated Proof Generation for Rust Code

    C. Yang et al. “AutoVerus: Automated Proof Generation for Rust Code.”Proc. ACM Program. Lang., 2024. arXiv: 2409.13082.url: https://arxiv.org/abs/2409.13082

  72. [73]

    Automated Proof Generation for Rust Code via Self-Evolution

    T. Chen et al. “Automated Proof Generation for Rust Code via Self-Evolution.” International Conference on Learning Representations, 2024. arXiv: 2410.15756.url: https://arxiv.org/abs/2410.15756

  73. [74]

    VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus

    C. Sun, Y. Sun, D. Amrollahi, E. Zhang, S. K. Lahiri, S. Lu, D. Dill, and C. W. Barrett. “VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus.” arXiv.org, 2025. arXiv: 2510.25015.url: https://arxiv.org/abs/2510.25015

  74. [76]

    arXiv: 2603.25810.url: https://arxiv.org/abs/2603.25810

  75. [77]

    Proof Automation with Large Language Models

    M. Lu, B. Delaware, and T. Zhang. “Proof Automation with Large Language Models.” International Conference on Automated Software Engineering, 2024. arXiv: 2409.14274. url: https://arxiv.org/abs/2409.14274

  76. [78]

    Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

    Z. Li, Z. Yang, D. He, H. Zhao, A. Zhao, S. Tang, K. Yang, A. Gupta, Z. Su, and C. Jin. “Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification.”arXiv preprint arXiv:2603.19329, 2026. arXiv: 2603.19329.url: https://arxiv.org/abs/2603.19329

  77. [79]

    Aniva, C

    L. Aniva, C. Sun, B. Miranda, C. Barrett, and S. Koyejo.Pantograph: A Machine-to- Machine Interaction Interface for Advanced Theorem Proving, High Level Reasoning, and Data Extraction in Lean 4. 2024.doi: 10.48550/ARXIV.2410.16429.url: https://arxiv.org/abs/2410.16429

  78. [80]

    Auto- formalization with Large Language Models

    Y. Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy. “Auto- formalization with Large Language Models.”Neural Information Processing Systems,

  79. [81]

    arXiv: 2205.12615.url: https://arxiv.org/abs/2205.12615. 134

  80. [82]

    Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

    A. Q. Jiang, S. Welleck, J. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample. “Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs.”International Conference on Learning Representations, 2022. arXiv: 2210.12283.url: https://arxiv.org/abs/2210.12283

Showing first 80 references.