Agentic Proving for Program Verification

Akhil Arora; Alessandro Sosso; Bas Spitters

REVIEW 3 major objections 2 minor 38 references

Claude Code in an agentic framework produces valid specifications for 98.8 percent of CLEVER problems and achieves 98.1 percent end-to-end verification success.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-25 04:00 UTC pith:7LYMBPRD

load-bearing objection Claude posts high numbers on CLEVER but the evaluation rests on the benchmark's own scoring and manual checks even with known dataset bugs. the 3 major comments →

arxiv 2605.23772 v1 pith:7LYMBPRD submitted 2026-05-22 cs.AI cs.LOcs.PLcs.SE

Agentic Proving for Program Verification

Alessandro Sosso , Akhil Arora , Bas Spitters This is my paper

classification cs.AI cs.LOcs.PLcs.SE

keywords agentic provingprogram verificationLean 4CLEVER benchmarkformal methodsautomated theorem provingClaude model

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates an agentic proving system based on Claude on the CLEVER Lean 4 benchmark for verifiable code generation. It reports high rates of generating valid specifications and certifying implementations, along with effective self-feedback. This leads the authors to conclude that current benchmarks are mismatched to modern prover capabilities and that compiler-in-the-loop agentic methods work best for program verification. A sympathetic reader would see this as evidence that automated verification is becoming practical on standard test sets.

Core claim

Claude generates arguably valid specifications for 98.8% of problems, with 81.3% accepted by isomorphism-based scoring, certifies implementations for 87.5% against ground-truth specs, and reaches 98.1% success on the full pipeline for self-consistent entries, while providing high-quality feedback and identifying dataset bugs.

What carries the argument

The agentic proving framework that integrates Claude Code with iterative compiler feedback and manual review for Lean 4 specification and implementation generation on the CLEVER benchmark.

Load-bearing premise

The evaluation treats CLEVER's ground-truth specifications and isomorphism scoring plus manual checks as a reliable measure of actual verification correctness even though the paper notes bugs in the dataset.

What would settle it

A concrete counterexample would be a generated specification that passes both the agent's validation and CLEVER's scoring but is shown by independent proof or model checking to be incorrect for the intended program behavior.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

High success rates indicate that tight integration of the compiler in the loop is the most effective approach for foundational program verification.
Existing benchmarks like CLEVER no longer pose sufficient challenge for state-of-the-art agentic provers.
Manual review confirms the quality of the agent's self-generated feedback on failures.
Alternatives to isomorphism-based scoring are needed for more rigorous evaluation of generated specifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar agentic setups could be tested on harder or more diverse program verification tasks beyond CLEVER.
The identification of lingering bugs in the dataset suggests that future benchmarks should incorporate automated bug detection as part of their design.
These results may imply that program verification can scale to real-world codebases if benchmarks are updated accordingly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Claude posts high numbers on CLEVER but the evaluation rests on the benchmark's own scoring and manual checks even with known dataset bugs.

read the letter

The main thing here is that the paper shows Claude achieving high success rates in an agentic setup on the CLEVER benchmark for Lean program verification, but the evaluation relies on the benchmark's ground-truth specifications and isomorphism scoring despite the authors noting bugs in the dataset. What the paper does that is new is applying existing agentic methods to this program verification task and reporting specific performance figures. It finds that Claude generates arguably valid specs for 98.8% of problems, with 81.3% also passing the isomorphism check on the correct subset. It certifies implementations against ground-truth specs for 87.5% of problems and hits 98.1% on the full pipeline for self-consistent cases. The authors also observe that the model gives high-quality feedback on its attempts, which helps identify failure causes and dataset bugs. This provides empirical evidence that agentic approaches with compiler feedback work well here. The paper does well in highlighting the mismatch between existing benchmarks and current agent capabilities, and in calling for better, bug-resilient evaluation methods. The soft spots are in how the results are measured. The headline numbers depend on CLEVER's own mechanisms and manual review of feedback. With admitted lingering bugs in the dataset, it's possible that some incorrect specs or implementations are counted as successes. The paper does not describe an independent verification process outside the benchmark for the full set of claimed wins, so the performance claims have some uncertainty attached. This work is aimed at people working on automated verification and AI agents for formal methods. A reader looking for data on how LLMs perform on program verification benchmarks will get something useful from the numbers and the discussion of feedback. It deserves a serious referee. The empirical contribution is clear enough to warrant review, even if the evaluation needs closer scrutiny. I recommend sending it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper evaluates an agentic proving framework using Claude Code on the CLEVER Lean 4 benchmark for verifiable code generation. It reports that Claude produces arguably valid specifications for 98.8% of problems (81.3% also accepted by CLEVER's isomorphism scoring on the correct subset), certifies implementations against ground-truth specs for 87.5% of problems, and achieves 98.1% end-to-end success on entries with self-consistent premises. The work additionally claims high-quality self-feedback from the model, identifies dataset bugs via manual review, and argues that existing benchmarks are mismatched to current agentic prover capabilities, calling for alternatives to isomorphism-based scoring.

Significance. If the performance numbers prove robust under stricter verification, the results would indicate that tight compiler-in-the-loop agentic systems can already saturate current program-verification benchmarks, providing concrete evidence of a capability-benchmark gap and motivating the development of harder, bug-resilient evaluation suites. The paper's explicit discussion of dataset limitations and scoring weaknesses is a constructive contribution.

major comments (3)

[Evaluation] Evaluation section: the 98.8% 'arguably valid' figure rests on manual review whose criteria, reviewer count, and inter-rater reliability are not reported; this subjectivity directly supports the headline performance claim and requires explicit documentation or an automated proxy.
[Results] Results section: success rates (98.8%, 87.5%, 98.1%) are stated without dataset cardinality, confidence intervals, or per-category failure breakdown, making it impossible to assess whether the numbers are statistically distinguishable from ceiling effects.
[Discussion] Discussion: the paper acknowledges lingering bugs in CLEVER yet still computes all metrics against the benchmark's ground-truth specifications and isomorphism scorer; no sensitivity analysis quantifies how many reported successes could be artifacts of those bugs.

minor comments (2)

[Abstract] Abstract: the phrase 'correct portion of the benchmark' is used without a forward reference to the filtering procedure described later in the text.
[Results] The manuscript would benefit from an explicit statement of total problem count and the size of the 'self-consistent premises' subset used for the 98.1% figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the significance of the capability-benchmark gap highlighted in our work. We address each major comment point by point below, indicating the revisions we will incorporate.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the 98.8% 'arguably valid' figure rests on manual review whose criteria, reviewer count, and inter-rater reliability are not reported; this subjectivity directly supports the headline performance claim and requires explicit documentation or an automated proxy.

Authors: We agree that the manuscript lacks explicit documentation of the manual review process. In the revised version we will add a dedicated paragraph in the Evaluation section specifying the review criteria (Lean 4 syntactic validity, semantic plausibility relative to the problem statement, and absence of obvious logical contradictions), the number of reviewers (two authors performed independent reviews), and the disagreement-resolution procedure (discussion until consensus). An automated proxy remains difficult because the 'arguably valid' category is defined precisely by cases that the isomorphism scorer cannot capture. revision: yes
Referee: [Results] Results section: success rates (98.8%, 87.5%, 98.1%) are stated without dataset cardinality, confidence intervals, or per-category failure breakdown, making it impossible to assess whether the numbers are statistically distinguishable from ceiling effects.

Authors: The CLEVER benchmark comprises 160 problems. We will revise the Results section to report exact numerators and denominators for each percentage, append binomial (Wilson) confidence intervals, and insert a table providing per-category failure counts and qualitative descriptions. These additions will make proximity to ceiling performance directly evaluable. revision: yes
Referee: [Discussion] Discussion: the paper acknowledges lingering bugs in CLEVER yet still computes all metrics against the benchmark's ground-truth specifications and isomorphism scorer; no sensitivity analysis quantifies how many reported successes could be artifacts of those bugs.

Authors: A full quantitative sensitivity analysis would require exhaustive re-auditing and correction of every ground-truth specification in the benchmark, which is outside the scope of the present study. In the revision we will expand the Discussion to provide a qualitative accounting of how the bugs we manually identified affected individual cases, restate that all reported metrics follow the benchmark's official protocol, and explicitly flag the absence of a complete sensitivity analysis as a limitation and a priority for future benchmark work. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper reports success rates (98.8%, 87.5%, 98.1%) from running Claude Code on the external CLEVER Lean 4 benchmark. No derivations, equations, fitted parameters, or first-principles claims appear in the provided text. All metrics are computed directly against the benchmark's provided ground-truth specifications and isomorphism-based scoring, which are external inputs. The paper notes dataset bugs and calls for better evaluation methods, but this does not create any self-referential reduction or load-bearing self-citation chain. The evaluation is self-contained against the external dataset with no internal quantities redefined as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical evaluation paper whose central claims rest on the representativeness of the CLEVER benchmark and the reliability of its scoring plus manual review.

axioms (1)

domain assumption CLEVER benchmark and its isomorphism-based scoring provide a valid measure of program verification success
The reported success rates depend on this benchmark being an appropriate testbed; the paper itself flags bugs in the dataset.

pith-pipeline@v0.9.0 · 5758 in / 1241 out tokens · 25160 ms · 2026-05-25T04:00:16.161151+00:00 · methodology

0 comments

read the original abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

Figures

Figures reproduced from arXiv: 2605.23772 by Akhil Arora, Alessandro Sosso, Bas Spitters.

**Figure 2.** Figure 2: Distribution of output flags for SPEC_ISO runs. The following color scheme will be used throughout the paper: green for OK entries, red for FAIL , yellow for MISMATCH , blue for ISSUE in benchmark components, orange for those in generated components, purple for issues in both. 3.2 Implementation certification From ground-truth specifications. Following the original CLEVER protocol, the first run of this st… view at source ↗

**Figure 3.** Figure 3: Distribution of output flags for PROOF_GEN runs. Note the change of coloring for specification ISSUE between rows, as their source changes from the ground-truth to automatic generation. 4 Discussion 4.1 Specification certification Autoformalization is hard to evaluate. Beyond the technical hurdle of producing a Lean specification that compiles, two more substantive issues dominate: choosing the right stren… view at source ↗

**Figure 4.** Figure 4: Overview of prover performance on the CLEVER and VERINA benchmarks. Solid bars indicate the number of tests solved on the verified-correct portions of each dataset, while the patterned bar segments above show additional solved instances obtained after correcting erroneous benchmark entries. Numbers annotated with a ‘+’ denote the corresponding count of these additional solves. 15 [PITH_FULL_IMAGE:figures/… view at source ↗

**Figure 5.** Figure 5: Output flag for each problem ID across the three [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Breakdown of results for the Multi-spec SPEC_ISO run. Output flag for tested problems in the Multi-spec SPEC_ISO run: overall result and detailed per-spec flagging. Color scheme corresponds to the one in [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Scatter plot of attempt costs (USD) vs. runtimes (s), across all tasks and runs per task. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Fix of CLEVER problem 32 generated by Claude Code. At specification certification, no run produced a generated_spec that proved isomorphic to the ground-truth one. The reason was systematic: every generated specification required the implementation to return an exact root of the polynomial, whereas the ground-truth specification only required an approximate root within a fixed tolerance. The SPEC_GEN agen… view at source ↗

**Figure 9.** Figure 9: Extract of Claude’s replies explaining the edits it made in fixing problem 32. [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 6 internal anchors

[1]

Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern- Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Claude-agent-sdk-python

Anthropic. Claude-agent-sdk-python. GitHub repository anthropics/claude-agent-sdk-python. Version 0.1.52, commit566e41f

work page
[3]

Mizar: State-of-the-art and beyond

Grzegorz Bancerek, Czesław Byli´nski, Adam Grabowski, Artur Korniłowicz, Roman Matuszewski, Adam Naumowicz, Karol P ˛ ak, and Josef Urban. Mizar: State-of-the-art and beyond. In Manfred Kerber, Jacques Carette, Cezary Kaliszyk, Florian Rabe, and V olker Sorge (eds.),Intelligent Computer Mathe- matics, pp. 261–279, Cham, 2015. Springer International Publis...

work page doi:10.1007/978-3-319-20615-8_17 2015
[4]

A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025

Barı¸ s Bayazıt, Yao Li, and Xujie Si. A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025

work page arXiv 2025
[5]

Brown, and Josef Urban

Mario Carneiro, Chad E. Brown, and Josef Urban. Automated theorem proving for metamath. In Adam Naumowicz and René Thiemann (eds.),14th International Conference on Interactive Theo- rem Proving (ITP 2023), volume 268 ofLeibniz International Proceedings in Informatics (LIPIcs), pp. 9:1–9:19, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für I...

work page doi:10.4230/lipics.itp.2023.9 2023
[6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Wenhu Chen, Shuo Wei, Hao Wang, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv:2507.23726, 2025

work page Pith review arXiv 2025
[8]

LeanHammer

Joshua Clune. LeanHammer. GitHub repository JOSHCLUNE/LeanHammer. Version 4.26.0, commit 87e1bba

work page
[9]

Lean-lsp-mcp

Oliver Dressler. Lean-lsp-mcp. GitHub repository oOo0oOo/lean-lsp-mcp. Version 0.25.1, commit 246e2ad

work page
[10]

Certified Program Synthesis with a Multi-Modal Verifier

Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein, Vitaly Kurin, George Pîrlea, Qiyuan Zhao, Peter Müller, and Ilya Sergey. Certified program synthesis with a multi-modal verifier.arXiv:2604.16584, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Lean4-skills

Cameron Freer. Lean4-skills. GitHub repository cameronfreer/lean4-skills. Version 4.4.5, commit a1cef7d. (preliminary experiments of Appendix A on the earlier commit5d6bd03)

work page
[12]

Hol light: A tutorial introduction

John Harrison. Hol light: A tutorial introduction. InFormal Methods in Computer-Aided Design (FMCAD), 1996

work page 1996
[13]

Yichen Huang and Lin F. Yang. Winning Gold at IMO 2025 with a Model-Agnostic Verification-and- Refinement Pipeline.arXiv:2507.15855, 2025

work page arXiv 2025
[14]

doi:10.1038/s41586-025-09833-y , url =

Thomas Hubert, Harsh Mehta, Laurent Sartran, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, 2025. DOI:10.1038/s41586-025-09833-y

work page doi:10.1038/s41586-025-09833-y 2025
[15]

Shuvendu K. Lahiri. Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents.arXiv:2603.17150, 2026

work page arXiv 2026
[16]

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al

Zenan Li, Ziran Yang, Deyuan He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, and Chi Jin. Goedel-Code-Prover: Hierarchical proof search for open state-of-the-art code verification.arXiv:2603.19329, 2026. 11

work page arXiv 2026
[17]

Aesop: White-Box Best-First Proof Search for Lean

Jannis Limperg and Asta Halkjær From. Aesop: White-Box Best-First Proof Search for Lean. In Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2023, pp. 253–266, New York, NY , USA, January 2023. Association for Computing Machinery. ISBN 979-8-4007-0026-2. DOI:10.1145/3573105.3575671

work page doi:10.1145/3573105.3575671 2023
[18]

Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024

work page arXiv 2024
[19]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.arXiv:2508.03...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Numina-lean-agent: An open and general agentic reasoning system for formal mathematics.arXiv preprint arXiv:2601.14027, 2026

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.arXiv:2601.14027, 2026

work page arXiv 2026
[21]

The Lean 4 Theorem Prover and Programming Language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pp. 625–635, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-79875-8. DOI:10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021
[22]

Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science

Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science. Springer, 2002

work page 2002
[23]

PhD thesis, Chalmers University of Technology and Göteborg University, 2007

Ulf Norell.Towards a Practical Programming Language Based on Dependent Type Theory. PhD thesis, Chalmers University of Technology and Göteborg University, 2007

work page 2007
[24]

Canonical for Automated Theorem Proving in Lean

Chase Norman and Jeremy Avigad. Canonical for Automated Theorem Proving in Lean. arXiv:2504.06239, 2025

work page arXiv 2025
[25]

Lean-lsp-mcp

Project Numina. Lean-lsp-mcp. GitHub repositoryproject-numina/lean-lsp-mcp. Accessed: 2026- 02-03

work page 2026
[26]

Numina-lean-agent

Project Numina. Numina-lean-agent. GitHub repository project-numina/numina-lean-agent. Com- mit4c8e039

work page
[27]

Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

The rocq prover.Project website, 2026.https://rocq-prover.org

Rocq Development Team. The rocq prover.Project website, 2026.https://rocq-prover.org

work page 2026
[29]

A brief overview of HOL4

Konrad Slind and Michael Norrish. A brief overview of HOL4. InTheorem Proving in Higher Order Logics (TPHOLs 2008), volume 5170 ofLecture Notes in Computer Science, pp. 28–32. Springer, 2008. DOI:10.1007/978-3-540-71067-7_6

work page doi:10.1007/978-3-540-71067-7_6 2008
[30]

An in-context learning agent for formal theorem-proving

Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv:2310.04353, 2024

work page arXiv 2024
[31]

Thakur, J

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. CLEVER: A Curated Benchmark for Formally Verified Code Generation.arXiv:2505.13938, 2025

work page arXiv 2025
[32]

Lean FRO, 2024

The Lean Developers.The Lean Language Reference: The grind tactic. Lean FRO, 2024. URL https://lean-lang.org/doc/reference/latest/The--grind--tactic/ . Lean version 4.15.0 or later

work page 2024
[33]

Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025

Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, and Marc Lelarge. Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025

work page arXiv 2025
[34]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025

Qiyuan Xu, Renxi Wang, Peixin Wang, Haonan Li, and Conrad Watt. A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025

work page arXiv 2025
[36]

Verina: Benchmarking verifiable code generation

Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. VERINA: Bench- marking Verifiable Code Generation.arXiv:2505.23135, 2025. 13 A Preliminary Experiments In prior experiments, the authors evaluated a suite of automated provers on proof generation on the CLEVER[ 31] and VERINA[ 36] datasets, finding that agentic paradigms substan...

work page arXiv 2025
[37]

This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions

findBound: Finds a bound b (starting from 1, doubling up to 2 100) such that poly(b) and poly(-b) have opposite signs. This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions

work page
[38]

Each step halves the interval containing the root

bisect: Standard bisection on [-bound, bound] with 1,000,000 iterations. Each step halves the interval containing the root. After n steps, the interval has width 2·bound / 2 n, giving astronomically precise convergence – far more than needed for eps = 10-6. Verification against test cases: - [1, 2] (1 + 2x): bisect finds mid = -0.5 where poly = 0 exactly ...

work page

[1] [1]

Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern- Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and L...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Claude-agent-sdk-python

Anthropic. Claude-agent-sdk-python. GitHub repository anthropics/claude-agent-sdk-python. Version 0.1.52, commit566e41f

work page

[3] [3]

Mizar: State-of-the-art and beyond

Grzegorz Bancerek, Czesław Byli´nski, Adam Grabowski, Artur Korniłowicz, Roman Matuszewski, Adam Naumowicz, Karol P ˛ ak, and Josef Urban. Mizar: State-of-the-art and beyond. In Manfred Kerber, Jacques Carette, Cezary Kaliszyk, Florian Rabe, and V olker Sorge (eds.),Intelligent Computer Mathe- matics, pp. 261–279, Cham, 2015. Springer International Publis...

work page doi:10.1007/978-3-319-20615-8_17 2015

[4] [4]

A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025

Barı¸ s Bayazıt, Yao Li, and Xujie Si. A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025

work page arXiv 2025

[5] [5]

Brown, and Josef Urban

Mario Carneiro, Chad E. Brown, and Josef Urban. Automated theorem proving for metamath. In Adam Naumowicz and René Thiemann (eds.),14th International Conference on Interactive Theo- rem Proving (ITP 2023), volume 268 ofLeibniz International Proceedings in Informatics (LIPIcs), pp. 9:1–9:19, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für I...

work page doi:10.4230/lipics.itp.2023.9 2023

[6] [6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Wenhu Chen, Shuo Wei, Hao Wang, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv:2507.23726, 2025

work page Pith review arXiv 2025

[8] [8]

LeanHammer

Joshua Clune. LeanHammer. GitHub repository JOSHCLUNE/LeanHammer. Version 4.26.0, commit 87e1bba

work page

[9] [9]

Lean-lsp-mcp

Oliver Dressler. Lean-lsp-mcp. GitHub repository oOo0oOo/lean-lsp-mcp. Version 0.25.1, commit 246e2ad

work page

[10] [10]

Certified Program Synthesis with a Multi-Modal Verifier

Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein, Vitaly Kurin, George Pîrlea, Qiyuan Zhao, Peter Müller, and Ilya Sergey. Certified program synthesis with a multi-modal verifier.arXiv:2604.16584, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

Lean4-skills

Cameron Freer. Lean4-skills. GitHub repository cameronfreer/lean4-skills. Version 4.4.5, commit a1cef7d. (preliminary experiments of Appendix A on the earlier commit5d6bd03)

work page

[12] [12]

Hol light: A tutorial introduction

John Harrison. Hol light: A tutorial introduction. InFormal Methods in Computer-Aided Design (FMCAD), 1996

work page 1996

[13] [13]

Yichen Huang and Lin F. Yang. Winning Gold at IMO 2025 with a Model-Agnostic Verification-and- Refinement Pipeline.arXiv:2507.15855, 2025

work page arXiv 2025

[14] [14]

doi:10.1038/s41586-025-09833-y , url =

Thomas Hubert, Harsh Mehta, Laurent Sartran, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, 2025. DOI:10.1038/s41586-025-09833-y

work page doi:10.1038/s41586-025-09833-y 2025

[15] [15]

Shuvendu K. Lahiri. Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents.arXiv:2603.17150, 2026

work page arXiv 2026

[16] [16]

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al

Zenan Li, Ziran Yang, Deyuan He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, and Chi Jin. Goedel-Code-Prover: Hierarchical proof search for open state-of-the-art code verification.arXiv:2603.19329, 2026. 11

work page arXiv 2026

[17] [17]

Aesop: White-Box Best-First Proof Search for Lean

Jannis Limperg and Asta Halkjær From. Aesop: White-Box Best-First Proof Search for Lean. In Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2023, pp. 253–266, New York, NY , USA, January 2023. Association for Computing Machinery. ISBN 979-8-4007-0026-2. DOI:10.1145/3573105.3575671

work page doi:10.1145/3573105.3575671 2023

[18] [18]

Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024

work page arXiv 2024

[19] [19]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.arXiv:2508.03...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Numina-lean-agent: An open and general agentic reasoning system for formal mathematics.arXiv preprint arXiv:2601.14027, 2026

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.arXiv:2601.14027, 2026

work page arXiv 2026

[21] [21]

The Lean 4 Theorem Prover and Programming Language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pp. 625–635, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-79875-8. DOI:10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021

[22] [22]

Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science

Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science. Springer, 2002

work page 2002

[23] [23]

PhD thesis, Chalmers University of Technology and Göteborg University, 2007

Ulf Norell.Towards a Practical Programming Language Based on Dependent Type Theory. PhD thesis, Chalmers University of Technology and Göteborg University, 2007

work page 2007

[24] [24]

Canonical for Automated Theorem Proving in Lean

Chase Norman and Jeremy Avigad. Canonical for Automated Theorem Proving in Lean. arXiv:2504.06239, 2025

work page arXiv 2025

[25] [25]

Lean-lsp-mcp

Project Numina. Lean-lsp-mcp. GitHub repositoryproject-numina/lean-lsp-mcp. Accessed: 2026- 02-03

work page 2026

[26] [26]

Numina-lean-agent

Project Numina. Numina-lean-agent. GitHub repository project-numina/numina-lean-agent. Com- mit4c8e039

work page

[27] [27]

Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

The rocq prover.Project website, 2026.https://rocq-prover.org

Rocq Development Team. The rocq prover.Project website, 2026.https://rocq-prover.org

work page 2026

[29] [29]

A brief overview of HOL4

Konrad Slind and Michael Norrish. A brief overview of HOL4. InTheorem Proving in Higher Order Logics (TPHOLs 2008), volume 5170 ofLecture Notes in Computer Science, pp. 28–32. Springer, 2008. DOI:10.1007/978-3-540-71067-7_6

work page doi:10.1007/978-3-540-71067-7_6 2008

[30] [30]

An in-context learning agent for formal theorem-proving

Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv:2310.04353, 2024

work page arXiv 2024

[31] [31]

Thakur, J

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. CLEVER: A Curated Benchmark for Formally Verified Code Generation.arXiv:2505.13938, 2025

work page arXiv 2025

[32] [32]

Lean FRO, 2024

The Lean Developers.The Lean Language Reference: The grind tactic. Lean FRO, 2024. URL https://lean-lang.org/doc/reference/latest/The--grind--tactic/ . Lean version 4.15.0 or later

work page 2024

[33] [33]

Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025

Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, and Marc Lelarge. Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025

work page arXiv 2025

[34] [34]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025

Qiyuan Xu, Renxi Wang, Peixin Wang, Haonan Li, and Conrad Watt. A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025

work page arXiv 2025

[36] [36]

Verina: Benchmarking verifiable code generation

Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. VERINA: Bench- marking Verifiable Code Generation.arXiv:2505.23135, 2025. 13 A Preliminary Experiments In prior experiments, the authors evaluated a suite of automated provers on proof generation on the CLEVER[ 31] and VERINA[ 36] datasets, finding that agentic paradigms substan...

work page arXiv 2025

[37] [37]

This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions

findBound: Finds a bound b (starting from 1, doubling up to 2 100) such that poly(b) and poly(-b) have opposite signs. This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions

work page

[38] [38]

Each step halves the interval containing the root

bisect: Standard bisection on [-bound, bound] with 1,000,000 iterations. Each step halves the interval containing the root. After n steps, the interval has width 2·bound / 2 n, giving astronomically precise convergence – far more than needed for eps = 10-6. Verification against test cases: - [1, 2] (1 + 2x): bisect finds mid = -0.5 where poly = 0 exactly ...

work page