Agentic Proving for Program Verification
Pith reviewed 2026-05-25 04:00 UTC · model grok-4.3
The pith
Claude Code in an agentic framework produces valid specifications for 98.8 percent of CLEVER problems and achieves 98.1 percent end-to-end verification success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Claude generates arguably valid specifications for 98.8% of problems, with 81.3% accepted by isomorphism-based scoring, certifies implementations for 87.5% against ground-truth specs, and reaches 98.1% success on the full pipeline for self-consistent entries, while providing high-quality feedback and identifying dataset bugs.
What carries the argument
The agentic proving framework that integrates Claude Code with iterative compiler feedback and manual review for Lean 4 specification and implementation generation on the CLEVER benchmark.
If this is right
- High success rates indicate that tight integration of the compiler in the loop is the most effective approach for foundational program verification.
- Existing benchmarks like CLEVER no longer pose sufficient challenge for state-of-the-art agentic provers.
- Manual review confirms the quality of the agent's self-generated feedback on failures.
- Alternatives to isomorphism-based scoring are needed for more rigorous evaluation of generated specifications.
Where Pith is reading between the lines
- Similar agentic setups could be tested on harder or more diverse program verification tasks beyond CLEVER.
- The identification of lingering bugs in the dataset suggests that future benchmarks should incorporate automated bug detection as part of their design.
- These results may imply that program verification can scale to real-world codebases if benchmarks are updated accordingly.
Load-bearing premise
The evaluation treats CLEVER's ground-truth specifications and isomorphism scoring plus manual checks as a reliable measure of actual verification correctness even though the paper notes bugs in the dataset.
What would settle it
A concrete counterexample would be a generated specification that passes both the agent's validation and CLEVER's scoring but is shown by independent proof or model checking to be incorrect for the intended program behavior.
Figures
read the original abstract
Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates an agentic proving framework using Claude Code on the CLEVER Lean 4 benchmark for verifiable code generation. It reports that Claude produces arguably valid specifications for 98.8% of problems (81.3% also accepted by CLEVER's isomorphism scoring on the correct subset), certifies implementations against ground-truth specs for 87.5% of problems, and achieves 98.1% end-to-end success on entries with self-consistent premises. The work additionally claims high-quality self-feedback from the model, identifies dataset bugs via manual review, and argues that existing benchmarks are mismatched to current agentic prover capabilities, calling for alternatives to isomorphism-based scoring.
Significance. If the performance numbers prove robust under stricter verification, the results would indicate that tight compiler-in-the-loop agentic systems can already saturate current program-verification benchmarks, providing concrete evidence of a capability-benchmark gap and motivating the development of harder, bug-resilient evaluation suites. The paper's explicit discussion of dataset limitations and scoring weaknesses is a constructive contribution.
major comments (3)
- [Evaluation] Evaluation section: the 98.8% 'arguably valid' figure rests on manual review whose criteria, reviewer count, and inter-rater reliability are not reported; this subjectivity directly supports the headline performance claim and requires explicit documentation or an automated proxy.
- [Results] Results section: success rates (98.8%, 87.5%, 98.1%) are stated without dataset cardinality, confidence intervals, or per-category failure breakdown, making it impossible to assess whether the numbers are statistically distinguishable from ceiling effects.
- [Discussion] Discussion: the paper acknowledges lingering bugs in CLEVER yet still computes all metrics against the benchmark's ground-truth specifications and isomorphism scorer; no sensitivity analysis quantifies how many reported successes could be artifacts of those bugs.
minor comments (2)
- [Abstract] Abstract: the phrase 'correct portion of the benchmark' is used without a forward reference to the filtering procedure described later in the text.
- [Results] The manuscript would benefit from an explicit statement of total problem count and the size of the 'self-consistent premises' subset used for the 98.1% figure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the significance of the capability-benchmark gap highlighted in our work. We address each major comment point by point below, indicating the revisions we will incorporate.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: the 98.8% 'arguably valid' figure rests on manual review whose criteria, reviewer count, and inter-rater reliability are not reported; this subjectivity directly supports the headline performance claim and requires explicit documentation or an automated proxy.
Authors: We agree that the manuscript lacks explicit documentation of the manual review process. In the revised version we will add a dedicated paragraph in the Evaluation section specifying the review criteria (Lean 4 syntactic validity, semantic plausibility relative to the problem statement, and absence of obvious logical contradictions), the number of reviewers (two authors performed independent reviews), and the disagreement-resolution procedure (discussion until consensus). An automated proxy remains difficult because the 'arguably valid' category is defined precisely by cases that the isomorphism scorer cannot capture. revision: yes
-
Referee: [Results] Results section: success rates (98.8%, 87.5%, 98.1%) are stated without dataset cardinality, confidence intervals, or per-category failure breakdown, making it impossible to assess whether the numbers are statistically distinguishable from ceiling effects.
Authors: The CLEVER benchmark comprises 160 problems. We will revise the Results section to report exact numerators and denominators for each percentage, append binomial (Wilson) confidence intervals, and insert a table providing per-category failure counts and qualitative descriptions. These additions will make proximity to ceiling performance directly evaluable. revision: yes
-
Referee: [Discussion] Discussion: the paper acknowledges lingering bugs in CLEVER yet still computes all metrics against the benchmark's ground-truth specifications and isomorphism scorer; no sensitivity analysis quantifies how many reported successes could be artifacts of those bugs.
Authors: A full quantitative sensitivity analysis would require exhaustive re-auditing and correction of every ground-truth specification in the benchmark, which is outside the scope of the present study. In the revision we will expand the Discussion to provide a qualitative accounting of how the bugs we manually identified affected individual cases, restate that all reported metrics follow the benchmark's official protocol, and explicitly flag the absence of a complete sensitivity analysis as a limitation and a priority for future benchmark work. revision: partial
Circularity Check
No circularity: purely empirical benchmark evaluation
full rationale
The paper reports success rates (98.8%, 87.5%, 98.1%) from running Claude Code on the external CLEVER Lean 4 benchmark. No derivations, equations, fitted parameters, or first-principles claims appear in the provided text. All metrics are computed directly against the benchmark's provided ground-truth specifications and isomorphism-based scoring, which are external inputs. The paper notes dataset bugs and calls for better evaluation methods, but this does not create any self-referential reduction or load-bearing self-citation chain. The evaluation is self-contained against the external dataset with no internal quantities redefined as predictions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption CLEVER benchmark and its isomorphism-based scoring provide a valid measure of program verification success
Reference graph
Works this paper leans on
-
[1]
Aristotle: IMO-level Automated Theorem Proving
Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern- Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and L...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Claude-agent-sdk-python
Anthropic. Claude-agent-sdk-python. GitHub repository anthropics/claude-agent-sdk-python. Version 0.1.52, commit566e41f
-
[3]
Mizar: State-of-the-art and beyond
Grzegorz Bancerek, Czesław Byli´nski, Adam Grabowski, Artur Korniłowicz, Roman Matuszewski, Adam Naumowicz, Karol P ˛ ak, and Josef Urban. Mizar: State-of-the-art and beyond. In Manfred Kerber, Jacques Carette, Cezary Kaliszyk, Florian Rabe, and V olker Sorge (eds.),Intelligent Computer Mathe- matics, pp. 261–279, Cham, 2015. Springer International Publis...
-
[4]
Barı¸ s Bayazıt, Yao Li, and Xujie Si. A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025
-
[5]
Mario Carneiro, Chad E. Brown, and Josef Urban. Automated theorem proving for metamath. In Adam Naumowicz and René Thiemann (eds.),14th International Conference on Interactive Theo- rem Proving (ITP 2023), volume 268 ofLeibniz International Proceedings in Informatics (LIPIcs), pp. 9:1–9:19, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für I...
-
[6]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Wenhu Chen, Shuo Wei, Hao Wang, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv:2507.23726, 2025
-
[8]
LeanHammer
Joshua Clune. LeanHammer. GitHub repository JOSHCLUNE/LeanHammer. Version 4.26.0, commit 87e1bba
-
[9]
Lean-lsp-mcp
Oliver Dressler. Lean-lsp-mcp. GitHub repository oOo0oOo/lean-lsp-mcp. Version 0.25.1, commit 246e2ad
-
[10]
Certified Program Synthesis with a Multi-Modal Verifier
Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein, Vitaly Kurin, George Pîrlea, Qiyuan Zhao, Peter Müller, and Ilya Sergey. Certified program synthesis with a multi-modal verifier.arXiv:2604.16584, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
Lean4-skills
Cameron Freer. Lean4-skills. GitHub repository cameronfreer/lean4-skills. Version 4.4.5, commit a1cef7d. (preliminary experiments of Appendix A on the earlier commit5d6bd03)
-
[12]
Hol light: A tutorial introduction
John Harrison. Hol light: A tutorial introduction. InFormal Methods in Computer-Aided Design (FMCAD), 1996
1996
- [13]
-
[14]
URL https://aclanthology.org/2025
Thomas Hubert, Harsh Mehta, Laurent Sartran, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, 2025. DOI:10.1038/s41586-025-09833-y
- [15]
-
[16]
Goedel-code-prover: Hierarchical proof search for open state-of-the-art code verification
Zenan Li, Ziran Yang, Deyuan He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, and Chi Jin. Goedel-Code-Prover: Hierarchical proof search for open state-of-the-art code verification.arXiv:2603.19329, 2026. 11
-
[17]
Aesop: White-Box Best-First Proof Search for Lean
Jannis Limperg and Asta Halkjær From. Aesop: White-Box Best-First Proof Search for Lean. In Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2023, pp. 253–266, New York, NY , USA, January 2023. Association for Computing Machinery. ISBN 979-8-4007-0026-2. DOI:10.1145/3573105.3575671
-
[18]
Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024
-
[19]
Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.arXiv:2508.03...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.arXiv:2601.14027, 2026
-
[21]
The lean 4 theorem prover and programming language
Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pp. 625–635, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-79875-8. DOI:10.1007/978-3-030-79876-5_37
-
[22]
Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science
Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science. Springer, 2002
2002
-
[23]
PhD thesis, Chalmers University of Technology and Göteborg University, 2007
Ulf Norell.Towards a Practical Programming Language Based on Dependent Type Theory. PhD thesis, Chalmers University of Technology and Göteborg University, 2007
2007
-
[24]
Canonical for Automated Theorem Proving in Lean
Chase Norman and Jeremy Avigad. Canonical for Automated Theorem Proving in Lean. arXiv:2504.06239, 2025
-
[25]
Lean-lsp-mcp
Project Numina. Lean-lsp-mcp. GitHub repositoryproject-numina/lean-lsp-mcp. Accessed: 2026- 02-03
2026
-
[26]
Numina-lean-agent
Project Numina. Numina-lean-agent. GitHub repository project-numina/numina-lean-agent. Com- mit4c8e039
-
[27]
Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.arXiv:2504.21801, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
The rocq prover.Project website, 2026.https://rocq-prover.org
Rocq Development Team. The rocq prover.Project website, 2026.https://rocq-prover.org
2026
-
[29]
Konrad Slind and Michael Norrish. A brief overview of HOL4. InTheorem Proving in Higher Order Logics (TPHOLs 2008), volume 5170 ofLecture Notes in Computer Science, pp. 28–32. Springer, 2008. DOI:10.1007/978-3-540-71067-7_6
-
[30]
An in-context learning agent for formal theorem-proving
Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv:2310.04353, 2024
-
[31]
CLEVER: A Curated Benchmark for Formally Verified Code Generation.arXiv:2505.13938, 2025
Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. CLEVER: A Curated Benchmark for Formally Verified Code Generation.arXiv:2505.13938, 2025
-
[32]
Lean FRO, 2024
The Lean Developers.The Lean Language Reference: The grind tactic. Lean FRO, 2024. URL https://lean-lang.org/doc/reference/latest/The--grind--tactic/ . Lean version 4.15.0 or later
2024
-
[33]
Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, and Marc Lelarge. Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025
-
[34]
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025
Qiyuan Xu, Renxi Wang, Peixin Wang, Haonan Li, and Conrad Watt. A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025
-
[36]
VERINA: Bench- marking Verifiable Code Generation.arXiv:2505.23135, 2025
Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. VERINA: Bench- marking Verifiable Code Generation.arXiv:2505.23135, 2025. 13 A Preliminary Experiments In prior experiments, the authors evaluated a suite of automated provers on proof generation on the CLEVER[ 31] and VERINA[ 36] datasets, finding that agentic paradigms substan...
-
[37]
This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions
findBound: Finds a bound b (starting from 1, doubling up to 2 100) such that poly(b) and poly(-b) have opposite signs. This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions
-
[38]
Each step halves the interval containing the root
bisect: Standard bisection on [-bound, bound] with 1,000,000 iterations. Each step halves the interval containing the root. After n steps, the interval has width 2·bound / 2 n, giving astronomically precise convergence – far more than needed for eps = 10-6. Verification against test cases: - [1, 2] (1 + 2x): bisect finds mid = -0.5 where poly = 0 exactly ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.