pith. sign in

arxiv: 2605.23772 · v1 · pith:7LYMBPRDnew · submitted 2026-05-22 · 💻 cs.AI · cs.LO· cs.PL· cs.SE

Agentic Proving for Program Verification

Pith reviewed 2026-05-25 04:00 UTC · model grok-4.3

classification 💻 cs.AI cs.LOcs.PLcs.SE
keywords agentic provingprogram verificationLean 4CLEVER benchmarkformal methodsautomated theorem provingClaude model
0
0 comments X

The pith

Claude Code in an agentic framework produces valid specifications for 98.8 percent of CLEVER problems and achieves 98.1 percent end-to-end verification success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates an agentic proving system based on Claude on the CLEVER Lean 4 benchmark for verifiable code generation. It reports high rates of generating valid specifications and certifying implementations, along with effective self-feedback. This leads the authors to conclude that current benchmarks are mismatched to modern prover capabilities and that compiler-in-the-loop agentic methods work best for program verification. A sympathetic reader would see this as evidence that automated verification is becoming practical on standard test sets.

Core claim

Claude generates arguably valid specifications for 98.8% of problems, with 81.3% accepted by isomorphism-based scoring, certifies implementations for 87.5% against ground-truth specs, and reaches 98.1% success on the full pipeline for self-consistent entries, while providing high-quality feedback and identifying dataset bugs.

What carries the argument

The agentic proving framework that integrates Claude Code with iterative compiler feedback and manual review for Lean 4 specification and implementation generation on the CLEVER benchmark.

If this is right

  • High success rates indicate that tight integration of the compiler in the loop is the most effective approach for foundational program verification.
  • Existing benchmarks like CLEVER no longer pose sufficient challenge for state-of-the-art agentic provers.
  • Manual review confirms the quality of the agent's self-generated feedback on failures.
  • Alternatives to isomorphism-based scoring are needed for more rigorous evaluation of generated specifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agentic setups could be tested on harder or more diverse program verification tasks beyond CLEVER.
  • The identification of lingering bugs in the dataset suggests that future benchmarks should incorporate automated bug detection as part of their design.
  • These results may imply that program verification can scale to real-world codebases if benchmarks are updated accordingly.

Load-bearing premise

The evaluation treats CLEVER's ground-truth specifications and isomorphism scoring plus manual checks as a reliable measure of actual verification correctness even though the paper notes bugs in the dataset.

What would settle it

A concrete counterexample would be a generated specification that passes both the agent's validation and CLEVER's scoring but is shown by independent proof or model checking to be incorrect for the intended program behavior.

Figures

Figures reproduced from arXiv: 2605.23772 by Akhil Arora, Alessandro Sosso, Bas Spitters.

Figure 1
Figure 1. Figure 1: Schematic overview of our experimental pipeline. The four generation and proof tasks [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of output flags for SPEC_ISO runs. The following color scheme will be used throughout the paper: green for OK entries, red for FAIL , yellow for MISMATCH , blue for ISSUE in benchmark components, orange for those in generated components, purple for issues in both. 3.2 Implementation certification From ground-truth specifications. Following the original CLEVER protocol, the first run of this st… view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of output flags for PROOF_GEN runs. Note the change of coloring for specification ISSUE between rows, as their source changes from the ground-truth to automatic generation. 4 Discussion 4.1 Specification certification Autoformalization is hard to evaluate. Beyond the technical hurdle of producing a Lean specification that compiles, two more substantive issues dominate: choosing the right stren… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of prover performance on the CLEVER and VERINA benchmarks. Solid bars indicate the number of tests solved on the verified-correct portions of each dataset, while the patterned bar segments above show additional solved instances obtained after correcting erroneous benchmark entries. Numbers annotated with a ‘+’ denote the corresponding count of these additional solves. 15 [PITH_FULL_IMAGE:figures/… view at source ↗
Figure 5
Figure 5. Figure 5: Output flag for each problem ID across the three [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Breakdown of results for the Multi-spec SPEC_ISO run. Output flag for tested problems in the Multi-spec SPEC_ISO run: overall result and detailed per-spec flagging. Color scheme corresponds to the one in [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Scatter plot of attempt costs (USD) vs. runtimes (s), across all tasks and runs per task. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Fix of CLEVER problem 32 generated by Claude Code. At specification certification, no run produced a generated_spec that proved isomorphic to the ground-truth one. The reason was systematic: every generated specification required the implementa￾tion to return an exact root of the polynomial, whereas the ground-truth specification only required an approximate root within a fixed tolerance. The SPEC_GEN agen… view at source ↗
Figure 9
Figure 9. Figure 9: Extract of Claude’s replies explaining the edits it made in fixing problem 32. [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates an agentic proving framework using Claude Code on the CLEVER Lean 4 benchmark for verifiable code generation. It reports that Claude produces arguably valid specifications for 98.8% of problems (81.3% also accepted by CLEVER's isomorphism scoring on the correct subset), certifies implementations against ground-truth specs for 87.5% of problems, and achieves 98.1% end-to-end success on entries with self-consistent premises. The work additionally claims high-quality self-feedback from the model, identifies dataset bugs via manual review, and argues that existing benchmarks are mismatched to current agentic prover capabilities, calling for alternatives to isomorphism-based scoring.

Significance. If the performance numbers prove robust under stricter verification, the results would indicate that tight compiler-in-the-loop agentic systems can already saturate current program-verification benchmarks, providing concrete evidence of a capability-benchmark gap and motivating the development of harder, bug-resilient evaluation suites. The paper's explicit discussion of dataset limitations and scoring weaknesses is a constructive contribution.

major comments (3)
  1. [Evaluation] Evaluation section: the 98.8% 'arguably valid' figure rests on manual review whose criteria, reviewer count, and inter-rater reliability are not reported; this subjectivity directly supports the headline performance claim and requires explicit documentation or an automated proxy.
  2. [Results] Results section: success rates (98.8%, 87.5%, 98.1%) are stated without dataset cardinality, confidence intervals, or per-category failure breakdown, making it impossible to assess whether the numbers are statistically distinguishable from ceiling effects.
  3. [Discussion] Discussion: the paper acknowledges lingering bugs in CLEVER yet still computes all metrics against the benchmark's ground-truth specifications and isomorphism scorer; no sensitivity analysis quantifies how many reported successes could be artifacts of those bugs.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'correct portion of the benchmark' is used without a forward reference to the filtering procedure described later in the text.
  2. [Results] The manuscript would benefit from an explicit statement of total problem count and the size of the 'self-consistent premises' subset used for the 98.1% figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the significance of the capability-benchmark gap highlighted in our work. We address each major comment point by point below, indicating the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 98.8% 'arguably valid' figure rests on manual review whose criteria, reviewer count, and inter-rater reliability are not reported; this subjectivity directly supports the headline performance claim and requires explicit documentation or an automated proxy.

    Authors: We agree that the manuscript lacks explicit documentation of the manual review process. In the revised version we will add a dedicated paragraph in the Evaluation section specifying the review criteria (Lean 4 syntactic validity, semantic plausibility relative to the problem statement, and absence of obvious logical contradictions), the number of reviewers (two authors performed independent reviews), and the disagreement-resolution procedure (discussion until consensus). An automated proxy remains difficult because the 'arguably valid' category is defined precisely by cases that the isomorphism scorer cannot capture. revision: yes

  2. Referee: [Results] Results section: success rates (98.8%, 87.5%, 98.1%) are stated without dataset cardinality, confidence intervals, or per-category failure breakdown, making it impossible to assess whether the numbers are statistically distinguishable from ceiling effects.

    Authors: The CLEVER benchmark comprises 160 problems. We will revise the Results section to report exact numerators and denominators for each percentage, append binomial (Wilson) confidence intervals, and insert a table providing per-category failure counts and qualitative descriptions. These additions will make proximity to ceiling performance directly evaluable. revision: yes

  3. Referee: [Discussion] Discussion: the paper acknowledges lingering bugs in CLEVER yet still computes all metrics against the benchmark's ground-truth specifications and isomorphism scorer; no sensitivity analysis quantifies how many reported successes could be artifacts of those bugs.

    Authors: A full quantitative sensitivity analysis would require exhaustive re-auditing and correction of every ground-truth specification in the benchmark, which is outside the scope of the present study. In the revision we will expand the Discussion to provide a qualitative accounting of how the bugs we manually identified affected individual cases, restate that all reported metrics follow the benchmark's official protocol, and explicitly flag the absence of a complete sensitivity analysis as a limitation and a priority for future benchmark work. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper reports success rates (98.8%, 87.5%, 98.1%) from running Claude Code on the external CLEVER Lean 4 benchmark. No derivations, equations, fitted parameters, or first-principles claims appear in the provided text. All metrics are computed directly against the benchmark's provided ground-truth specifications and isomorphism-based scoring, which are external inputs. The paper notes dataset bugs and calls for better evaluation methods, but this does not create any self-referential reduction or load-bearing self-citation chain. The evaluation is self-contained against the external dataset with no internal quantities redefined as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical evaluation paper whose central claims rest on the representativeness of the CLEVER benchmark and the reliability of its scoring plus manual review.

axioms (1)
  • domain assumption CLEVER benchmark and its isomorphism-based scoring provide a valid measure of program verification success
    The reported success rates depend on this benchmark being an appropriate testbed; the paper itself flags bugs in the dataset.

pith-pipeline@v0.9.0 · 5758 in / 1241 out tokens · 25160 ms · 2026-05-25T04:00:16.161151+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 25 canonical work pages · 6 internal anchors

  1. [1]

    Aristotle: IMO-level Automated Theorem Proving

    Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern- Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and L...

  2. [2]

    Claude-agent-sdk-python

    Anthropic. Claude-agent-sdk-python. GitHub repository anthropics/claude-agent-sdk-python. Version 0.1.52, commit566e41f

  3. [3]

    Mizar: State-of-the-art and beyond

    Grzegorz Bancerek, Czesław Byli´nski, Adam Grabowski, Artur Korniłowicz, Roman Matuszewski, Adam Naumowicz, Karol P ˛ ak, and Josef Urban. Mizar: State-of-the-art and beyond. In Manfred Kerber, Jacques Carette, Cezary Kaliszyk, Florian Rabe, and V olker Sorge (eds.),Intelligent Computer Mathe- matics, pp. 261–279, Cham, 2015. Springer International Publis...

  4. [4]

    A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025

    Barı¸ s Bayazıt, Yao Li, and Xujie Si. A case study on the effectiveness of llms in verification with proof assistants.arXiv:2508.18587, 2025

  5. [5]

    Brown, and Josef Urban

    Mario Carneiro, Chad E. Brown, and Josef Urban. Automated theorem proving for metamath. In Adam Naumowicz and René Thiemann (eds.),14th International Conference on Interactive Theo- rem Proving (ITP 2023), volume 268 ofLeibniz International Proceedings in Informatics (LIPIcs), pp. 9:1–9:19, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für I...

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  7. [7]

    Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

    Wenhu Chen, Shuo Wei, Hao Wang, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv:2507.23726, 2025

  8. [8]

    LeanHammer

    Joshua Clune. LeanHammer. GitHub repository JOSHCLUNE/LeanHammer. Version 4.26.0, commit 87e1bba

  9. [9]

    Lean-lsp-mcp

    Oliver Dressler. Lean-lsp-mcp. GitHub repository oOo0oOo/lean-lsp-mcp. Version 0.25.1, commit 246e2ad

  10. [10]

    Certified Program Synthesis with a Multi-Modal Verifier

    Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein, Vitaly Kurin, George Pîrlea, Qiyuan Zhao, Peter Müller, and Ilya Sergey. Certified program synthesis with a multi-modal verifier.arXiv:2604.16584, 2026

  11. [11]

    Lean4-skills

    Cameron Freer. Lean4-skills. GitHub repository cameronfreer/lean4-skills. Version 4.4.5, commit a1cef7d. (preliminary experiments of Appendix A on the earlier commit5d6bd03)

  12. [12]

    Hol light: A tutorial introduction

    John Harrison. Hol light: A tutorial introduction. InFormal Methods in Computer-Aided Design (FMCAD), 1996

  13. [13]

    Yichen Huang and Lin F. Yang. Winning Gold at IMO 2025 with a Model-Agnostic Verification-and- Refinement Pipeline.arXiv:2507.15855, 2025

  14. [14]

    URL https://aclanthology.org/2025

    Thomas Hubert, Harsh Mehta, Laurent Sartran, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, 2025. DOI:10.1038/s41586-025-09833-y

  15. [15]

    Shuvendu K. Lahiri. Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents.arXiv:2603.17150, 2026

  16. [16]

    Goedel-code-prover: Hierarchical proof search for open state-of-the-art code verification

    Zenan Li, Ziran Yang, Deyuan He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, and Chi Jin. Goedel-Code-Prover: Hierarchical proof search for open state-of-the-art code verification.arXiv:2603.19329, 2026. 11

  17. [17]

    Aesop: White-Box Best-First Proof Search for Lean

    Jannis Limperg and Asta Halkjær From. Aesop: White-Box Best-First Proof Search for Lean. In Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2023, pp. 253–266, New York, NY , USA, January 2023. Association for Computing Machinery. ISBN 979-8-4007-0026-2. DOI:10.1145/3573105.3575671

  18. [18]

    Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024

    Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, and Xiaodan Liang. Fvel: Interactive formal verification environment with large language models via theorem proving.arXiv:2406.14408, 2024

  19. [19]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.arXiv:2508.03...

  20. [20]

    Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.arXiv:2601.14027, 2026

    Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.arXiv:2601.14027, 2026

  21. [21]

    The lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pp. 625–635, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-79875-8. DOI:10.1007/978-3-030-79876-5_37

  22. [22]

    Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science

    Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel.Isabelle/HOL: A Proof Assistant for Higher- Order Logic, volume 2283 ofLecture Notes in Computer Science. Springer, 2002

  23. [23]

    PhD thesis, Chalmers University of Technology and Göteborg University, 2007

    Ulf Norell.Towards a Practical Programming Language Based on Dependent Type Theory. PhD thesis, Chalmers University of Technology and Göteborg University, 2007

  24. [24]

    Canonical for Automated Theorem Proving in Lean

    Chase Norman and Jeremy Avigad. Canonical for Automated Theorem Proving in Lean. arXiv:2504.06239, 2025

  25. [25]

    Lean-lsp-mcp

    Project Numina. Lean-lsp-mcp. GitHub repositoryproject-numina/lean-lsp-mcp. Accessed: 2026- 02-03

  26. [26]

    Numina-lean-agent

    Project Numina. Numina-lean-agent. GitHub repository project-numina/numina-lean-agent. Com- mit4c8e039

  27. [27]

    Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.arXiv:2504.21801, 2025

  28. [28]

    The rocq prover.Project website, 2026.https://rocq-prover.org

    Rocq Development Team. The rocq prover.Project website, 2026.https://rocq-prover.org

  29. [29]

    A brief overview of HOL4

    Konrad Slind and Michael Norrish. A brief overview of HOL4. InTheorem Proving in Higher Order Logics (TPHOLs 2008), volume 5170 ofLecture Notes in Computer Science, pp. 28–32. Springer, 2008. DOI:10.1007/978-3-540-71067-7_6

  30. [30]

    An in-context learning agent for formal theorem-proving

    Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv:2310.04353, 2024

  31. [31]

    CLEVER: A Curated Benchmark for Formally Verified Code Generation.arXiv:2505.13938, 2025

    Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. CLEVER: A Curated Benchmark for Formally Verified Code Generation.arXiv:2505.13938, 2025

  32. [32]

    Lean FRO, 2024

    The Lean Developers.The Lean Language Reference: The grind tactic. Lean FRO, 2024. URL https://lean-lang.org/doc/reference/latest/The--grind--tactic/ . Lean version 4.15.0 or later

  33. [33]

    Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025

    Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, and Marc Lelarge. Minif2f in rocq: Automatic translation between proof assistants – a case study.arXiv:2503.04763, 2025

  34. [34]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...

  35. [35]

    A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025

    Qiyuan Xu, Renxi Wang, Peixin Wang, Haonan Li, and Conrad Watt. A minimalist proof language for neural theorem proving over isabelle/hol.arXiv:2507.18885, 2025

  36. [36]

    VERINA: Bench- marking Verifiable Code Generation.arXiv:2505.23135, 2025

    Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. VERINA: Bench- marking Verifiable Code Generation.arXiv:2505.23135, 2025. 13 A Preliminary Experiments In prior experiments, the authors evaluated a suite of automated provers on proof generation on the CLEVER[ 31] and VERINA[ 36] datasets, finding that agentic paradigms substan...

  37. [37]

    This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions

    findBound: Finds a bound b (starting from 1, doubling up to 2 100) such that poly(b) and poly(-b) have opposite signs. This is guaranteed to exist for odd-degree polynomials (even-length coefficient list) with non-zero leading coefficient – exactly the docstring’s preconditions

  38. [38]

    Each step halves the interval containing the root

    bisect: Standard bisection on [-bound, bound] with 1,000,000 iterations. Each step halves the interval containing the root. After n steps, the interval has width 2·bound / 2 n, giving astronomically precise convergence – far more than needed for eps = 10-6. Verification against test cases: - [1, 2] (1 + 2x): bisect finds mid = -0.5 where poly = 0 exactly ...