Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts
Pith reviewed 2026-06-28 09:51 UTC · model grok-4.3
The pith
Proof-Refactor decomposes LLM-generated Lean proofs into four phases to yield more modular and readable artifacts than direct baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Proof-Refactor framework decomposes refactoring into four phases—extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components—and thereby improves rubric-based refactoring scores over a strong Claude Code baseline on generated Lean proofs from PutnamBench and Putnam2025, with the largest gains in signature quality and human readability.
What carries the argument
The four-phase agentic framework that extracts fragments, designs helpers, proves components, and repairs the original proof.
If this is right
- Signature quality and human readability rise without treating proof length as the primary objective.
- Extracted and verified helper declarations become reusable library components.
- The repaired proofs maintain formal correctness while exhibiting improved structure.
- The same decomposition can be run on proofs generated from other benchmarks that produce Lean output.
Where Pith is reading between the lines
- Integrating the refactoring loop inside the initial generation step could reduce the need for post-hoc cleanup.
- The four-phase structure may transfer to other interactive theorem provers whose libraries prize modularity.
- If the agentic phases are made more robust, the method could lower the human effort required to turn raw formalizations into maintainable developments.
Load-bearing premise
Rubric-based human judgments reliably capture the intended properties of modularity, maintainability, and reusability.
What would settle it
Apply the four-phase pipeline to a new collection of LLM-generated Putnam proofs and measure whether rubric scores remain flat or drop while new compilation or logical errors appear.
Figures
read the original abstract
While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, modular, maintainable, and reusable than proofs in mature formal mathematics libraries. We argue that this gap stems in part from the compile-first objective implicit in most proof-generation pipelines, which encourages monolithic or ad hoc proof scripts rather than library-quality artifacts. Existing approaches to proof-quality improvement often rely on explicit, computable optimization objectives. In practice, however, the most tractable and experimentally validated objectives are largely length-based, while higher-level qualities such as readability, modularity, maintainability, and reusability are difficult to reduce to reliable automatic metrics. Instead of optimizing proof improvement against a single proxy metric, we take a process-guided approach inspired by human proof-refactoring workflows. We propose an agentic framework $\textbf{Proof-Refactor}$ that decomposes proof refactoring into four phases: extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components. On generated Lean proofs from PutnamBench and Putnam2025, Proof-Refactor improves rubric-based refactoring scores over a strong Claude Code refactoring baseline, with the largest gains in signature quality and human readability. These results suggest that process-guided refactoring can improve proof structure without treating proof length as the primary objective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Proof-Refactor, an agentic framework that decomposes refactoring of LLM-generated Lean proofs into four phases (extracting fragments, designing helpers, proving components, repairing the proof) to produce more modular and readable artifacts. It reports that the framework improves rubric-based refactoring scores over a Claude Code baseline on PutnamBench and Putnam2025 proofs, with largest gains in signature quality and human readability, arguing that process guidance outperforms proxy-metric optimization.
Significance. If the rubric reliably tracks modularity, maintainability, and reusability and the four-phase process runs without introducing errors, the work would be significant for shifting formal-proof generation from monolithic scripts toward library-quality artifacts. The process-guided approach addresses a recognized limitation of length-based objectives and could influence agentic workflows in interactive theorem proving.
major comments (3)
- [Abstract] Abstract: the central claim of improved refactoring quality rests on rubric-based human scores, yet the manuscript provides neither the rubric definition nor inter-rater reliability statistics, nor any correlation between rubric scores and objective indicators such as reuse across files or successful compilation in larger contexts.
- [Evaluation] Evaluation description: no quantitative results are given on per-phase success rates or on the frequency with which the four-phase process introduces new proof errors or requires manual fixes, leaving open whether the reported score gains are achieved under fully automated conditions.
- [Results] Results paragraph: the comparison is described as being against a 'strong Claude Code refactoring baseline,' but the manuscript does not detail how the baseline prompt, temperature, or iteration count were matched to the Proof-Refactor agent, raising the possibility that prompt-engineering differences rather than the four-phase decomposition explain the observed gains.
minor comments (2)
- The abstract would be clearer if it briefly listed the four rubric dimensions that produced the largest gains.
- Notation for the agentic phases is introduced without a diagram or pseudocode listing the exact inputs and outputs of each phase.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and presentation of our results. We address each major comment below, committing to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of improved refactoring quality rests on rubric-based human scores, yet the manuscript provides neither the rubric definition nor inter-rater reliability statistics, nor any correlation between rubric scores and objective indicators such as reuse across files or successful compilation in larger contexts.
Authors: We agree the abstract should reference the rubric. In revision we will insert a concise definition of the four rubric dimensions (signature quality, readability, modularity, maintainability). The detailed rubric appears in Section 4.1. Inter-rater reliability statistics and quantitative correlations with objective reuse or cross-file compilation metrics are not present in the current study; we will explicitly note these as limitations and discuss them in a new Limitations subsection. revision: partial
-
Referee: [Evaluation] Evaluation description: no quantitative results are given on per-phase success rates or on the frequency with which the four-phase process introduces new proof errors or requires manual fixes, leaving open whether the reported score gains are achieved under fully automated conditions.
Authors: We will expand the Evaluation section to report per-phase success rates (fragment extraction, helper design, component proving, and repair) and to state explicitly that the entire pipeline is fully automated. All final Lean proofs were verified to compile without new errors or manual intervention; these statistics will be added to Table 2 and the accompanying text. revision: yes
-
Referee: [Results] Results paragraph: the comparison is described as being against a 'strong Claude Code refactoring baseline,' but the manuscript does not detail how the baseline prompt, temperature, or iteration count were matched to the Proof-Refactor agent, raising the possibility that prompt-engineering differences rather than the four-phase decomposition explain the observed gains.
Authors: We will revise the Results and Experimental Setup sections to specify the baseline configuration: identical temperature (0.7), the same number of refinement iterations (3), and the precise single-prompt template used for the Claude Code baseline. This ensures the only systematic difference is the four-phase decomposition versus the monolithic prompt. revision: yes
- Inter-rater reliability statistics for the rubric scores
- Quantitative correlations between rubric scores and objective indicators such as reuse across files or successful compilation in larger contexts
Circularity Check
No significant circularity; evaluation uses external baseline and rubric judgments.
full rationale
The paper describes an empirical agentic framework (four-phase decomposition) evaluated via rubric scores on PutnamBench/Putnam2025 against a Claude baseline. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The result is a direct comparison to an independent external system rather than a self-referential derivation, so the claim does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Proof-Refactor agentic framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
ImProver: Agent-Based Automated Proof Optimization
Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, and Sean Welleck. Improver: Agent-based automated proof optimization.arXiv preprint arXiv:2410.04753,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer
Axiom Math blog post. Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Middeldorp, editors,Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, volume 9195 of...
2015
-
[3]
URLhttps://doi.org/10.1007/978-3-319-21401-6_26
doi: 10.1007/978-3-319-21401-6\_26. URLhttps://doi.org/10.1007/978-3-319-21401-6_26. DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, apr
-
[4]
URLhttps://arxiv.org/abs/2403.13310. Google DeepMind. Gemini 3.1 pro model card. Technical report, Google DeepMind, Febru- ary
- [5]
-
[6]
URLhttps://arxiv.org/abs/2210.12283. Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. Numina-lean-agent: An open and general agentic reasoning system for formal mathematics,
-
[7]
URLhttps://arxiv.org/abs/2601.14027. The mathlib Community. The lean mathematical library.CoRR, abs/1910.09336,
-
[8]
URL http://arxiv. org/abs/1910.09336. Norman Megill and David A Wheeler.Metamath: a computer language for mathematical proofs. Lulu. com,
-
[9]
URLhttps://doi.org/10.1007/BF00248324
doi: 10.1007/BF00248324. URLhttps://doi.org/10.1007/BF00248324. George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: Evaluating neural theorem-provers on the putnam mathematical competition.Advances in Neural Information Processing Systems, 37:11545–11569,
-
[10]
Yutong Xin, Jimmy Xin, Gabriel Poesia, Noah Goodman, Qiaochu Chen, and Isil Dillig
URLhttps://arxiv.org/abs/2509.22819. Yutong Xin, Jimmy Xin, Gabriel Poesia, Noah Goodman, Qiaochu Chen, and Isil Dillig. Automated discovery of tactic libraries for interactive theorem proving,
-
[11]
Jin Peng Zhou, Yuhuai Wu, Qiyang Li, and Roger Grosse
URLhttps://arxiv.org/abs/2503.24036. Jin Peng Zhou, Yuhuai Wu, Qiyang Li, and Roger Grosse. Refactor: Learning to extract theorems from proofs. arXiv preprint arXiv:2402.17032,
-
[12]
We omit the full prompt suites from the arXiv appendix to keep the source concise; individual listings are captioned by role rather than by repository path
A Additional Artifacts This appendix records selected Lean examples and the evaluation rubric. We omit the full prompt suites from the arXiv appendix to keep the source concise; individual listings are captioned by role rather than by repository path. Table 3: Appendix artifact index. ID Artifact group Appendix location A1 Lean showcase artifacts Appendix...
1968
-
[13]
:= by refine (one_div_le_one_div (by positivity) (by positivity)).mpr ?_ have h_n_ge_N : (N :R)≤(n :R) := by exact mod_cast hn linarith linarith have hmem : q n∈Metric.ball (0 :Q)ε:= by rw [Metric.mem_ball] exact h_bound exact hball_sub hmem have h_fin_covers_head (n :N) (hn : n < N) :∃i :ι, q n∈U i := by have hqn_mem_S : q n∈S := by simp [S] have hqn_in_...
1968
-
[14]
import Mathlib open Finset Polynomial Topology Filter Metric /-- In the rationals, any nonempty open set contains a point outside any given compact set
but avoids every K(n) let S : SetQ:= {0}∪Set.range q have hS_compact : IsCompact S := isCompact_zero_union_range q hq_ball have hS_not_covered (n :N) :¬S⊆K n := by intro h_sub exact hq_notin n (h_sub (by simp [S])) 13 rcases hcover S hS_compact with⟨n, hn⟩ exact hS_not_covered n hn Listing 2: Baseline refactoring for Putnam 1968 B6. import Mathlib open Fi...
1968
-
[15]
:= by apply one_div_le_one_div_of_le (by positivity) have : (↑N :R)≤(↑n :R) := by exact_mod_cast hn linarith linarith /-- Prove that no sequence $\{K_n\}_{n=0}^{\infty}$ of compact (closed and bounded) sets of rational numbers has the property that every compact set of rational numbers is contained by at least one $K_n$. -/ theorem putnam_1968_b6 :¬∃K :N→...
1968
-
[16]
15 import Mathlib open Set abbrev putnam_1971_a2_solution : Set (PolynomialR) := {Polynomial.X} -- {Polynomial.X} /-- Determine all polynomials $P(x)$ such that $P(x^2 +
with (hzero | hpos_case) ·rw [hzero]; norm_num ·have hpos : 0 < a n := lt_of_le_of_ne (ha_nonneg n) (Ne.symm hpos_case) nlinarith have ha_inj : Function.Injective a := by intro x y h rcases Nat.lt_trichotomy x y with (hlt | heq | hlt) ·have hle : x.succ≤y := hlt have : a x < a y := Nat.le_induction (by simpa [Nat.succ_eq_add_one] using ha_strictMono x) (f...
1971
-
[17]
with (hzero | hpos_case) ·rw [hzero]; norm_num ·have hpos : 0 < a n := lt_of_le_of_ne (ha_nonneg n) (Ne.symm hpos_case) nlinarith have ha_inj : Function.Injective a := (strictMono_nat_of_lt_succ ha_strictMono).injective -- P fixes every term of the sequence: P(a(n)) = a(n) have hP_a :∀n, P.eval (a n) = a n := by intro n induction’ n with n ih ·exact h0 ·r...
1971
-
[18]
18 A.2 Evaluation Rubric The listing below records the rubric used by the LLM-as-a-judge evaluator
:= seq_strictMono a ha_succ ha_nonneg have ha_inj : Function.Injective a := (strictMono_nat_of_lt_succ ha_strictMono).injective have hP_a :∀n, P.eval (a n) = a n := eval_seq_eq_seq P a h0 heq rfl ha_succ set Q := P - Polynomial.X with hQ_def have hQ_ne : Q̸=0 := by intro hQzero apply h_ne rw [hQ_def] at hQzero exact sub_eq_zero.mp hQzero set d := Q.natDeg...
1971
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.