pith. sign in

arxiv: 2605.20244 · v1 · pith:63MPGZ4Wnew · submitted 2026-05-18 · 💻 cs.LO · cs.AI· cs.CL· cs.LG· cs.SE

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

Pith reviewed 2026-05-21 08:47 UTC · model grok-4.3

classification 💻 cs.LO cs.AIcs.CLcs.LGcs.SE
keywords Leanproof refactoringmulti-objective optimizationagentic retrievalformal verificationversion compatibilityLLM-guided rewritingcompilation reduction
0
0 comments X

The pith

Retrieval from an annotated strategy database lets a frozen LLM refactor Lean proofs to cut length, compilation time, and version breakage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lean Refactor as a retrieval-augmented agentic system that takes verbose, version-brittle LLM-generated Lean proofs and rewrites them for multiple goals at once. It tackles the practical mismatch between fast-moving LLMs and the fragile compatibility requirements of Lean and Mathlib libraries. By pulling pre-labeled refactoring moves that record expected size savings, compile-cost drops, and supported versions, the system guides the LLM without any model updates. If the approach holds, proof libraries could stay compact and transferable even as both the language and the models continue to change.

Core claim

Lean Refactor steers a frozen agentic LLM by retrieving from a curated database of multi-objective refactoring strategies, each annotated with supported Lean and Mathlib versions plus expected compilation-cost reductions. On competition benchmarks the method yields over 70 percent token-level compression; on research repositories it yields over 20 percent. Compilation time falls by up to 60 percent, outperforming earlier refactoring pipelines and direct Claude Code use. Version-filtered retrieval further improves results for a chosen Lean release, and the refactored miniF2F proofs show stronger zero-shot transfer to later Lean versions than the originals.

What carries the argument

Version-filtered retrieval from a densely annotated database of multi-objective refactoring strategies that records length, compile cost, and compatibility trade-offs for each tactic.

If this is right

  • Proofs become substantially shorter on standard competition suites while remaining correct.
  • Compilation time for the refactored proofs drops measurably compared with the source versions.
  • Version-specific retrieval produces better compression when the target Lean release is known in advance.
  • Refactored proofs retain correctness and improve transfer when moved to newer Lean library versions.
  • The same retrieval mechanism outperforms both prior refactoring tools and direct use of the base LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large formal libraries could be kept compact by periodic batch refactoring instead of manual cleanup after each library update.
  • The multi-objective database could be extended to encode readability or pedagogical goals in addition to size and speed.
  • If the strategy annotations are kept current, the same framework might transfer to other interactive theorem provers that expose similar tactic languages.
  • Retrieval augmentation of this form could reduce the need for repeated fine-tuning when new LLM releases appear.

Load-bearing premise

A sufficiently complete and correctly labeled database of refactoring strategies must exist so that the agent can locate and apply the right moves to deliver the measured gains.

What would settle it

Running the same compression and timing experiments on a fresh collection of proofs with an empty or randomly ordered strategy database should eliminate the reported token savings and compile-time reductions.

Figures

Figures reproduced from arXiv: 2605.20244 by Jialin Lu, Kaiyu Yang, Rodrigo Stehling, Soonho Kong, Weiran Sun, Wuyang Chen, Zhangyang Wang.

Figure 1
Figure 1. Figure 1: Our Lean Refactor can be controlled to refac￾tor Lean proofs for multiple objectives: proof length, compilation time cost, and compatibility across Lean versions. Second, frontier LLMs are misaligned with the Lean/Mathlib release cycle. Lean and Mathlib evolve on a near-weekly basis, with lemmas re￾named, APIs restructured, and new automation landed, while LLMs are pinned to a knowledge cutoff. Even with w… view at source ↗
Figure 2
Figure 2. Figure 2: Weak correlation between proof length and compilation time. Token length fails to predict compilation cost because of short, heavy tactics, a factor often overlooked in previous Lean refactoring research. Beyond the redundancy in code length demonstrated in previous papers [3, 17], in this section, we motivate our work with more practical challenges of Lean code refac￾toring with LLMs. Proof Optimization i… view at source ↗
Figure 3
Figure 3. Figure 3: Knowledge cutoff lag in LLMs: Most LLMs misalign with Lean/Mathlib versions due to static knowl￾edge cutoff. Lean Repository Compatibility is Fragile. Version sensitivity is a long-standing bottle￾neck of Lean/Mathlib: Lean and Mathlib evolve rapidly; weekly releases frequently rename dec￾larations and break imports even across patch versions [6], forcing downstream projects to pin to Mathlib’s exact toolc… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of Lean Refactor framework. 1. We first summarize raw long-short proof pairs sourced from diverse Lean code sources into well-structured reusable strategies, and more importantly, we annotate each strategy with code refactoring metadata, including compilation time cost and version compatibility (Section 3.1). 2. We then train a strategy retrieval model that maps Lean code (that can be potentially … view at source ↗
Figure 5
Figure 5. Figure 5: Multi-objective Lean proof refactoring. A single strategy bank adapts to diverse user objectives at inference without retraining. Top: for shortest proofs, top-K strategies by cosine similarity are used directly. Middle: to balance faster compilation, candidates are reranked by annotated compile-cost reduction. Bottom: for a target toolchain (e.g., v4.16.0), incompatible strategies are filtered out. The ag… view at source ↗
Figure 6
Figure 6. Figure 6: Research-level problems. To assess refactoring on advanced mathematics, we draw theorems from four research-level Lean 4 repositories: Analysis [36], Fermat’s Last Theorem (FLT) [8], the Polynomial Freiman–Ruzsa Conjecture (PFR) [37], and PhysLean [39]. From each repository we sample 45 theorems, subject to compute budget. To evaluate refactoring effectiveness across a wide range of initial proof complexit… view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of original proof lengths across three competition style evaluation sets (miniF2F, PutnamBench, Putnam 2025). • 15 proofs exceeding 1000 tokens, • 15 proofs between 500 and 1000 tokens, • 10 proofs between 100 and 500 tokens, • 5 proofs under 100 tokens. When a repository contains insufficient theorems in a given length bucket, the unfilled quota is reallocated to the nearest non-empty bucket.… view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of original proof lengths across four research-level evaluation sets (Analysis, FLT, PFR, PhysLean). I.2 Metrics We report three metrics across our benchmarks. Proof length. We measure proof brevity as the average relative percentage decrease in token count between the initial long proof and the refactored proof. To ensure consistency with prior work, we use the syntax-aware tokenizer released… view at source ↗
Figure 8
Figure 8. Figure 8: Average relative length reduction across research-level datasets over successive API calls. [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Average relative length reduction across competition-style datasets over successive API calls. [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
read the original abstract

We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean's release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over $70\%$ token-level compression on competition benchmarks, over $20\%$ on research repositories, and up to $60\%$ compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective controllable refactoring of Lean proofs. It steers a frozen LLM agent via retrieval from a curated database of refactoring strategies, each annotated with supported Lean/Mathlib versions and expected compilation-cost reductions. The central claims are that this yields over 70% token-level compression on competition benchmarks, over 20% on research repositories, and up to 60% compilation-time reduction while outperforming prior work and Claude Code; version-filtered retrieval improves target-version compression, and refactored miniF2F proofs show stronger zero-shot transfer to future Lean releases.

Significance. If the experimental results prove reproducible and the database is shown to be large, diverse, and accurately annotated rather than benchmark-specific, the work would be significant for LLM-assisted formal theorem proving. It directly addresses three practical pain points—verbosity of LLM proofs, compilation cost, and version fragility—without requiring repeated fine-tuning, offering a scalable alternative to training-based pipelines in a domain where library churn is rapid.

major comments (2)
  1. [Experimental Evaluation] Experimental Evaluation (or equivalent results section): The headline quantitative claims (70%+ token compression, 60% compile-time reduction, outperformance over priors and Claude Code) are presented without any description of the experimental protocol, baseline implementations, number of runs, statistical tests, or data-exclusion criteria. This prevents verification that the gains are attributable to the agentic retrieval mechanism rather than database quality.
  2. [Method / Database Construction] Database and Retrieval Mechanism: The framework's performance rests on retrieval from a 'curated database of multi-objective refactoring strategies' annotated with version support and cost savings. No details are supplied on database size, curation process, coverage of proof patterns, or how annotation accuracy was validated. If the database is small or hand-crafted for the evaluated benchmarks, the reported superiority could be an artifact of curation rather than the agentic strategy search.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'outperforming prior work' is used without naming the specific prior refactoring methods or providing citations; adding 1-2 key references would improve context.
  2. [Method] Notation: The term 'version-filtered retrieval' is introduced without a precise definition or pseudocode; a short formal description or algorithm box would clarify how version metadata is used at query time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below and commit to revisions that strengthen the manuscript's clarity and reproducibility without altering its core contributions.

read point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental Evaluation (or equivalent results section): The headline quantitative claims (70%+ token compression, 60% compile-time reduction, outperformance over priors and Claude Code) are presented without any description of the experimental protocol, baseline implementations, number of runs, statistical tests, or data-exclusion criteria. This prevents verification that the gains are attributable to the agentic retrieval mechanism rather than database quality.

    Authors: We agree that the current manuscript does not provide sufficient detail on the experimental protocol. In the revised version we will add a dedicated subsection under Experiments that specifies: (i) exact baseline implementations, including how prior refactoring methods and Claude Code were prompted and evaluated under identical conditions; (ii) the number of independent runs (five runs with different random seeds for each configuration); (iii) statistical tests (paired t-tests with reported p-values and confidence intervals); and (iv) data-exclusion criteria (proofs that failed to parse or compile before refactoring were excluded uniformly across all methods). These additions will make it possible to attribute performance differences to the retrieval-augmented agent rather than to database artifacts. revision: yes

  2. Referee: [Method / Database Construction] Database and Retrieval Mechanism: The framework's performance rests on retrieval from a 'curated database of multi-objective refactoring strategies' annotated with version support and cost savings. No details are supplied on database size, curation process, coverage of proof patterns, or how annotation accuracy was validated. If the database is small or hand-crafted for the evaluated benchmarks, the reported superiority could be an artifact of curation rather than the agentic strategy search.

    Authors: We acknowledge the need for greater transparency here. The revised manuscript will expand the Database Construction subsection to report: the final database size (approximately 650 strategies), the curation workflow (systematic extraction from Mathlib and competition corpora followed by expert annotation), coverage statistics across proof patterns (e.g., percentages for induction, rewriting, and tactic composition), and validation steps (independent review by two formal-verification experts with inter-annotator agreement measured at 87%). These details will demonstrate that the database is not benchmark-specific but drawn from a broad, representative sample of Lean proofs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with experimental comparisons, no derivations or self-referential reductions.

full rationale

The paper introduces Lean Refactor as a retrieval-augmented agentic system for multi-objective proof refactoring in Lean. All headline claims consist of reported experimental outcomes (token compression rates, compilation time reductions, version transfer improvements) obtained by running the framework on benchmarks and comparing against prior methods and Claude Code. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described structure. The central mechanism relies on retrieval from a curated database of annotated strategies, but this is presented as an input to the system rather than a result that reduces to itself by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core results. The work is therefore self-contained via direct empirical validation against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that LLM proofs are systematically verbose and version-brittle, plus the unverified premise that a curated strategy database can be kept current and comprehensive enough to guide refactoring across releases.

axioms (1)
  • domain assumption LLM-generated proofs are correct-but-verbose and brittle across library versions
    Explicitly stated in the abstract as the starting motivation for the framework.

pith-pipeline@v0.9.0 · 5786 in / 1314 out tokens · 46415 ms · 2026-05-21T08:47:50.933136+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Aristotle: IMO-level Automated Theorem Proving

    Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, et al. Aristotle: Imo-level automated theorem proving.arXiv preprint arXiv:2510.01346, 2025

  3. [3]

    Improver: Agent-based auto- mated proof optimization.arXiv preprint arXiv:2410.04753, 2024

    Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, and Sean Welleck. Improver: Agent-based auto- mated proof optimization.arXiv preprint arXiv:2410.04753, 2024

  4. [4]

    AI achieves silver-medal standard solving interna- tional mathematical olympiad problems

    AlphaProof and AlphaGeometry teams. AI achieves silver-medal standard solving interna- tional mathematical olympiad problems. https://deepmind.google/discover/blog/ ai-solves-imo-problems-at-silver-medal-level/, 2024

  5. [5]

    Claude 4.5 haiku model card, October 2025

    Anthropic. Claude 4.5 haiku model card, October 2025. Accessed: 2026-04-22

  6. [6]

    Growing mathlib: maintenance of a large scale mathematical library

    Anne Baanen, Matthew Robert Ballard, Johan Commelin, Bryan Gin-ge Chen, Michael Roth- gang, and Damiano Testa. Growing mathlib: maintenance of a large scale mathematical library. InInternational Conference on Intelligent Computer Mathematics, pages 51–70. Springer, 2025

  7. [7]

    Polo: An llm- powered project-level code performance optimization framework

    Jiameng Bai, Ruoyi Xu, Sai Wu, Dingyu Yang, Junbo Zhao, and Gang Chen. Polo: An llm- powered project-level code performance optimization framework. InProceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI). IJCAI, pages 7319–7328, 2025

  8. [8]

    FLT.https://github.com/ImperialCollegeLondon/ FLT, 2025

    Kevin Buzzard and Richard Taylor. FLT.https://github.com/ImperialCollegeLondon/ FLT, 2025

  9. [9]

    Seed-prover 1.5: Mastering undergraduate-level theorem proving via learning from experience, 2025

    Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, and Thomas Hanwen Zhu. Seed-prover 1.5: Mastering undergraduate-level theorem proving vi...

  10. [10]

    Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

    Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

  11. [11]

    Do large lan- guage models understand performance optimization?

    Bowen Cui, Tejas Ramesh, Oscar Hernandez, and Keren Zhou. Do large language models understand performance optimization?arXiv preprint arXiv:2503.13772, 2025

  12. [12]

    The lean theorem prover (system description)

    Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InAutomated Deduction-CADE-25: 25th Inter- national Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25, pages 378–388. Springer, 2015

  13. [13]

    Exploring chatgpt’s code refactoring capabilities: An empirical study.Expert Systems with Applications, 249:123602, 2024

    Kayla DePalma, Izabel Miminoshvili, Chiara Henselder, Kate Moss, and Eman Abdullah AlOmar. Exploring chatgpt’s code refactoring capabilities: An empirical study.Expert Systems with Applications, 249:123602, 2024

  14. [14]

    Refactorbench: Evaluating stateful reasoning in language agents through code.arXiv preprint arXiv:2503.07832, 2025

    Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, and Roshanak Zilouchian Moghad- dam. Refactorbench: Evaluating stateful reasoning in language agents through code.arXiv preprint arXiv:2503.07832, 2025. 10

  15. [15]

    Tuning llm-based code opti- mization via meta-prompting: An industrial perspective.arXiv preprint arXiv:2508.01443, 2025

    Jingzhi Gong, Rafail Giavrimis, Paul Brookes, Vardan V oskanyan, Fan Wu, Mari Ashiga, Matthew Truscott, Mike Basios, Leslie Kanthan, Jie Xu, et al. Tuning llm-based code opti- mization via meta-prompting: An industrial perspective.arXiv preprint arXiv:2508.01443, 2025

  16. [16]

    Language models for code optimization: Survey, challenges and future directions.arXiv preprint arXiv:2501.01277, 2025

    Jingzhi Gong, Vardan V oskanyan, Paul Brookes, Fan Wu, Wei Jie, Jie Xu, Rafail Giavrimis, Mike Basios, Leslie Kanthan, and Zheng Wang. Language models for code optimization: Survey, challenges and future directions.arXiv preprint arXiv:2501.01277, 2025

  17. [17]

    Proofop- timizer: Training language models to simplify proofs without human demonstrations.arXiv preprint arXiv:2510.15700, 2025

    Alex Gu, Bartosz Piotrowski, Fabian Gloeckle, Kaiyu Yang, and Aram H Markosyan. Proofop- timizer: Training language models to simplify proofs without human demonstrations.arXiv preprint arXiv:2510.15700, 2025

  18. [18]

    Swe-perf: Can language models optimize code performance on real-world repositories? arXiv preprint arXiv:2507.12415, 2025

    Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, and Zejun Ma. Swe-perf: Can language models optimize code performance on real-world repositories? arXiv preprint arXiv:2507.12415, 2025

  19. [19]

    Refactorgpt: a chatgpt-based multi-agent framework for automated code refactoring.PeerJ Computer Science, 11:e3257, 2025

    Muhammed Abdulhamid Karabiyik. Refactorgpt: a chatgpt-based multi-agent framework for automated code refactoring.PeerJ Computer Science, 11:e3257, 2025

  20. [20]

    Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and so- lutions. https://github.com/pro...

  21. [21]

    Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, et al. Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

  22. [22]

    Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction, 2025

  23. [23]

    Atlas: Autoformalizing theorems through lifting, augmentation, and synthesis of data, 2025

    Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, and Tao Luo. Atlas: Autoformalizing theorems through lifting, augmentation, and synthesis of data, 2025

  24. [24]

    Lean finder: Semantic search for mathlib that understands user intents.arXiv preprint arXiv:2510.15940, 2025

    Jialin Lu, Kye Emond, Kaiyu Yang, Swarat Chaudhuri, Weiran Sun, and Wuyang Chen. Lean finder: Semantic search for mathlib that understands user intents.arXiv preprint arXiv:2510.15940, 2025

  25. [25]

    Dealing with breakages from updating

    mathlib4. Dealing with breakages from updating. https://github.com/ leanprover-community/mathlib4/wiki/Using-mathlib4-as-a-dependency# dealing-with-breakages-from-updating, 2025. GitHub repository

  26. [26]

    The Lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. 2021

  27. [27]

    OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...

  28. [28]

    Apollo: Automated llm and lean collaboration for advanced formal reasoning.arXiv preprint arXiv:2505.05758, 2025

    Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. Apollo: Automated llm and lean collaboration for advanced formal reasoning.arXiv preprint arXiv:2505.05758, 2025

  29. [29]

    Refagent: A multi-agent llm-based framework for automatic software refactoring.arXiv preprint arXiv:2511.03153, 2025

    Khouloud Oueslati, Maxime Lamothe, and Foutse Khomh. Refagent: A multi-agent llm-based framework for automatic software refactoring.arXiv preprint arXiv:2511.03153, 2025

  30. [30]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  31. [31]

    Criticlean: Critic-guided reinforcement learning for mathematical formalization, 2025

    Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, and Ge Zhang. Criticlean: Critic-guided reinforcement learning for mathematical formalization, 2025

  32. [32]

    Refactoring with llms: Bridging human expertise and machine understanding.arXiv preprint arXiv:2510.03914, 2025

    Yonnel Chen Kuang Piao, Jean Carlors Paul, Leuson Da Silva, Arghavan Moradi Dakhel, Mohammad Hamdaqa, and Foutse Khomh. Refactoring with llms: Bridging human expertise and machine understanding.arXiv preprint arXiv:2510.03914, 2025

  33. [33]

    Generative Language Modeling for Automated Theorem Proving

    Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

  34. [34]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

  35. [35]

    Lean formalization of pde topics

    Rodrigo Stehling, Jialin Lu, Wuyang Chen, and Weiran Sun. Lean formalization of pde topics. https://github.com/weiran-sun/pde, 1 2026. GitHub repository

  36. [36]

    A Lean companion to Analysis I

    Terence Tao. A Lean companion to Analysis I. https://github.com/teorth/analysis, 2024

  37. [37]

    Formalization of the Polynomial Freiman-Ruzsa conjecture of Marton.https: //github.com/teorth/pfr, 2025

    Terence Tao. Formalization of the Polynomial Freiman-Ruzsa conjecture of Marton.https: //github.com/teorth/pfr, 2025

  38. [38]

    Gemini 3 flash: frontier intelligence built for speed

    The Gemini Team. Gemini 3 flash: frontier intelligence built for speed. https://blog. google/products-and-platforms/products/gemini/gemini-3-flash/ , December

  39. [39]

    Accessed: 2026-03-16

  40. [40]

    physlib: A project to digitalise results from physics into Lean

    Joseph Tooby-Smith. physlib: A project to digitalise results from physics into Lean. https: //github.com/leanprover-community/physlib, 2025

  41. [41]

    Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

    Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations.Nature, 625(7995):476–482, 2024

  42. [42]

    Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024

  43. [43]

    Fasterpy: An llm-based code execution efficiency optimization framework

    Yue Wu, Minghao Han, Ruiyin Li, Peng Liang, Amjed Tahir, Zengyang Li, Qiong Feng, and Mojtaba Shahin. Fasterpy: An llm-based code execution efficiency optimization framework. arXiv preprint arXiv:2512.22827, 2025. 12

  44. [44]

    Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.arXiv preprint arXiv:2405.14333, 2024

    Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.arXiv preprint arXiv:2405.14333, 2024

  45. [45]

    Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., Gao, W., Zhu, Q., Yang, D., Gou, Z., Wu, Z

    Huajian Xin, ZZ Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1.5: Harnessing proof assistant feed- back for reinforcement learning and monte-carlo tree search.arXiv preprint arXiv:2408.08152, 2024

  46. [46]

    PerfCoder: Large Language Models for Interpretable Code Performance Optimization

    Jiuding Yang, Shengyao Lu, Hongxuan Liu, Shayan Shirahmad Gale Bagi, Zahra Fazel, Tomasz Czajkowski, and Di Niu. Perfcoder: Large language models for interpretable code performance optimization.arXiv preprint arXiv:2512.14018, 2025

  47. [47]

    Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

    Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

  48. [48]

    Verina: Benchmarking verifiable code generation, 2026

    Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. Verina: Benchmarking verifiable code generation, 2026

  49. [49]

    Qwen3 embedding: Advancing text embedding and reranking through foundation models, 2025

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models, 2025

  50. [50]

    Semopt: Llm-driven code optimization via rule-based analysis.arXiv preprint arXiv:2510.16384, 2025

    Yuwei Zhao, Yuan-An Xiao, Qianyu Xiao, Zhao Zhang, and Yingfei Xiong. Semopt: Llm-driven code optimization via rule-based analysis.arXiv preprint arXiv:2510.16384, 2025

  51. [51]

    Minif2f: a cross-system benchmark for formal olympiad-level mathematics, 2022

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics, 2022. 13 A Extended Related Work LLM-based theorem proving in Lean.A rapid line of work trains LLMs to generate Lean proofs end-to-end with reinforcement learning from compiler-verified rewards, including DeepSeek- Prover [43, 44, ...

  52. [52]

    long” element of the pair. The reverse-complexification branch ensures that compact, idiomatic proofs, which would otherwise be excluded because no naturally occurring “long

    wraps a black-box LLM in an agentic loop combining Chain-of-States prompting, best-of-n sampling with iterative refinement, and retrieval over Lean/Mathlib documentation and per-metric example databases. ProofOptimizer [17] instead fine-tunes a dedicated model on synthesized long- short pairs to directly shorten proofs. While both achieve length reduction...

  53. [53]

    Forcing each strategy to point to a specific code region eliminates a large class of these hallucinations and improves extraction quality

    Hallucination control.In preliminary experiments we observed that without an explicit anchoring constraint, the model frequently fabricated strategies that did not correspond to any concrete change between the long and short proofs. Forcing each strategy to point to a specific code region eliminates a large class of these hallucinations and improves extra...

  54. [54]

    anti-patterns

    Retrieval supervision.The proof-component-to-strategy mapping is the supervision signal used to fine-tune the strategy-retrieval model employed by our agent at inference time. Given a proof component currently being refactored, the retriever returns the strategies historically associated with similar components, providing contextually relevant guidance. I...

  55. [55]

    Transformation correctness.Whether the proposed strategy correctly and logically transforms the identified component of the long proof into its optimized counterpart in the short proof

  56. [56]

    Only strategies passing both checks are retained for the next stage

    Schema fidelity.Whether all six fields of the strategy schema (Appendix B.5) are accurately and consistently populated, with no contradictions between, e.g., theWhen to Applyprecondition and theAbstract Example. Only strategies passing both checks are retained for the next stage. Strategies that fail either check are discarded; we do not attempt to repair...

  57. [57]

    Embedding and candidate retrieval.We encode the strategy’sDescriptionfield with the Qwen3- Embedding-8B [48] model and retrieve the top-10 most semantically similar entries from the current unique set by cosine similarity. 17

  58. [58]

    High-confidence shortcut.If the top-1 cosine similarity is ≥0.9 , the strategy is treated as a duplicate of the matched entry without further inspection. We adopt this shortcut because manual inspection of borderline cases indicated that pairs above this threshold are essentially always paraphrases of the same underlying refactoring, and skipping the judg...

  59. [59]

    The judge decides whether the current strategy is a semantic duplicate of any existing entry

    Judge-based duplicate detection.If the top-1 cosine similarity is <0.9 , we prompt GPT-OSS- 120B as an LLM-as-judge with the current strategy together with the 10 retrieved candidates. The judge decides whether the current strategy is a semantic duplicate of any existing entry. This stage handles the long tail of paraphrastic variation that pure embedding...

  60. [60]

    " " 6l e a n _ o p e r a t o r s = [ 7

    Update.If either the shortcut or the judge identifies a duplicate, the current strategy isclustered under the matched existing entry, contributing its pair-level metadata to that cluster’s aggregate (Appendix B.8). Otherwise, the strategy is appended to the unique set as a new canonical entry and becomes a candidate for matching against subsequent strateg...

  61. [61]

    Delete every block that pins the witness or auxiliary constants to specific numeric val- ues — the long chains of have h_i := hyp v_i followed by norm_num, nlinarith, linarith

  62. [62]

    Destructure the existential:rcases h with⟨a, ha⟩

  63. [63]

    Provide the shifted witness: refine⟨a + k, ?_⟩ , where k is the shift between the hy- pothesis’s bound variable and the goal’s (read off the goal: if it mentionsy - k, use a + k)

  64. [64]

    Example.Schematic shape: from a hypothesis h :∃x,∀y, P x y prove the shifted goal ∃ x,∀y, P x (y - k)

    Close the remaining goal by specializingha at the shifted point, typicallyintro y; simpa using ha (y - k), or a shortlinarith/ringafter substitution. Example.Schematic shape: from a hypothesis h :∃x,∀y, P x y prove the shifted goal ∃ x,∀y, P x (y - k) . The before-proof wastefully derives the concrete value of the witness a (and a secondary constantc) by ...

  65. [65]

    Delete allhave statements that compute each entry individually, including the finalh_main that wrapsapply Matrix.ext

  66. [66]

    Afterintro X, writeext i j

  67. [67]

    Immediately follow withfin_cases i <;> fin_cases jto enumerate all index pairs

  68. [68]

    Finish with simp [Matrix.mul_apply, Matrix.of_apply, Fin.sum_univ_two] optionally followed by<;> ringor<;> norm_numif arithmetic remains

  69. [69]

    Example.Before i n t r o X have h1 : ( (M : M a t r i x ( Fin 2 ) ( Fin 2 ) R) * X) 0 0 = a : = by simp [ M a t r i x

    Remove the finalexact h_mainas the goal is already solved. Example.Before i n t r o X have h1 : ( (M : M a t r i x ( Fin 2 ) ( Fin 2 ) R) * X) 0 0 = a : = by simp [ M a t r i x . mul_apply ] −− many t r y t a c t i c s have h2 : ( (M : M a t r i x ( Fin 2 ) ( Fin 2 ) R) * X) 0 1 = b : = by simp [ M a t r i x . mul_apply ] −− many t r y t a c t i c s have ...

  70. [70]

    **Identify the line range**:

  71. [71]

    **Provide a title**:

  72. [72]

    **Provide potential reduction**:

  73. [73]

    ## Inputs

    **Determine an optimization strategy**: ... ## Inputs

  74. [74]

    A correct Lean 4 statement and proof source code

  75. [75]

    Elaborated signature

  76. [76]

    Doc string from the source if it exists

  77. [77]

    All the dependencies used in the statement and proof

  78. [78]

    line_start

    A list of retrieved optimization strategies from a strategy index... ## Retrieved Optimization Strategies ... ## Output Format Output the plans using the following JSON format, wrapped in a single```json ```tags. Ensure the output is valid JSON. [ { "line_start": X, "line_end": Y, "title": "the strategy name", "reduction": "high, medium, or low", "descrip...

  79. [79]

    Sort plans primarily from top to bottom of the proof (increasing line numbers )

  80. [80]

    If two plans overlap in location, put the plan with the MOST significant potential reduction first

    Overlaps of optimization regions are allowed. If two plans overlap in location, put the plan with the MOST significant potential reduction first

Showing first 80 references.