Discovering New Theorems via LLMs with In-Context Proof Learning in Lean
Pith reviewed 2026-05-18 16:04 UTC · model grok-4.3
The pith
Conditioning LLMs on their own verified theorems and proofs raises the rate at which they discover hard-to-prove statements in Lean.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Conjecturing-Proving Loop (CPL) increases the discovery rate of hard-to-prove theorems compared to simultaneous statement-and-proof generation by iteratively producing conjectures and then proving them in Lean 4, with each iteration conditioning the LLM on the theorems and formal proofs it generated and verified in previous steps. This conditioning supplies in-context examples that improve proof strategies in a parameter-free manner. The paper supplies both theoretical arguments and experimental results showing that reuse of the model's own verified outputs consistently raises subsequent proof success.
What carries the argument
The Conjecturing-Proving Loop (CPL), an iterative pipeline that alternates conjecture generation with proof attempts while conditioning the LLM on its own previously verified theorems and proofs.
If this is right
- Reusing the LLM's own formally verified outputs as context improves proof success on later iterations.
- CPL discovers a larger number of hard-to-prove theorems than frameworks that generate statements and proofs simultaneously.
- Self-generated in-context examples enable measurable improvement in neural theorem proving without parameter changes or additional training.
Where Pith is reading between the lines
- Running the loop for many successive iterations could build an expanding internal library that further accelerates discovery.
- The same self-conditioning pattern might transfer to other formal systems if the proof assistant supplies comparable verified outputs.
- The approach suggests a route for LLMs to bootstrap mathematical capability on domains where human-provided training examples are scarce.
Load-bearing premise
That supplying the LLM with its own previously generated and verified theorems and proofs produces genuine new proof strategies rather than simple repetition of patterns already present in the training data.
What would settle it
An experiment in which the same conjecture-generation and proof tasks are run once with self-generated verified context and once with either no context or unrelated verified theorems, then measuring whether the self-context condition produces a statistically higher proof success rate.
Figures
read the original abstract
Large Language Models (LLMs) have demonstrated significant promise in formal theorem proving. In this study, we investigate the ability of LLMs to discover novel theorems and produce verified proofs. We propose a pipeline called \textit{Conjecturing-Proving Loop} (CPL), which iteratively generates mathematical conjectures and attempts to prove them in Lean 4. A key feature of CPL is that each iteration conditions the LLM on previously generated theorems and their formal proofs, enabling parameter-free improvement of proof strategies via in-context learning. We provide both theoretical and experimental evidence that CPL increases the discovery rate of hard-to-prove theorems compared to frameworks that generate statements and proofs simultaneously. Moreover, our experiments show that reusing the LLM's own formally verified outputs as context consistently improves subsequent proof success, demonstrating the effectiveness of self-generated in-context learning for neural theorem proving. The source code is available at https://github.com/auto-res/ConjecturingProvingLoop.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Conjecturing-Proving Loop (CPL), an iterative pipeline in which an LLM generates mathematical conjectures and attempts to prove them in Lean 4. Each iteration conditions the model on its own previously generated and formally verified theorems and proofs, enabling parameter-free improvement of proof strategies through in-context learning. The authors assert both theoretical arguments and experimental results showing that CPL raises the discovery rate of hard-to-prove theorems relative to frameworks that generate statements and proofs simultaneously, and that reusing the model's own verified outputs as context consistently improves subsequent proof success.
Significance. If the central claim holds after addressing the experimental controls, the work would provide a concrete demonstration of self-generated in-context learning for neural theorem proving, offering a reproducible route to improved discovery rates without additional training. The public release of source code is a clear strength that supports verification and extension.
major comments (2)
- [§4.2 and Table 2] §4.2 (Experimental Setup) and Table 2: The simultaneous-generation baselines lack an equivalent iterative structure with accumulating context. Without an ablation that holds the number of LLM calls and rounds fixed while varying only the content of the context (verified self-generated theorems vs. empty or unrelated prompts), the reported gains in discovery rate cannot be unambiguously attributed to the in-context learning mechanism rather than the iterative loop itself.
- [Abstract and §3] Abstract and §3 (CPL Description): The theoretical evidence for genuine strategy improvement (as opposed to repetition of patterns already present in the training distribution) is asserted without a derivation or proof sketch. A concrete argument showing why conditioning on the model's own verified outputs produces new proof tactics would be needed to support the parameter-free improvement claim.
minor comments (2)
- [Figure 3] Figure 3: Axis labels and legend entries for the different methods are too small to read clearly in the printed version.
- [§5.1] §5.1: The definition of 'hard-to-prove' theorems is given only qualitatively; an explicit criterion (e.g., proof length or search depth threshold) would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the experimental controls and theoretical grounding of the Conjecturing-Proving Loop. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [§4.2 and Table 2] §4.2 (Experimental Setup) and Table 2: The simultaneous-generation baselines lack an equivalent iterative structure with accumulating context. Without an ablation that holds the number of LLM calls and rounds fixed while varying only the content of the context (verified self-generated theorems vs. empty or unrelated prompts), the reported gains in discovery rate cannot be unambiguously attributed to the in-context learning mechanism rather than the iterative loop itself.
Authors: We agree that the current experimental design does not fully isolate the contribution of accumulating verified context from the mere presence of an iterative loop. To address this directly, we will add a controlled ablation in the revised §4.2 and Table 2. This ablation will hold the total number of LLM calls and iteration rounds constant across conditions, comparing (i) the full CPL with accumulating self-generated verified theorems and proofs as context against (ii) an otherwise identical iterative baseline that uses either empty context or unrelated filler prompts. The results will be reported alongside the existing baselines to show whether gains persist when only the context content varies. revision: yes
-
Referee: [Abstract and §3] Abstract and §3 (CPL Description): The theoretical evidence for genuine strategy improvement (as opposed to repetition of patterns already present in the training distribution) is asserted without a derivation or proof sketch. A concrete argument showing why conditioning on the model's own verified outputs produces new proof tactics would be needed to support the parameter-free improvement claim.
Authors: We acknowledge that the current presentation of the theoretical argument is high-level and would benefit from an explicit sketch. In the revised §3 we will insert a concise derivation: conditioning on the model's own verified proofs supplies in-context examples of tactic sequences that succeeded on novel, self-generated conjectures. Because these proofs are produced and verified within the loop, they can contain combinations of Lean tactics that are not direct repetitions of training data but rather adaptations to the specific conjecture distribution encountered so far. This allows the model to reference and recombine successful sub-proofs (e.g., a particular induction tactic or lemma application) when facing a new conjecture, yielding measurable improvements in proof success rate without parameter updates. We will illustrate the argument with a short example of tactic reuse across iterations. revision: yes
Circularity Check
No significant circularity in CPL empirical claims
full rationale
The paper's central claims rest on an empirical pipeline (CPL) that iteratively generates conjectures and proofs in Lean 4 while conditioning on its own verified prior outputs. Evidence consists of direct comparisons to simultaneous-generation baselines and measured improvements in proof success rates from actual formal verifications. No load-bearing step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the in-context mechanism is tested via observable, externally verifiable Lean outputs rather than internal re-labeling of inputs as predictions. The evaluation therefore remains independent of the method's own generated content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can generate syntactically valid Lean statements and proofs when prompted with prior verified theorems.
Reference graph
Works this paper leans on
-
[1]
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...
work page 2025
-
[2]
STP: Self-play LLM theorem provers with iterative conjecturing and proving, 2025
Kefan Dong and Tengyu Ma. STP: Self-play LLM theorem provers with iterative conjecturing and proving, 2025
work page 2025
-
[3]
Julian Dontchev. Survey on preopen sets.arXiv preprint math/9810177, 1998
work page internal anchor Pith review Pith/arXiv arXiv 1998
-
[4]
Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level.Proceedings of the National Academy of Sciences, 119(32):e2123433119, 2022
work page 2022
-
[5]
Baldur: Whole-proof generation and repair with large language models
Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023
work page 2023
-
[6]
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning LLM from zero data.arXiv preprint arXiv:2508.05004, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Goedel-prover: A frontier model for open-source automated theorem proving
Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving. In Second Conference on Language Modeling, 2025
work page 2025
-
[8]
Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025
-
[9]
A survey of in-context reinforcement learning.arXiv preprint arXiv:2502.07978, 2025
Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, and Shangtong Zhang. A survey of in-context reinforcement learning.arXiv preprint arXiv:2502.07978, 2025
-
[10]
On some classes of nearly open sets.Pacific journal of mathematics, 15(3):961–970, 1965
Olav Nj ˙astad. On some classes of nearly open sets.Pacific journal of mathematics, 15(3):961–970, 1965
work page 1965
-
[11]
Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition, 2025
work page 2025
-
[12]
An in-context learning agent for formal theorem-proving.arXiv preprint arXiv:2310.04353, 2023
Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv preprint arXiv:2310.04353, 2023
-
[13]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[14]
Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, and Fei Tang. Mathsmith: Towards extremely hard mathematical reasoning by forging synthetic problems with a reinforced policy.arXiv preprint arXiv:2508.05592, 2025
-
[15]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. A Generated Proof of Focused Theorem theorem intersection_of_alpha_open_sets_is_alpha_open {A B : Set...
work page internal anchor Pith review Pith/arXiv arXiv 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.