arxiv: 2509.14274 · v2 · submitted 2025-09-16 · 💻 cs.LG · cs.AI· cs.LO

Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

Kazumi Kasaura , Naoto Onda , Yuta Oriike , Masaya Taniguchi , Akiyoshi Sannai , Sho Sonoda This is my paper

Pith reviewed 2026-05-18 16:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LO

keywords LLM theorem provingLean 4in-context learningconjecture generationformal proofsautomated discoveryself-improvement

0 comments

The pith

Conditioning LLMs on their own verified theorems and proofs raises the rate at which they discover hard-to-prove statements in Lean.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Conjecturing-Proving Loop, an iterative process in which an LLM generates mathematical conjectures and then attempts to prove them in Lean 4. Each round supplies the model with the theorems and formal proofs it produced and verified in earlier rounds, allowing in-context learning from its own outputs without any parameter updates. The authors argue that this loop yields more successful proofs on difficult statements than approaches that generate a statement and its proof in one pass. A sympathetic reader would care because the method points toward automated expansion of formal mathematics through self-reinforcing cycles rather than external training data or human intervention. Experiments are presented as evidence that proof success improves consistently when the model reuses its prior verified results as context.

Core claim

The Conjecturing-Proving Loop (CPL) increases the discovery rate of hard-to-prove theorems compared to simultaneous statement-and-proof generation by iteratively producing conjectures and then proving them in Lean 4, with each iteration conditioning the LLM on the theorems and formal proofs it generated and verified in previous steps. This conditioning supplies in-context examples that improve proof strategies in a parameter-free manner. The paper supplies both theoretical arguments and experimental results showing that reuse of the model's own verified outputs consistently raises subsequent proof success.

What carries the argument

The Conjecturing-Proving Loop (CPL), an iterative pipeline that alternates conjecture generation with proof attempts while conditioning the LLM on its own previously verified theorems and proofs.

If this is right

Reusing the LLM's own formally verified outputs as context improves proof success on later iterations.
CPL discovers a larger number of hard-to-prove theorems than frameworks that generate statements and proofs simultaneously.
Self-generated in-context examples enable measurable improvement in neural theorem proving without parameter changes or additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Running the loop for many successive iterations could build an expanding internal library that further accelerates discovery.
The same self-conditioning pattern might transfer to other formal systems if the proof assistant supplies comparable verified outputs.
The approach suggests a route for LLMs to bootstrap mathematical capability on domains where human-provided training examples are scarce.

Load-bearing premise

That supplying the LLM with its own previously generated and verified theorems and proofs produces genuine new proof strategies rather than simple repetition of patterns already present in the training data.

What would settle it

An experiment in which the same conjecture-generation and proof tasks are run once with self-generated verified context and once with either no context or unrelated verified theorems, then measuring whether the self-context condition produces a statistically higher proof success rate.

Figures

Figures reproduced from arXiv: 2509.14274 by Akiyoshi Sannai, Kazumi Kasaura, Masaya Taniguchi, Naoto Onda, Sho Sonoda, Yuta Oriike.

**Figure 1.** Figure 1: Overview of Conjecturing-Proving Loop. Conjecturer generates conjectures using Library as context, Prover [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Distribution of proof length of theorems generated by our framework and the simple loop framework, where [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Success rates of proving the theorem by LLMs. With and without context in Lean ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have demonstrated significant promise in formal theorem proving. In this study, we investigate the ability of LLMs to discover novel theorems and produce verified proofs. We propose a pipeline called \textit{Conjecturing-Proving Loop} (CPL), which iteratively generates mathematical conjectures and attempts to prove them in Lean 4. A key feature of CPL is that each iteration conditions the LLM on previously generated theorems and their formal proofs, enabling parameter-free improvement of proof strategies via in-context learning. We provide both theoretical and experimental evidence that CPL increases the discovery rate of hard-to-prove theorems compared to frameworks that generate statements and proofs simultaneously. Moreover, our experiments show that reusing the LLM's own formally verified outputs as context consistently improves subsequent proof success, demonstrating the effectiveness of self-generated in-context learning for neural theorem proving. The source code is available at https://github.com/auto-res/ConjecturingProvingLoop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CPL adds an iterative loop that feeds the LLM its own verified Lean theorems as context, showing gains over simultaneous generation but without clear ablations to isolate the in-context effect from extra rounds.

read the letter

The paper introduces the Conjecturing-Proving Loop, where an LLM generates conjectures in Lean 4, attempts proofs, and then reuses successful verified outputs as in-context examples for later iterations. This setup is positioned as a way to improve discovery of hard theorems without parameter updates, and the experiments report higher success rates than baselines that generate statements and proofs together in one pass. The public GitHub code is a clear positive, letting others reproduce the prompting and Lean integration directly. That openness makes the practical side easier to evaluate than many similar efforts. The central experimental claim is that self-generated verified context helps more than starting fresh each time, which aligns with the idea of building on what the model has already confirmed. The theoretical sketch in the abstract tries to explain why this loop might explore better than non-iterative methods. One soft spot is the comparison setup. The baselines appear to lack the same multi-round structure with accumulating context, so any lift could trace to simply allowing more total calls or refined prompts rather than the specific content of the verified theorems. An ablation that holds iteration count fixed and swaps only the context source would tighten the attribution. The paper is aimed at researchers working on LLM-driven formal mathematics and automated exploration. Readers who already run Lean-based experiments will get the most from the pipeline details and the reported consistency of the self-context gains. It has enough concrete method and open resources to justify sending to peer review, even if the controls around iteration versus context need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Conjecturing-Proving Loop (CPL), an iterative pipeline in which an LLM generates mathematical conjectures and attempts to prove them in Lean 4. Each iteration conditions the model on its own previously generated and formally verified theorems and proofs, enabling parameter-free improvement of proof strategies through in-context learning. The authors assert both theoretical arguments and experimental results showing that CPL raises the discovery rate of hard-to-prove theorems relative to frameworks that generate statements and proofs simultaneously, and that reusing the model's own verified outputs as context consistently improves subsequent proof success.

Significance. If the central claim holds after addressing the experimental controls, the work would provide a concrete demonstration of self-generated in-context learning for neural theorem proving, offering a reproducible route to improved discovery rates without additional training. The public release of source code is a clear strength that supports verification and extension.

major comments (2)

[§4.2 and Table 2] §4.2 (Experimental Setup) and Table 2: The simultaneous-generation baselines lack an equivalent iterative structure with accumulating context. Without an ablation that holds the number of LLM calls and rounds fixed while varying only the content of the context (verified self-generated theorems vs. empty or unrelated prompts), the reported gains in discovery rate cannot be unambiguously attributed to the in-context learning mechanism rather than the iterative loop itself.
[Abstract and §3] Abstract and §3 (CPL Description): The theoretical evidence for genuine strategy improvement (as opposed to repetition of patterns already present in the training distribution) is asserted without a derivation or proof sketch. A concrete argument showing why conditioning on the model's own verified outputs produces new proof tactics would be needed to support the parameter-free improvement claim.

minor comments (2)

[Figure 3] Figure 3: Axis labels and legend entries for the different methods are too small to read clearly in the printed version.
[§5.1] §5.1: The definition of 'hard-to-prove' theorems is given only qualitatively; an explicit criterion (e.g., proof length or search depth threshold) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the experimental controls and theoretical grounding of the Conjecturing-Proving Loop. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [§4.2 and Table 2] §4.2 (Experimental Setup) and Table 2: The simultaneous-generation baselines lack an equivalent iterative structure with accumulating context. Without an ablation that holds the number of LLM calls and rounds fixed while varying only the content of the context (verified self-generated theorems vs. empty or unrelated prompts), the reported gains in discovery rate cannot be unambiguously attributed to the in-context learning mechanism rather than the iterative loop itself.

Authors: We agree that the current experimental design does not fully isolate the contribution of accumulating verified context from the mere presence of an iterative loop. To address this directly, we will add a controlled ablation in the revised §4.2 and Table 2. This ablation will hold the total number of LLM calls and iteration rounds constant across conditions, comparing (i) the full CPL with accumulating self-generated verified theorems and proofs as context against (ii) an otherwise identical iterative baseline that uses either empty context or unrelated filler prompts. The results will be reported alongside the existing baselines to show whether gains persist when only the context content varies. revision: yes
Referee: [Abstract and §3] Abstract and §3 (CPL Description): The theoretical evidence for genuine strategy improvement (as opposed to repetition of patterns already present in the training distribution) is asserted without a derivation or proof sketch. A concrete argument showing why conditioning on the model's own verified outputs produces new proof tactics would be needed to support the parameter-free improvement claim.

Authors: We acknowledge that the current presentation of the theoretical argument is high-level and would benefit from an explicit sketch. In the revised §3 we will insert a concise derivation: conditioning on the model's own verified proofs supplies in-context examples of tactic sequences that succeeded on novel, self-generated conjectures. Because these proofs are produced and verified within the loop, they can contain combinations of Lean tactics that are not direct repetitions of training data but rather adaptations to the specific conjecture distribution encountered so far. This allows the model to reference and recombine successful sub-proofs (e.g., a particular induction tactic or lemma application) when facing a new conjecture, yielding measurable improvements in proof success rate without parameter updates. We will illustrate the argument with a short example of tactic reuse across iterations. revision: yes

Circularity Check

0 steps flagged

No significant circularity in CPL empirical claims

full rationale

The paper's central claims rest on an empirical pipeline (CPL) that iteratively generates conjectures and proofs in Lean 4 while conditioning on its own verified prior outputs. Evidence consists of direct comparisons to simultaneous-generation baselines and measured improvements in proof success rates from actual formal verifications. No load-bearing step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the in-context mechanism is tested via observable, externally verifiable Lean outputs rather than internal re-labeling of inputs as predictions. The evaluation therefore remains independent of the method's own generated content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of in-context learning from self-generated Lean proofs and on the assumption that the LLM can leverage that context to improve proof success rates. No explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption LLMs can generate syntactically valid Lean statements and proofs when prompted with prior verified theorems.
Invoked implicitly when the loop conditions on previous outputs.

pith-pipeline@v0.9.0 · 5718 in / 1254 out tokens · 40516 ms · 2026-05-18T16:04:37.465273+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

[1]

Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page 2025
[2]

STP: Self-play LLM theorem provers with iterative conjecturing and proving, 2025

Kefan Dong and Tengyu Ma. STP: Self-play LLM theorem provers with iterative conjecturing and proving, 2025

work page 2025
[3]

Survey on preopen sets

Julian Dontchev. Survey on preopen sets.arXiv preprint math/9810177, 1998

work page internal anchor Pith review Pith/arXiv arXiv 1998
[4]

Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level.Proceedings of the National Academy of Sciences, 119(32):e2123433119, 2022

work page 2022
[5]

Baldur: Whole-proof generation and repair with large language models

Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023

work page 2023
[6]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning LLM from zero data.arXiv preprint arXiv:2508.05004, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Goedel-prover: A frontier model for open-source automated theorem proving

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving. In Second Conference on Language Modeling, 2025

work page 2025
[8]

Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

work page arXiv 2025
[9]

A survey of in-context reinforcement learning.arXiv preprint arXiv:2502.07978, 2025

Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, and Shangtong Zhang. A survey of in-context reinforcement learning.arXiv preprint arXiv:2502.07978, 2025

work page arXiv 2025
[10]

On some classes of nearly open sets.Pacific journal of mathematics, 15(3):961–970, 1965

Olav Nj ˙astad. On some classes of nearly open sets.Pacific journal of mathematics, 15(3):961–970, 1965

work page 1965
[11]

Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition, 2025

work page 2025
[12]

An in-context learning agent for formal theorem-proving.arXiv preprint arXiv:2310.04353, 2023

Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv preprint arXiv:2310.04353, 2023

work page arXiv 2023
[13]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[14]

Mathsmith: Towards extremely hard mathematical reasoning by forging synthetic problems with a reinforced policy.arXiv preprint arXiv:2508.05592, 2025

Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, and Fei Tang. Mathsmith: Towards extremely hard mathematical reasoning by forging synthetic problems with a reinforced policy.arXiv preprint arXiv:2508.05592, 2025

work page arXiv 2025
[15]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. A Generated Proof of Focused Theorem theorem intersection_of_alpha_open_sets_is_alpha_open {A B : Set...

work page internal anchor Pith review Pith/arXiv arXiv 2022