pith. machine review for the scientific record. sign in

arxiv: 2509.14274 · v2 · submitted 2025-09-16 · 💻 cs.LG · cs.AI· cs.LO

Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

Pith reviewed 2026-05-18 16:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LO
keywords LLM theorem provingLean 4in-context learningconjecture generationformal proofsautomated discoveryself-improvement
0
0 comments X

The pith

Conditioning LLMs on their own verified theorems and proofs raises the rate at which they discover hard-to-prove statements in Lean.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Conjecturing-Proving Loop, an iterative process in which an LLM generates mathematical conjectures and then attempts to prove them in Lean 4. Each round supplies the model with the theorems and formal proofs it produced and verified in earlier rounds, allowing in-context learning from its own outputs without any parameter updates. The authors argue that this loop yields more successful proofs on difficult statements than approaches that generate a statement and its proof in one pass. A sympathetic reader would care because the method points toward automated expansion of formal mathematics through self-reinforcing cycles rather than external training data or human intervention. Experiments are presented as evidence that proof success improves consistently when the model reuses its prior verified results as context.

Core claim

The Conjecturing-Proving Loop (CPL) increases the discovery rate of hard-to-prove theorems compared to simultaneous statement-and-proof generation by iteratively producing conjectures and then proving them in Lean 4, with each iteration conditioning the LLM on the theorems and formal proofs it generated and verified in previous steps. This conditioning supplies in-context examples that improve proof strategies in a parameter-free manner. The paper supplies both theoretical arguments and experimental results showing that reuse of the model's own verified outputs consistently raises subsequent proof success.

What carries the argument

The Conjecturing-Proving Loop (CPL), an iterative pipeline that alternates conjecture generation with proof attempts while conditioning the LLM on its own previously verified theorems and proofs.

If this is right

  • Reusing the LLM's own formally verified outputs as context improves proof success on later iterations.
  • CPL discovers a larger number of hard-to-prove theorems than frameworks that generate statements and proofs simultaneously.
  • Self-generated in-context examples enable measurable improvement in neural theorem proving without parameter changes or additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Running the loop for many successive iterations could build an expanding internal library that further accelerates discovery.
  • The same self-conditioning pattern might transfer to other formal systems if the proof assistant supplies comparable verified outputs.
  • The approach suggests a route for LLMs to bootstrap mathematical capability on domains where human-provided training examples are scarce.

Load-bearing premise

That supplying the LLM with its own previously generated and verified theorems and proofs produces genuine new proof strategies rather than simple repetition of patterns already present in the training data.

What would settle it

An experiment in which the same conjecture-generation and proof tasks are run once with self-generated verified context and once with either no context or unrelated verified theorems, then measuring whether the self-context condition produces a statistically higher proof success rate.

Figures

Figures reproduced from arXiv: 2509.14274 by Akiyoshi Sannai, Kazumi Kasaura, Masaya Taniguchi, Naoto Onda, Sho Sonoda, Yuta Oriike.

Figure 1
Figure 1. Figure 1: Overview of Conjecturing-Proving Loop. Conjecturer generates conjectures using Library as context, Prover [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of proof length of theorems generated by our framework and the simple loop framework, where [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Success rates of proving the theorem by LLMs. With and without context in Lean ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have demonstrated significant promise in formal theorem proving. In this study, we investigate the ability of LLMs to discover novel theorems and produce verified proofs. We propose a pipeline called \textit{Conjecturing-Proving Loop} (CPL), which iteratively generates mathematical conjectures and attempts to prove them in Lean 4. A key feature of CPL is that each iteration conditions the LLM on previously generated theorems and their formal proofs, enabling parameter-free improvement of proof strategies via in-context learning. We provide both theoretical and experimental evidence that CPL increases the discovery rate of hard-to-prove theorems compared to frameworks that generate statements and proofs simultaneously. Moreover, our experiments show that reusing the LLM's own formally verified outputs as context consistently improves subsequent proof success, demonstrating the effectiveness of self-generated in-context learning for neural theorem proving. The source code is available at https://github.com/auto-res/ConjecturingProvingLoop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Conjecturing-Proving Loop (CPL), an iterative pipeline in which an LLM generates mathematical conjectures and attempts to prove them in Lean 4. Each iteration conditions the model on its own previously generated and formally verified theorems and proofs, enabling parameter-free improvement of proof strategies through in-context learning. The authors assert both theoretical arguments and experimental results showing that CPL raises the discovery rate of hard-to-prove theorems relative to frameworks that generate statements and proofs simultaneously, and that reusing the model's own verified outputs as context consistently improves subsequent proof success.

Significance. If the central claim holds after addressing the experimental controls, the work would provide a concrete demonstration of self-generated in-context learning for neural theorem proving, offering a reproducible route to improved discovery rates without additional training. The public release of source code is a clear strength that supports verification and extension.

major comments (2)
  1. [§4.2 and Table 2] §4.2 (Experimental Setup) and Table 2: The simultaneous-generation baselines lack an equivalent iterative structure with accumulating context. Without an ablation that holds the number of LLM calls and rounds fixed while varying only the content of the context (verified self-generated theorems vs. empty or unrelated prompts), the reported gains in discovery rate cannot be unambiguously attributed to the in-context learning mechanism rather than the iterative loop itself.
  2. [Abstract and §3] Abstract and §3 (CPL Description): The theoretical evidence for genuine strategy improvement (as opposed to repetition of patterns already present in the training distribution) is asserted without a derivation or proof sketch. A concrete argument showing why conditioning on the model's own verified outputs produces new proof tactics would be needed to support the parameter-free improvement claim.
minor comments (2)
  1. [Figure 3] Figure 3: Axis labels and legend entries for the different methods are too small to read clearly in the printed version.
  2. [§5.1] §5.1: The definition of 'hard-to-prove' theorems is given only qualitatively; an explicit criterion (e.g., proof length or search depth threshold) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the experimental controls and theoretical grounding of the Conjecturing-Proving Loop. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§4.2 and Table 2] §4.2 (Experimental Setup) and Table 2: The simultaneous-generation baselines lack an equivalent iterative structure with accumulating context. Without an ablation that holds the number of LLM calls and rounds fixed while varying only the content of the context (verified self-generated theorems vs. empty or unrelated prompts), the reported gains in discovery rate cannot be unambiguously attributed to the in-context learning mechanism rather than the iterative loop itself.

    Authors: We agree that the current experimental design does not fully isolate the contribution of accumulating verified context from the mere presence of an iterative loop. To address this directly, we will add a controlled ablation in the revised §4.2 and Table 2. This ablation will hold the total number of LLM calls and iteration rounds constant across conditions, comparing (i) the full CPL with accumulating self-generated verified theorems and proofs as context against (ii) an otherwise identical iterative baseline that uses either empty context or unrelated filler prompts. The results will be reported alongside the existing baselines to show whether gains persist when only the context content varies. revision: yes

  2. Referee: [Abstract and §3] Abstract and §3 (CPL Description): The theoretical evidence for genuine strategy improvement (as opposed to repetition of patterns already present in the training distribution) is asserted without a derivation or proof sketch. A concrete argument showing why conditioning on the model's own verified outputs produces new proof tactics would be needed to support the parameter-free improvement claim.

    Authors: We acknowledge that the current presentation of the theoretical argument is high-level and would benefit from an explicit sketch. In the revised §3 we will insert a concise derivation: conditioning on the model's own verified proofs supplies in-context examples of tactic sequences that succeeded on novel, self-generated conjectures. Because these proofs are produced and verified within the loop, they can contain combinations of Lean tactics that are not direct repetitions of training data but rather adaptations to the specific conjecture distribution encountered so far. This allows the model to reference and recombine successful sub-proofs (e.g., a particular induction tactic or lemma application) when facing a new conjecture, yielding measurable improvements in proof success rate without parameter updates. We will illustrate the argument with a short example of tactic reuse across iterations. revision: yes

Circularity Check

0 steps flagged

No significant circularity in CPL empirical claims

full rationale

The paper's central claims rest on an empirical pipeline (CPL) that iteratively generates conjectures and proofs in Lean 4 while conditioning on its own verified prior outputs. Evidence consists of direct comparisons to simultaneous-generation baselines and measured improvements in proof success rates from actual formal verifications. No load-bearing step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the in-context mechanism is tested via observable, externally verifiable Lean outputs rather than internal re-labeling of inputs as predictions. The evaluation therefore remains independent of the method's own generated content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of in-context learning from self-generated Lean proofs and on the assumption that the LLM can leverage that context to improve proof success rates. No explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption LLMs can generate syntactically valid Lean statements and proofs when prompted with prior verified theorems.
    Invoked implicitly when the loop conditions on previous outputs.

pith-pipeline@v0.9.0 · 5718 in / 1254 out tokens · 40516 ms · 2026-05-18T16:04:37.465273+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1]

    Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  2. [2]

    STP: Self-play LLM theorem provers with iterative conjecturing and proving, 2025

    Kefan Dong and Tengyu Ma. STP: Self-play LLM theorem provers with iterative conjecturing and proving, 2025

  3. [3]

    Survey on preopen sets

    Julian Dontchev. Survey on preopen sets.arXiv preprint math/9810177, 1998

  4. [4]

    Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level.Proceedings of the National Academy of Sciences, 119(32):e2123433119, 2022

  5. [5]

    Baldur: Whole-proof generation and repair with large language models

    Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241, 2023

  6. [6]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning LLM from zero data.arXiv preprint arXiv:2508.05004, 2025

  7. [7]

    Goedel-prover: A frontier model for open-source automated theorem proving

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving. In Second Conference on Language Modeling, 2025

  8. [8]

    Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

  9. [9]

    A survey of in-context reinforcement learning.arXiv preprint arXiv:2502.07978, 2025

    Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, and Shangtong Zhang. A survey of in-context reinforcement learning.arXiv preprint arXiv:2502.07978, 2025

  10. [10]

    On some classes of nearly open sets.Pacific journal of mathematics, 15(3):961–970, 1965

    Olav Nj ˙astad. On some classes of nearly open sets.Pacific journal of mathematics, 15(3):961–970, 1965

  11. [11]

    Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition, 2025

  12. [12]

    An in-context learning agent for formal theorem-proving.arXiv preprint arXiv:2310.04353, 2023

    Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv preprint arXiv:2310.04353, 2023

  13. [13]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  14. [14]

    Mathsmith: Towards extremely hard mathematical reasoning by forging synthetic problems with a reinforced policy.arXiv preprint arXiv:2508.05592, 2025

    Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, and Fei Tang. Mathsmith: Towards extremely hard mathematical reasoning by forging synthetic problems with a reinforced policy.arXiv preprint arXiv:2508.05592, 2025

  15. [15]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. A Generated Proof of Focused Theorem theorem intersection_of_alpha_open_sets_is_alpha_open {A B : Set...