pith. sign in

arxiv: 2606.28747 · v1 · pith:UO3LRYVFnew · submitted 2026-06-27 · 💻 cs.AI · cs.LG

Self-Supervised Theorem Discovery in a Formal Axiomatic System

Pith reviewed 2026-06-30 09:45 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords self-supervised theorem discoveryformal axiomatic systemproof searchtheorem extractionautomated reasoningmathematical library growthLLM prompting with lemmas
0
0 comments X

The pith

An agent starting only from axioms and rules discovers tens of thousands of theorems that prove human benchmarks and improve LLM performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a self-supervised loop in which an agent alternates between searching for proofs with its current library and extracting new theorems from successful proofs. Starting with no human theorems, the process grows a library whose entries serve as reusable lemmas for later searches. Experiments demonstrate that the resulting theorems enable proofs of human-written benchmark problems and raise success rates when supplied as prompt lemmas to language models. This supplies direct evidence that useful formal knowledge can be generated internally rather than supplied externally.

Core claim

By repeatedly running proof search on unsolved problems and then extracting theorems judged useful from the proofs found, an agent builds a growing library that lets it solve an increasing fraction of problems, including a set of human-written benchmarks, while the same library also raises the success rate of LLM-based provers when the extracted theorems are inserted into prompts.

What carries the argument

The alternating proof-search and useful-theorem-extraction loop that re-uses extracted theorems as lemmas in subsequent searches.

If this is right

  • The agent produces a library of tens of thousands of theorems without any initial human examples.
  • Proofs of human-written benchmark problems become reachable using only the internally generated library.
  • Inserting the extracted theorems into LLM prompts measurably raises proof success rates on the same benchmarks.
  • The overall procedure demonstrates that a closed loop of search and extraction can enlarge the set of provable statements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop could be run continuously, letting the library grow without a fixed stopping point.
  • Theorems discovered this way might be filtered or ranked by generality to produce compact core libraries for new domains.
  • If the extraction criterion is made independent of the benchmark set, the method could serve as a generator of training data for downstream provers.

Load-bearing premise

The rule that decides which theorems count as useful during extraction yields lemmas that remain helpful on problems and models outside the original search runs.

What would settle it

Running the same extraction procedure on a different formal system or benchmark set and finding that the extracted theorems give no improvement in proof success rate or LLM prompting accuracy.

Figures

Figures reproduced from arXiv: 2606.28747 by Kazuki Ota, Takayuki Osa, Tatsuya Harada.

Figure 1
Figure 1. Figure 1: Hilbert system and proof tree for A → A. Top: the three axioms and Modus Ponens. The axioms respectively mean weakening, distribution of implication, and contraposition. Bottom: a proof tree deriving A → A from two Ax1 instances, one Ax2 instance, and two Modus Ponens steps. for the theorem A → A. It uses two instances of Ax1, one instance of Ax2, and two applications of Modus Ponens. In this way, a Hilber… view at source ↗
Figure 2
Figure 2. Figure 2: Stack-machine execution for the proof tree of A → A. The proof tree in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example proof-search episode used to generate goals and training data. Marked states are single-formula stacks, and the corresponding formulas f1, f3, and f7 are added to the goal buffer G. For each reached theorem, we use all preceding state-action prefixes as training examples: Dtrain ← Dtrain ∪ [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Number of discovered theorems in the Hilbert axiom system. The count increases over generations, indicating that the agent can grow a theorem library within the Hilbert axiom system [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: compares the proof success rate with and without the extracted theorems provided as prompt lemmas. For GPT-4o and the GPT-5 variants, proof success rates improve when the extracted theorems are provided as prompt lemmas. This result shows that the extracted theorems can help LLMs find proofs in the Hilbert system. Overall, these results show that theorems extracted through our approach function as transfer… view at source ↗
read the original abstract

Recent artificial intelligence (AI) systems have shown remarkable progress in mathematical reasoning. Many existing approaches, including large language models (LLMs), draw on human prior knowledge in the form of mathematical text, code, or theorem libraries. Although these approaches are highly effective in practice, it remains an open question whether an agent can autonomously discover useful theorems without such human priors. We study this question in a formal axiomatic system by developing an agent that starts from axioms and inference rules alone and gradually grows a library of useful theorems. Concretely, we propose a self-supervised theorem-discovery algorithm that alternates between proof search and useful-theorem extraction, building a theorem library whose entries are reused as lemmas for subsequent proof search. Experiments show that the agent discovers tens of thousands of theorems and finds proofs for human-written benchmark problems, suggesting that its discoveries include theorems meaningful from a human mathematical perspective. Furthermore, the discovered theorems improve LLM proof performance when provided as prompt lemmas, indicating that they can serve as external knowledge for LLM reasoning. Our results provide evidence that useful theorems can emerge from proof search without relying on human-provided theorem libraries. More broadly, they suggest a path toward self-evolving AI systems for mathematics whose discoveries remain formally verifiable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents a self-supervised theorem-discovery algorithm in a formal axiomatic system that begins with axioms and inference rules alone. The agent alternates between proof search and extraction of 'useful' theorems to grow a reusable library, claiming to discover tens of thousands of theorems, prove human-written benchmark problems, and improve LLM performance when the theorems are supplied as prompt lemmas. This is offered as evidence that useful theorems can emerge from proof search without human-provided libraries.

Significance. If the results hold under rigorous controls, the work would be significant for demonstrating autonomous, formally verifiable mathematical discovery in AI. The emphasis on machine-checkable proofs and the reported transfer to LLM prompting are strengths that distinguish it from purely heuristic approaches.

major comments (3)
  1. [Methods] Methods section: the precise scoring function and criterion used to label theorems as 'useful' during extraction are not defined, preventing assessment of whether the labels reflect generalizable mathematical utility or search-procedure artifacts (e.g., frequency within the agent's own trajectories).
  2. [Experiments] Experiments section: no baselines, statistical controls, number of discarded theorems, or ablation results comparing the full pipeline against proof search alone are reported, which is load-bearing for the central claim that the extracted theorems improve performance on held-out human benchmarks and LLM prompting.
  3. [Results] Results section: the generalization claim (that discovered theorems aid proof search on human-written benchmarks rather than reflecting the specific search distribution) rests on an untested assumption about the extraction criterion; without the missing ablations or controls, the experimental outcomes cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: the formal axiomatic system and exact benchmark set are not named, which would aid immediate context for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in methodological detail and experimental rigor that we will address through revisions. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Methods] Methods section: the precise scoring function and criterion used to label theorems as 'useful' during extraction are not defined, preventing assessment of whether the labels reflect generalizable mathematical utility or search-procedure artifacts (e.g., frequency within the agent's own trajectories).

    Authors: We agree that the precise scoring function and extraction criterion must be explicitly defined. The original manuscript described the process at a high level without the mathematical formulation. In the revised version we will add the exact scoring function (including any weighting of frequency, proof length, or reuse potential) and the decision threshold used to label theorems as useful. revision: yes

  2. Referee: [Experiments] Experiments section: no baselines, statistical controls, number of discarded theorems, or ablation results comparing the full pipeline against proof search alone are reported, which is load-bearing for the central claim that the extracted theorems improve performance on held-out human benchmarks and LLM prompting.

    Authors: The referee correctly identifies that these controls are necessary to support the central claim. We will expand the Experiments section to report: (i) a baseline of pure proof search without any theorem extraction, (ii) statistical measures such as mean and variance across multiple independent runs, (iii) the total number of candidate theorems considered and the number ultimately discarded, and (iv) ablation experiments that isolate the contribution of the extracted library. revision: yes

  3. Referee: [Results] Results section: the generalization claim (that discovered theorems aid proof search on human-written benchmarks rather than reflecting the specific search distribution) rests on an untested assumption about the extraction criterion; without the missing ablations or controls, the experimental outcomes cannot be evaluated.

    Authors: We accept that the generalization claim cannot be fully evaluated without the requested ablations. Once the new baseline and ablation results are included, we will revise the Results section to present those findings and, where appropriate, qualify the interpretation of generalization. If the ablations indicate that gains are largely artifacts of the search distribution, we will adjust the strength of the claim accordingly. revision: partial

Circularity Check

0 steps flagged

Experimental demonstrations of theorem discovery and transfer; no definitional or self-referential reduction in core claims

full rationale

The paper describes an iterative algorithm alternating proof search with extraction of 'useful' theorems (reused as lemmas), validated empirically by solving human-written benchmark problems and improving LLM performance when the discovered theorems are supplied as prompts. No equations, fitted parameters, or self-citations are presented that would make the reported success metrics equivalent to the extraction criterion by construction. The usefulness labeling is an explicit component of the proposed method rather than a hidden tautology, and the generalization claims rest on held-out experimental outcomes rather than re-labeling of the same quantities used for selection. This is consistent with a score of 2 for a minor self-citation possibility that is not load-bearing on the central experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; concrete free parameters, axioms, and entities cannot be enumerated without the methods section. The domain assumption that a formal system with inference rules exists and admits automated proof search is implicit.

axioms (1)
  • domain assumption A formal axiomatic system exists with inference rules that permit automated proof search
    Stated directly in the abstract as the starting point for the agent.

pith-pipeline@v0.9.1-grok · 5749 in / 1310 out tokens · 27950 ms · 2026-06-30T09:45:57.563725+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  2. [2]

    T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

  3. [3]

    M. J. Kearns , title =

  4. [4]

    Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

  5. [5]

    R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

  6. [6]

    Suppressed for Anonymity , author=

  7. [7]

    Newell and P

    A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

  8. [8]

    A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

  9. [9]

    , title =

    Post, Emil L. , title =. American Journal of Mathematics , volume =

  10. [10]

    2015 , isbn =

    Mendelson, Elliott , title =. 2015 , isbn =

  11. [11]

    1950 , note =

    Hilbert, David and Ackermann, Wilhelm , title =. 1950 , note =

  12. [12]

    1996 , note =

    Church, Alonzo , title =. 1996 , note =

  13. [13]

    , title =

    Cook, Stephen A. , title =. Proceedings of the Third Annual ACM Symposium on Theory of Computing , pages =

  14. [14]

    and Reckhow, Robert A

    Cook, Stephen A. and Reckhow, Robert A. , title =. Journal of Symbolic Logic , volume =

  15. [15]

    Transactions of the American Mathematical Society , volume =

    Hindley, Roger , title =. Transactions of the American Mathematical Society , volume =

  16. [16]

    Journal of Computer and System Sciences , volume =

    Milner, Robin , title =. Journal of Computer and System Sciences , volume =

  17. [17]

    Principal Type-Schemes for Functional Programs , booktitle =

    Damas, Lu. Principal Type-Schemes for Functional Programs , booktitle =

  18. [18]

    , title =

    Howard, William A. , title =. To H. B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism , editor =

  19. [19]

    Lectures on the Curry--Howard Isomorphism , series =

    S. Lectures on the Curry--Howard Isomorphism , series =. 2006 , isbn =

  20. [20]

    and Polu, Stanislas , title =

    Han, Jesse Michael and Rute, Jason and Wu, Yuhuai and Ayers, Edward W. and Polu, Stanislas , title =. International Conference on Learning Representations , year =

  21. [21]

    International Conference on Learning Representations , year =

    Polu, Stanislas and Han, Jesse Michael and Zheng, Kunhao and Baksys, Mantas and Babuschkin, Igor and Sutskever, Ilya , title =. International Conference on Learning Representations , year =

  22. [22]

    and Li, Wenda and Rabe, Markus N

    Wu, Yuhuai and Jiang, Albert Q. and Li, Wenda and Rabe, Markus N. and Staats, Charles and Jamnik, Mateja and Szegedy, Christian , title =. Advances in Neural Information Processing Systems , volume =

  23. [23]

    and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan and Anandkumar, Anima , title =

    Yang, Kaiyu and Swope, Aidan M. and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan and Anandkumar, Anima , title =. Advances in Neural Information Processing Systems , volume =

  24. [24]

    Reinforcement Learning of Theorem Proving , booktitle =

    Kaliszyk, Cezary and Urban, Josef and Michalewski, Henryk and Ol. Reinforcement Learning of Theorem Proving , booktitle =

  25. [25]

    Advances in Neural Information Processing Systems , volume =

    Wu, Minchao and Norrish, Michael and Walder, Christian and Dezfouli, Amir , title =. Advances in Neural Information Processing Systems , volume =

  26. [26]

    Advances in Neural Information Processing Systems , volume =

    Lample, Guillaume and Lachaux, Marie-Anne and Lavril, Thibaut and Martinet, Xavier and Hayat, Amaury and Ebner, Gabriel and Rodriguez, Aur. Advances in Neural Information Processing Systems , volume =

  27. [27]

    International Conference on Learning Representations , year =

    Ghosh, Dibya and Gupta, Abhishek and Reddy, Ashwin and Fu, Justin and Devin, Coline and Eysenbach, Benjamin and Levine, Sergey , title =. International Conference on Learning Representations , year =

  28. [28]

    Advances in Neural Information Processing Systems , volume =

    Andrychowicz, Marcin and Wolski, Filip and Ray, Alex and Schneider, Jonas and Fong, Rachel and Welinder, Peter and McGrew, Bob and Tobin, Josh and Abbeel, Pieter and Zaremba, Wojciech , title =. Advances in Neural Information Processing Systems , volume =

  29. [29]

    Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs , booktitle =

    Jiang, Albert Qiaochu and Welleck, Sean and Zhou, Jin Peng and Lacroix, Timoth. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs , booktitle =

  30. [30]

    Learning to Find Proofs and Theorems by Learning to Refine Search Strategies: The Case of Loop Invariant Synthesis , booktitle =

    Laurent, Jonathan and Platzer, Andr. Learning to Find Proofs and Theorems by Learning to Refine Search Strategies: The Case of Loop Invariant Synthesis , booktitle =

  31. [31]

    and Deng, Jia and Biderman, Stella and Welleck, Sean , title =

    Azerbayev, Zhangir and Schoelkopf, Hailey and Paster, Keiran and Dos Santos, Marco and McAleer, Stephen and Jiang, Albert Q. and Deng, Jia and Biderman, Stella and Welleck, Sean , title =. International Conference on Learning Representations , year =

  32. [32]

    2023 , note =

    Introducing. 2023 , note =

  33. [33]
  34. [34]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    Yang, An and Zhang, Beichen and Hui, Binyuan and others , title =. 2024 , howpublished =. 2409.12122 , archivePrefix =

  35. [35]

    Qwen3 Technical Report

    Yang, An and Li, Anfeng and Yang, Baosong and others , title =. 2025 , howpublished =. 2505.09388 , archivePrefix =

  36. [36]

    Nature , volume =

    Guo, Daya and Yang, Dejian and Zhang, Haowei and others , title =. Nature , volume =. 2025 , doi =

  37. [37]

    2025 , note =

    Luong, Thang and Lockhart, Edward , title =. 2025 , note =

  38. [38]

    Nature , volume =

    Farquhar, Sebastian and Kossen, Jannik and Kuhn, Lorenz and Gal, Yarin , title =. Nature , volume =. 2024 , doi =

  39. [39]

    and Zhang, Edwin , title =

    Kalai, Adam Tauman and Nachum, Ofir and Vempala, Santosh S. and Zhang, Edwin , title =

  40. [40]

    International Conference on Learning Representations , year =

    Wang, Haiming and Xin, Huajian and Zheng, Chuanyang and Liu, Zhengying and Cao, Qingxing and Huang, Yinya and Xiong, Jing and Shi, Han and Xie, Enze and Yin, Jian and Li, Zhenguo and Liang, Xiaodan , title =. International Conference on Learning Representations , year =

  41. [41]

    DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,

    Xin, Huajian and Guo, Daya and Shao, Zhihong and Ren, Zhizhou and Zhu, Qihao and Liu, Bo and Ruan, Chong and Li, Wenda and Liang, Xiaodan , title =. 2024 , howpublished =. 2405.14333 , archivePrefix =

  42. [42]

    2025 , howpublished =

    Zimmer, Matthieu and Ji, Xiaotong and Tutunov, Rasul and Bordg, Anthony and Wang, Jun and Bou Ammar, Haitham , title =. 2025 , howpublished =. 2507.02726 , archivePrefix =

  43. [43]

    Hubert, Thomas and Mehta, Rishi and Sartran, Laurent and Horvath, Miklos Z. and Zuzic, Goran and Wieser, Eric and Huang, Aja and Schrittwieser, Julian and Schroecker, Yannick and Masoom, Hussain and Bertolli, Ottavia and Zahavy, Tom and Mandhane, Amol and Yung, Jessica and Beloshapka, Iuliya and Ibarz, Borja and Veeriah, Vivek and Yu, Lei and Nash, Oliver...

  44. [44]

    and Wu, Yuhuai and Le, Quoc V

    Trinh, Trieu H. and Wu, Yuhuai and Le, Quoc V. and He, He and Luong, Thang , title =. Nature , volume =. 2024 , doi =

  45. [45]

    Nature , volume =

    Romera-Paredes, Bernardino and Barekatain, Mohammadamin and Novikov, Alexander and others , title =. Nature , volume =. 2024 , doi =

  46. [46]

    , title =

    Poesia, Gabriel and Broman, David and Haber, Nick and Goodman, Noah D. , title =. Advances in Neural Information Processing Systems , volume =

  47. [47]

    Proceedings of the 42nd International Conference on Machine Learning , series =

    Dong, Kefan and Ma, Tengyu , title =. Proceedings of the 42nd International Conference on Machine Learning , series =

  48. [48]

    The 5th Workshop on Mathematical Reasoning and AI (MATH-AI) at NeurIPS , year =

    Kasriel, Timothe and Lu, Thomas and Ding, Qinghua and He, Jingxuan and Song, Dawn , title =. The 5th Workshop on Mathematical Reasoning and AI (MATH-AI) at NeurIPS , year =

  49. [49]

    Kleene, Stephen Cole , title =

  50. [50]

    Advances in Neural Information Processing Systems , volume =

    Zhao, Andrew and Wu, Yiran and Yue, Yang and Wu, Tong and Xu, Quentin and Yue, Yang and Lin, Matthieu and Wang, Shenzhi and Wu, Qingyun and Zheng, Zilong and Huang, Gao , title =. Advances in Neural Information Processing Systems , volume =