pith. sign in

arxiv: 2605.22763 · v1 · pith:W6IWLF6Enew · submitted 2026-05-21 · 💻 cs.AI

Advancing Mathematics Research with AI-Driven Formal Proof Search

Pith reviewed 2026-05-22 05:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI for mathematicsformal proofsLean theorem proverErdős problemsOEIS conjecturesautomated theorem provingproof searchmathematical reasoning
0
0 comments X

The pith

An AI agent using large language models generates and verifies formal proofs in Lean to solve open Erdős problems and OEIS conjectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can help solve genuinely open mathematical questions by writing formal proofs that a computer can check in the Lean theorem prover. Their strongest agent found proofs for 9 out of 353 listed Erdős problems at a few hundred dollars each and settled 44 out of 492 OEIS conjectures. A simpler agent that keeps generating candidate proofs with the language model and checking them in Lean also succeeded on the Erdős set, though it cost more on the hardest cases. The same approach is already being tried inside active research programs in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. Readers should care because the method supplies an independent, machine-checkable record instead of relying only on human judgment for difficult claims.

Core claim

The central claim is that an AI agent can autonomously resolve open problems by producing Lean-verifiable formal proofs, as shown by the resolution of 9 Erdős problems and 44 OEIS conjectures, with the technique now applied across multiple branches of mathematics.

What carries the argument

An agent that alternates large-language-model generation of candidate Lean proofs with automated verification inside the Lean kernel.

If this is right

  • Formal verification can be inserted into the workflow of active mathematical research at modest per-problem cost.
  • The same agent design works across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics.
  • A basic alternating generation-and-verification loop already reproduces the Erdős successes, though harder problems raise its cost.
  • Large-scale testing on hundreds of open statements is now feasible with current language-model capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the cost stays low, smaller research groups could run their own formal-proof searches on local conjectures without large budgets.
  • The approach could be tried in other formal systems besides Lean to broaden the set of checkable problems.
  • Hybrid loops that let human mathematicians edit the agent's intermediate steps might solve still harder open questions.
  • Success on Erdős and OEIS lists suggests the method could be aimed next at problems that have resisted both human and computer attack for decades.

Load-bearing premise

The targeted Erdős problems and OEIS conjectures were still open before this work, and the Lean outputs are complete formal proofs produced without essential human intervention in the derivation steps.

What would settle it

An independent Lean expert confirming that one of the reported proofs is incomplete or that any of the nine Erdős problems had already been solved before the agent ran would disprove the claim of autonomous resolution.

Figures

Figures reproduced from arXiv: 2605.22763 by Adam Zsolt Wagner, Aja Huang, Andrew Ferrauiolo, Anja Surina, Anton Kovsharov, Arun Suggala, Codrut Grosu, Eric Wieser, Francisco J. R. Ruiz, George Tsoukalas, Gergely B\'erczi, Henryk Michalewski, Lei Yu, Matej Balog, Mikl\'os Z. Horv\'ath, Moritz Firsching, Pushmeet Kohli, Sergey Shirobokov, Swarat Chaudhuri, Thomas Hubert.

Figure 1
Figure 1. Figure 1: Pseudocode for ’s main components. The main loop creates a pool of asynchronous sketchers and raters and awaits the creation of a full (sorry-free) proof. Each prover subagent samples a parent sketch from the population using the P-UCB strategy and creates a stateful conversation session with an LLM (Gemini 3.1 Pro) instance. In this conversation, it receives instructions to perform tool calls from the LLM… view at source ↗
Figure 4
Figure 4. Figure 4: Sketcher agent prompt (condensed). Elided text is represented by [...]. Text in braces denotes template variables populated at runtime. For example, {code} is replaced by the current Lean file. 5 [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Rater agent prompt (condensed). Elided text is represented by [...]. Text in braces denotes template variables populated at runtime. For example, {player blocks} is replaced by the sketches to be compared. 4 [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Rater agent prompt (condensed). Elided text is represented by [...]. Text in braces denotes template variables populated at runtime. For example, {player blocks} is replaced by the sketches to be compared. 3 [PITH_FULL_IMAGE:figures/full_fig_p025_2.png] view at source ↗
read the original abstract

Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erd\H{o}s problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erd\H{o}s successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports the first large-scale evaluation of LLM-based agents for generating formal proofs in Lean to solve open mathematical problems. The central claims are that the most capable agent autonomously resolved 9 of 353 open Erdős problems at a per-problem cost of a few hundred dollars, proved 44 of 492 OEIS conjectures, and is now deployed across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics; a simpler alternating generation-verification agent is shown to replicate some successes but at higher cost on hard instances.

Significance. If the reported successes are independently confirmed, the work would demonstrate that current LLM-driven formal proof search can address genuinely open problems at modest cost, providing a concrete benchmark for AI-assisted mathematics. The use of Lean verification supplies a high standard of correctness that distinguishes this from informal LLM reasoning claims, and the multi-domain deployment suggests immediate research utility beyond the Erdős and OEIS benchmarks.

major comments (2)
  1. [Abstract and results section] Abstract and results section: the headline claim that the agent 'autonomously resolved 9 of 353 open Erdős problems' is load-bearing for the autonomy and cost assertions, yet the manuscript supplies neither per-problem literature-search logs confirming the problems were previously unsolved nor proof-size statistics or artifact links demonstrating that the Lean outputs are complete, hole-free formal proofs rather than partial derivations requiring human post-editing. Without these, the success counts cannot be fully assessed.
  2. [Methods and evaluation sections] Methods and evaluation sections: the reported success counts (9/353 Erdős, 44/492 OEIS) are presented without error bars, failure-mode breakdowns, or independent audit of the Lean artifacts. This omission directly affects the reliability of the comparative claim that the basic agent 'replicated the Erdős successes but proved costlier on the hardest problems.'
minor comments (2)
  1. [§2] The description of agent architectures would benefit from a clearer tabular comparison of the 'most capable' versus 'basic' designs, including exact prompting strategies and Lean interaction loops.
  2. [Figures and tables] Figure captions and table legends should explicitly state whether the reported costs include only inference or also human oversight time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the presentation of our results on LLM-based formal proof search. We address each major comment below and describe the revisions we will incorporate to improve transparency and verifiability.

read point-by-point responses
  1. Referee: [Abstract and results section] Abstract and results section: the headline claim that the agent 'autonomously resolved 9 of 353 open Erdős problems' is load-bearing for the autonomy and cost assertions, yet the manuscript supplies neither per-problem literature-search logs confirming the problems were previously unsolved nor proof-size statistics or artifact links demonstrating that the Lean outputs are complete, hole-free formal proofs rather than partial derivations requiring human post-editing. Without these, the success counts cannot be fully assessed.

    Authors: We agree that explicit documentation strengthens the autonomy and correctness claims. In the revised manuscript we will add a dedicated subsection (and supplementary table) that lists, for each of the nine solved Erdős problems, the pre-experiment literature references establishing that the problem remained open, together with links to the corresponding Lean proof files in a public repository. We will also report proof-size statistics (number of tactics, lines of code, and kernel-verification time) for every success. All reported proofs were produced and checked end-to-end by the agent with no human post-editing; the added artifacts will make this verifiable. revision: yes

  2. Referee: [Methods and evaluation sections] Methods and evaluation sections: the reported success counts (9/353 Erdős, 44/492 OEIS) are presented without error bars, failure-mode breakdowns, or independent audit of the Lean artifacts. This omission directly affects the reliability of the comparative claim that the basic agent 'replicated the Erdős successes but proved costlier on the hardest problems.'

    Authors: We concur that quantitative uncertainty measures and failure analysis improve interpretability. The revision will include binomial 95 % confidence intervals for both success rates and a breakdown of the dominant failure modes (timeout, tactic failure, search exhaustion) across the full set of attempts. We will also expand the agent-comparison section with per-problem cost figures for the hardest instances to support the cost-effectiveness claim. All Lean artifacts and run logs will be released publicly to enable independent audit; performing a third-party audit ourselves lies outside the scope of the present study. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are empirical evaluations on external benchmarks

full rationale

The paper reports counts of resolved open Erdős problems and OEIS conjectures obtained by running LLM-based agents and verifying outputs in Lean. These success metrics are generated by direct application to independently defined external problems rather than by fitting parameters to subsets of the target data or by self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that reduce the headline claims to the inputs by construction. The methodology relies on external verification and open problem lists, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that Lean correctly certifies the generated proofs and that the chosen problem sets were open before the experiments; no new mathematical axioms or entities are introduced.

axioms (1)
  • domain assumption The Lean theorem prover provides sound verification of formal proofs
    All reported successes depend on Lean accepting the generated statements as valid proofs.

pith-pipeline@v0.9.0 · 5761 in / 1216 out tokens · 42052 ms · 2026-05-22T05:05:27.994522+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 13 internal anchors

  1. [1]

    Aristotle: IMO-level Automated Theorem Proving

    TudorAchim, AlexBest, AlbertoBietti, KevinDer, MathïsFédérico, SergeiGukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, et al. Aristotle: IMO-level automated theorem proving.arXiv preprint arXiv:2510.01346, 2025

  2. [2]

    Primitive sets and von mangoldt chains: Erdős problem #1196 and beyond, 2026

    BorisAlexeev,KevinBarreto,YanyangLi,JaredDukerLichtman,LiamPrice,JibranIqbal Shah, Quanyu Tang, and Terence Tao. Primitive sets and von mangoldt chains: Erdős problem #1196 and beyond, 2026

  3. [3]

    Short proofs in combinatorics and number theory.arXiv preprint arXiv:2603.29961, 2026

    Boris Alexeev, Moe Putterman, Mehtaab Sawhney, Mark Sellke, and Gregory Valiant. Short proofs in combinatorics and number theory.arXiv preprint arXiv:2603.29961, 2026

  4. [4]

    Short proofs in combinatorics, probability and number theory II

    Boris Alexeev, Moe Putterman, Mehtaab Sawhney, Mark Sellke, and Gregory Valiant. Short proofs in combinatorics, probability and number theory ii.arXiv preprint arXiv:2604.06609, 2026

  5. [5]

    Pantograph: A machine-to-machine interaction interface for advanced theorem proving, high level reasoning, and data extraction in lean 4, 2025

    LeniAniva,ChuyueSun,BrandoMiranda,ClarkBarrett,andSanmiKoyejo. Pantograph: A machine-to-machine interaction interface for advanced theorem proving, high level reasoning, and data extraction in lean 4, 2025

  6. [6]

    AXLE: Axiom lean engine, 2025

    Axiom. AXLE: Axiom lean engine, 2025. Accessed: 2025

  7. [7]

    A note on p-sets

    Stephan Baier. A note on p-sets. 2004

  8. [8]

    Erdosproblems.com

    Thomas Bloom. Erdosproblems.com. https://www.erdosproblems.com, 2026

  9. [9]

    Migliore, Rosa M

    Mats Boij, Juan C. Migliore, Rosa M. Miró-Roig, Uwe Nagel, and Fabrizio Zanello.On the Shape of a Pure O-Sequence, volume 218 ofMemoirs of the American Mathematical Society. American Mathematical Society, Providence, RI, 2012

  10. [10]

    J. A. Bondy and R. L. Hemminger. Graph reconstruction–a survey.Journal of Graph Theory, 1(3):227–268, 1977

  11. [11]

    The motivic class of the space of genus0maps to the flag variety.arXiv preprint arXiv:2601.07222, 2026

    Jim Bryan, Balázs Elek, Freddie Manners, George Salafatinos, and Ravi Vakil. The motivic class of the space of genus0maps to the flag variety.arXiv preprint arXiv:2601.07222, 2026

  12. [12]

    S. A. Burr, P. Erdős, R. L. Graham, and W. Wen-Ching Li. Complete sequences of sets of integer powers.Acta Arithmetica, 77(2):133–138, 1996

  13. [13]

    Last-Iterate Convergence of Anchored Gradient Descent

    Yang Cai and Weiqiang Zheng. Last-iterate convergence of anchored gradient descent. arXiv preprint arXiv:2604.12235, 2026. 11 Advancing Mathematics Research with AI-Driven Formal Proof Search

  14. [14]

    Efficient Bayesian inference for generalized Bradley–Terry models.Journal of Computational and Graphical Statistics, 21(1):174– 196, 2012

    François Caron and Arnaud Doucet. Efficient Bayesian inference for generalized Bradley–Terry models.Journal of Computational and Graphical Statistics, 21(1):174– 196, 2012

  15. [15]

    Seed-prover 1.5: Mastering undergraduate-level theorem proving via learning from experience.arXiv preprint arXiv:2512.17260, 2025

    Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, et al. Seed-prover 1.5: Mastering undergraduate-level theorem proving via learning from experience.arXiv preprint arXiv:2512.17260, 2025

  16. [16]

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

  17. [17]

    On Erd\H{o}s and S\'ark\"ozy's sequences with Property P

    Christian Elsholtz and Stefan Planitzer. On erdős and sárközy’s sequences with property p.arXiv preprint arXiv:1609.07935, 2016

  18. [18]

    On the divisibility properties of sequences of integers

    Paul Erdős and Alice Sárközi. On the divisibility properties of sequences of integers. Proceedings of The London Mathematical Society, pages 97–101, 1970

  19. [19]

    On conjectures of graffiti

    Siemion Fajtlowicz. On conjectures of graffiti. In J. Akiyama, Y. Egawa, and H. Enomoto, editors,Graph Theory and Applications, volume 38 ofAnnals of Discrete Mathematics, pages 113–118. Elsevier, 1988

  20. [20]

    Aletheia tackles firstproof autonomously.arXiv preprint arXiv:2602.21201, 2026

    Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang- Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, et al. Aletheia tackles firstproof autonomously.arXiv preprint arXiv:2602.21201, 2026

  21. [21]

    Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung,JoonkyungLee,CarloPagano,SanghyunKim,FedericoPasqualotto,SergeiGukov, Jonathan N

    Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung,JoonkyungLee,CarloPagano,SanghyunKim,FedericoPasqualotto,SergeiGukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Ko...

  22. [22]

    A strict separation between two notions of quadratically structured functions

    Moritz Firsching and Bogdan Georgiev. A strict separation between two notions of quadratically structured functions. In preparation, 2026

  23. [23]

    Horváth, Yaël Dillies, Calle Sönne, Eric Wieser, Fred Zhang, Thomas Hubert, Blaise Agüera y Arcas, and Pushmeet Kohli

    Moritz Firsching, Paul Lezeau, Salvatore Mercuri, Miklós Z. Horváth, Yaël Dillies, Calle Sönne, Eric Wieser, Fred Zhang, Thomas Hubert, Blaise Agüera y Arcas, and Pushmeet Kohli. Formal conjectures: An open and evolving benchmark for verified discovery in mathematics, 2026

  24. [24]

    Safeverify.https://github.com/GasStationManager/Saf eVerify, 2025

    GasStationManager. Safeverify.https://github.com/GasStationManager/Saf eVerify, 2025. [Accessed: 2026-05-12]

  25. [25]

    Mathematical exploration and discovery at scale

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathe- matical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025

  26. [26]

    Gemini 3.1 deep think, 2026

    Google DeepMind. Gemini 3.1 deep think, 2026. Accessed: 2026-04-30

  27. [27]

    100 open problems

    Ben Green. 100 open problems. https://people.maths.ox.ac.uk/greenbj/papers/open- problems.pdf, 2024. 12 Advancing Mathematics Research with AI-Driven Formal Proof Search

  28. [28]

    A Milestone in Formalization: The Sphere Packing Problem in Dimension 8

    Sidharth Hariharan, Christopher Birkbeck, Seewoo Lee, Ho Kiu Gareth Ma, Bhavik Mehta, Auguste Poiroux, and Maryna Viazovska. A milestone in formalization: The sphere packing problem in dimension 8.arXiv preprint arXiv:2604.23468, 2026

  29. [29]

    Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, pages 1–3, 2025

    Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, Paul Lezeau, Salvatore Mercuri, Calle Sönne, Bhavik Mehta, Alex Davies, ...

  30. [30]

    Learning and planning in complex action spaces

    Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, and David Silver. Learning and planning in complex action spaces. In International Conference on Machine Learning, 2021

  31. [31]

    software engineer

    Geoffrey Huntley. Ralph wiggum as a "software engineer".https://ghuntley.com /ralph, 2025. Blog post

  32. [32]

    Equality in Fill's spectral gap problem

    Vishesh Jain and Clayton Mizgerd. Equality in fill’s spectral gap problem.arXiv preprint arXiv:2604.03937, 2026

  33. [33]

    Point convergence of nesterov’s accelerated gradient method: An ai-assisted proof.arXiv preprint arXiv:2510.23513, 2025

    Uijeong Jang and Ernest K Ryu. Point convergence of nesterov’s accelerated gradient method: An ai-assisted proof.arXiv preprint arXiv:2510.23513, 2025

  34. [34]

    Draft, sketch, and prove: Guiding formal theorem provers with informal proofs.arXiv preprint arXiv:2210.12283, 2022

    Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs.arXiv preprint arXiv:2210.12283, 2022

  35. [35]

    Thor: Wielding hammers to integrate language models and automated theorem provers.Advances in Neural Information Processing Systems, 35:8360–8373, 2022

    Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language models and automated theorem provers.Advances in Neural Information Processing Systems, 35:8360–8373, 2022

  36. [36]

    Paul J. Kelly. A congruence theorem for trees.Pacific Journal of Mathematics, 7(1):961– 968, 1957

  37. [37]

    John R. Koza. Genetic programming as a means for programming computers by natural selection.Statistics and Computing, 4(2):87–112, 1994

  38. [38]

    A Tensor-Algebraic No-Go Theorem for High-Dimensional Photonic GHZ States

    Mario Krenn, Moritz Firsching, George Tsoukalas, Rishikesh Gajjala, Xuemei Gu, and Swarat Chaudhuri. A Tensor-Algebraic No-Go Theorem for High-Dimensional Photonic GHZ States. In preparation, 2026

  39. [39]

    Quantum experiments and graphs: Multiparty states as coherent superpositions of perfect matchings.Physical Review Letters, 119(24), December 2017

    Mario Krenn, Xuemei Gu, and Anton Zeilinger. Quantum experiments and graphs: Multiparty states as coherent superpositions of perfect matchings.Physical Review Letters, 119(24), December 2017. 13 Advancing Mathematics Research with AI-Driven Formal Proof Search

  40. [40]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

  41. [41]

    Luce.Individual Choice Behavior: A Theoretical Analysis

    R.D. Luce.Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959

  42. [42]

    Gauss: An agent for autoformalization, 2026

    Math Inc. Gauss: An agent for autoformalization, 2026. Accessed: 2026

  43. [43]

    Thelean4theoremproverandprogramming language

    LeonardodeMouraandSebastianUllrich. Thelean4theoremproverandprogramming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

  44. [44]

    Reinforced generation of combinatorial structures: Hardness of approximation.arXiv preprint arXiv:2509.18057, 2025

    Ansh Nagda, Prabhakar Raghavan, and Abhradeep Thakurta. Reinforced generation of combinatorial structures: Hardness of approximation.arXiv preprint arXiv:2509.18057, 2025

  45. [45]

    Reinforced Generation of Combinatorial Structures: Ramsey Numbers

    Ansh Nagda, Prabhakar Raghavan, and Abhradeep Thakurta. Reinforced generation of combinatorial structures: Ramsey numbers.arXiv preprint arXiv:2603.09172, 2026

  46. [46]

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

  47. [47]

    R. L. Plackett. The analysis of permutations.Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2):193–202, 1975

  48. [48]

    Generative language modeling for automated theorem proving

    StanislasPoluandIlyaSutskever. Generativelanguagemodelingforautomatedtheorem proving.arXiv preprint arXiv:2009.03393, 2020

  49. [49]

    On infinite sets with no3on a line.arXiv preprint arXiv:2602.21275, 2026

    Moe Putterman, Mehtaab Sawhney, and Gregory Valiant. On infinite sets with no3on a line.arXiv preprint arXiv:2602.21275, 2026

  50. [50]

    Colouring versus density in integers and hales–jewett cubes.Journal of the London Mathematical Society, 110(5):e12987, 2024

    Christian Reiher, Vojtěch Rödl, and Marcelo Sales. Colouring versus density in integers and hales–jewett cubes.Journal of the London Mathematical Society, 110(5):e12987, 2024

  51. [51]

    Pawan Kumar, Emilien Dupont, Francisco J

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nature, 625(7995):468– 475, 2024

  52. [52]

    Ryu, Kun Yuan, and Wotao Yin

    Ernest K. Ryu, Kun Yuan, and Wotao Yin. Ode analysis of stochastic gradient methods with optimism and anchoring for minimax problems.arXiv preprint arXiv:1905.10899, 2019. 14 Advancing Mathematics Research with AI-Driven Formal Proof Search

  53. [53]

    Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with ai.arXiv preprint arXiv:2512.14575, 2025

    Johannes Schmitt. Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with ai.arXiv preprint arXiv:2512.14575, 2025

  54. [54]

    On a problem of erdős and sárközy.Journal of Combinatorial Theory, Series A, 94(1):191–195, 2001

    Tomasz Schoen. On a problem of erdős and sárközy.Journal of Combinatorial Theory, Series A, 94(1):191–195, 2001

  55. [55]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play, 2018

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play, 2018

  56. [56]

    Neil J. Sloane. The on-line encyclopedia of integer sequences. InProceedings of the 14th Symposium on Towards Mechanized Mathematical Assistants: 6th International Conference, Calculemus ’07 / MKM ’07, page 130, Berlin, Heidelberg, 2007. Springer- Verlag

  57. [57]

    Richard P. Stanley. Hilbert functions of graded algebras.Advances in Mathematics, 28(1):57–83, 1978

  58. [58]

    An Improved Last-Iterate Convergence Rate for Anchored Gradient Descent Ascent

    Anja Surina, Arun Suggala, George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Francisco JR Ruiz, Pushmeet Kohli, and Swarat Chaudhuri. An improved last-iterate convergencerateforanchoredgradientdescentascent.arXivpreprintarXiv:2604.03782, 2026

  59. [59]

    Ai contributions to Erdős problems.https://github .com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-p roblems, 2026

    Terence Tao and contributors. Ai contributions to Erdős problems.https://github .com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-p roblems, 2026. Accessed: 2026-04-23

  60. [60]

    The Lean Mathematical Library

    The Mathlib Community. The Lean Mathematical Library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, New Orleans, LA, USA, 2020. ACM

  61. [61]

    Stanislaw M. Ulam. A collection of mathematical problems.New York and London: Interscience Publishers, 1960

  62. [62]

    Ulam.A Collection of Mathematical Problems, volume 8 ofInterscience Tracts in Pure and Applied Mathematics

    Stanislaw M. Ulam.A Collection of Mathematical Problems, volume 8 ofInterscience Tracts in Pure and Applied Mathematics. Interscience Publishers, New York, 1960

  63. [63]

    Hilbert: Recursively building formal proofs with informal reasoning.arXiv preprint arXiv:2509.22819, 2025

    Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. Hilbert: Recursively building formal proofs with informal reasoning.arXiv preprint arXiv:2509.22819, 2025

  64. [64]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, et al. Kimina-prover preview: Towards large formal reasoning models with reinforcement learning.arXiv preprint arXiv:2504.11354, 2025

  65. [65]

    15 Advancing Mathematics Research with AI-Driven Formal Proof Search Accelerating scientific research with gemini: Case studies and common techniques

    David P Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, Moham- madHossein Bateni, Simina Branzei, Michael P Brenner, Lin Chen, Ying Feng, et al. 15 Advancing Mathematics Research with AI-Driven Formal Proof Search Accelerating scientific research with gemini: Case studies and common techniques. arXiv preprint arXiv:2602.03837, 2026

  66. [66]

    Formal mathematical reasoning: A new frontier in ai.arXiv preprint 2412.16075, 2024

    Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, and Dawn Song. Formal mathematical reasoning: A new frontier in ai.arXiv preprint 2412.16075, 2024

  67. [67]

    Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

    Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

  68. [68]

    Log-concavity of level Hilbert functions and pure o-sequences.Journal of Commutative Algebra, 16(2):245–256, 2024

    Fabrizio Zanello. Log-concavity of level Hilbert functions and pure o-sequences.Journal of Commutative Algebra, 16(2):245–256, 2024

  69. [69]

    Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev, Tatiana Schmidt, Andrew Cowie, Fernanda Viegas, Dimitri Kanevsky, Vineet Kahlon, Hartmut Maennel, Sophia Alj, George Holland, Alex Davies, and Pushmeet Kohli. AI Co-Mathematician: Accelerating mathematicians with agentic AI.arXiv...

  70. [70]

    Population Database and Matchmaking

    Database sampling:The controller selects a root proof sketch𝑆root by sampling from the database, along with𝑀= 2auxiliary inspiration sketches 𝑆insp. The selection strat- egy balances exploitation of high-rated sketches with exploration of diverse candidates (see “Population Database and Matchmaking”)

  71. [71]

    decompose unsolved goals,

    Prompt construction:A prompt X is assembled to guide the LLM. It integrates the formal problem specification, the Lean source code and natural language plan of𝑆root, and structured feedback derived from AlphaProof’s previous attempts on{𝑆insp}. As in AlphaEvolve, the controller encourages diversity by stochastically injecting instructions such as “decompo...

  72. [72]

    To scale to large Lean files, the subagent outputs mutations via asearch_replace tool in a compact diff format rather than rewriting the entire file

    Prover subagent:The assembled prompt X is dispatched to the LLM (Gemini 3.1 Pro), initiating a multi-turn episode. To scale to large Lean files, the subagent outputs mutations via asearch_replace tool in a compact diff format rather than rewriting the entire file. The subagent can also query AlphaProof to test specific subgoals mid- episode; the feedback ...

  73. [73]

    Global GoalCaching

    Validation:Once the candidate sketch𝑆′ passes the sandbox check, it undergoes formal validation. The system extracts all remainingsorry subgoals and cross-references them against a global goal cache using a deep hash of their exact Lean state (see “Global GoalCaching”). Ifa subgoal waspreviously resolved, theproof isretrieved immediately; otherwise, it is...

  74. [74]

    matches

    Database registration:The candidate 𝑆′, along with per-subgoal feedback from Al- phaProof, is registered in the database. Its fitness is then determined asynchronously via Elo matchmaking. AlphaProof has a Test-Time Reinforcement Learning (TTRL) mode in which it learns to solve a problem by solving its AI-generated variants at inference time; however, we ...

  75. [75]

    4𝑁1(𝑋) +𝑁 3(𝑋) ≤ 3|𝑋| + 2𝑁2(𝑋), which follows from writing𝑁𝑘(𝑥)= Í 𝑥∈ℤ 1𝑋 (𝑥) 1𝑋 (𝑥+ 𝑘) and a pointwise check of the indicator function1𝑋 across all4-point windows (𝑥, 𝑥+ 1, 𝑥+ 2, 𝑥+ 3); that is, summing the local inequality1𝑋 (𝑥) + 1𝑋 (𝑥+ 1) + 1𝑋 (𝑥+

  76. [76]

    + 1𝑋 (𝑥) 1𝑋 (𝑥+ 2) + 1𝑋 (𝑥+ 1)1𝑋 (𝑥+ 3) ≥ 1𝑋 (𝑥) 1𝑋 (𝑥+ 1) + 21𝑋 (𝑥+ 1)1𝑋 (𝑥+ 2) + 1𝑋 (𝑥+2)1 𝑋 (𝑥+3) +1 𝑋 (𝑥)1 𝑋 (𝑥+3)and then summing over all𝑥∈ℤ

  77. [77]

    Consider each pair(𝑥, 𝑥+ 2) ∈𝑋 2 and we proceed by cases on𝑥+ 1

    2𝑁2(𝑋) ≤𝑁 3(𝑋) + 2𝑉2(𝑋) + 2𝐼(𝑋) . Consider each pair(𝑥, 𝑥+ 2) ∈𝑋 2 and we proceed by cases on𝑥+ 1. Let 𝐺={𝑥∈𝑋|𝑥+ 1 ∉𝑋∧𝑥+ 2 ∈𝑋} . If 𝑥+ 1 ∈𝑋 , then the pair is counted exactly by𝑉2, so𝑁 2(𝑋)=𝑉 2(𝑋) + |𝐺|. If𝑥+1∉𝑋, then •If𝑥−1∈𝑋, then(𝑥−1, 𝑥+2)is a pair counted in𝑁 3(𝑋). •If𝑥+3∈𝑋, then(𝑥, 𝑥+3)is a pair counted in𝑁 3(𝑋). 41 Advancing Mathematics Research wit...

  78. [78]

    For any 𝛿∉{ 0,−𝑘} , the Sidon property ensures that there is at most one pair(𝑎, 𝑐) with 𝑎−𝑐=𝛿 and at most one pair(𝑑, 𝑏) with 𝑑−𝑏=𝛿+𝑘

    The cases𝛿= 0and 𝛿=−𝑘 each contribute≤ |𝐴| (one coordinate determines the rest). For any 𝛿∉{ 0,−𝑘} , the Sidon property ensures that there is at most one pair(𝑎, 𝑐) with 𝑎−𝑐=𝛿 and at most one pair(𝑑, 𝑏) with 𝑑−𝑏=𝛿+𝑘 . Consequently, each gap of size𝑘in𝐷identifies exactly one quadruple. Thus|quad 𝑘| ≤𝑁 𝑘(𝐷) +2𝑛

  79. [79]

    Combining these bounds shows that𝑁𝑘(𝐷) and 𝑁𝑘(𝑆) are related up to𝑂(𝑛) error

    Each "good" element 𝑠∈𝑆 with 𝑠+𝑘∈𝑆 (excluding ≤ 2𝑛 doubles of the form2𝑎) produces≥4quadruples, giving4𝑁 𝑘(𝑆) ≤ |quad 𝑘| +8𝑛. Combining these bounds shows that𝑁𝑘(𝐷) and 𝑁𝑘(𝑆) are related up to𝑂(𝑛) error. Apply- ing fact (2) to𝐷=𝐴−𝐴gives 4𝑁1(𝐷) +𝑁 3(𝐷) ≤3|𝐷| +2𝑁 2(𝐷) and then substitute the above two quadruple transfer bounds for𝑘= 1, 2, 3and collecting al...

  80. [80]

    Í 𝑅𝑛,𝑘 ≤2𝑛2 𝑛/2 =𝑜 2𝑛+1 𝑛6 . 44 Advancing Mathematics Research with AI-Driven Formal Proof Search Since all residual terms are bounded by𝑜 2𝑛+1 𝑛6 , we conclude that 𝑎𝑛 =𝑓 𝑛 +𝑜 2𝑛+1 𝑛6 = 2𝑛+1 𝑛 1+ 1 𝑛 + 3 𝑛2 + 13 𝑛3 + 75 𝑛4 + 541 𝑛5 +𝑜 1 𝑛5 This establishes the conjectured asymptotic expansion.□ Theorem.(A conjecture of OEIS A228143, 2018) Let𝑠𝑚 = Í𝑚 𝑘=0 ...

Showing first 80 references.