Advancing Mathematics Research with AI-Driven Formal Proof Search

Adam Zsolt Wagner; Aja Huang; Andrew Ferraiuolo; Anja Surina; Anton Kovsharov; Arun Suggala; Codrut Grosu; Edward Lockhart; Eric Wieser; Francisco J. R. Ruiz

arxiv: 2605.22763 · v2 · pith:W6IWLF6Enew · submitted 2026-05-21 · 💻 cs.AI

Advancing Mathematics Research with AI-Driven Formal Proof Search

George Tsoukalas , Anton Kovsharov , Sergey Shirobokov , Anja Surina , Moritz Firsching , Gergely B\'erczi , Francisco J. R. Ruiz , Arun Suggala

show 13 more authors

Adam Zsolt Wagner Eric Wieser Lei Yu Aja Huang Mikl\'os Z. Horv\'ath Andrew Ferraiuolo Henryk Michalewski Edward Lockhart Codrut Grosu Thomas Hubert Matej Balog Pushmeet Kohli Swarat Chaudhuri

This is my paper

Pith reviewed 2026-05-22 05:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI for mathematicsformal proofsLean theorem proverErdős problemsOEIS conjecturesautomated theorem provingproof searchmathematical reasoning

0 comments

The pith

An AI agent using large language models generates and verifies formal proofs in Lean to solve open Erdős problems and OEIS conjectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can help solve genuinely open mathematical questions by writing formal proofs that a computer can check in the Lean theorem prover. Their strongest agent found proofs for 9 out of 353 listed Erdős problems at a few hundred dollars each and settled 44 out of 492 OEIS conjectures. A simpler agent that keeps generating candidate proofs with the language model and checking them in Lean also succeeded on the Erdős set, though it cost more on the hardest cases. The same approach is already being tried inside active research programs in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. Readers should care because the method supplies an independent, machine-checkable record instead of relying only on human judgment for difficult claims.

Core claim

The central claim is that an AI agent can autonomously resolve open problems by producing Lean-verifiable formal proofs, as shown by the resolution of 9 Erdős problems and 44 OEIS conjectures, with the technique now applied across multiple branches of mathematics.

What carries the argument

An agent that alternates large-language-model generation of candidate Lean proofs with automated verification inside the Lean kernel.

If this is right

Formal verification can be inserted into the workflow of active mathematical research at modest per-problem cost.
The same agent design works across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics.
A basic alternating generation-and-verification loop already reproduces the Erdős successes, though harder problems raise its cost.
Large-scale testing on hundreds of open statements is now feasible with current language-model capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the cost stays low, smaller research groups could run their own formal-proof searches on local conjectures without large budgets.
The approach could be tried in other formal systems besides Lean to broaden the set of checkable problems.
Hybrid loops that let human mathematicians edit the agent's intermediate steps might solve still harder open questions.
Success on Erdős and OEIS lists suggests the method could be aimed next at problems that have resisted both human and computer attack for decades.

Load-bearing premise

The targeted Erdős problems and OEIS conjectures were still open before this work, and the Lean outputs are complete formal proofs produced without essential human intervention in the derivation steps.

What would settle it

An independent Lean expert confirming that one of the reported proofs is incomplete or that any of the nine Erdős problems had already been solved before the agent ran would disprove the claim of autonomous resolution.

Figures

Figures reproduced from arXiv: 2605.22763 by Adam Zsolt Wagner, Aja Huang, Andrew Ferraiuolo, Anja Surina, Anton Kovsharov, Arun Suggala, Codrut Grosu, Edward Lockhart, Eric Wieser, Francisco J. R. Ruiz, George Tsoukalas, Gergely B\'erczi, Henryk Michalewski, Lei Yu, Matej Balog, Mikl\'os Z. Horv\'ath, Moritz Firsching, Pushmeet Kohli, Sergey Shirobokov, Swarat Chaudhuri, Thomas Hubert.

**Figure 1.** Figure 1: Pseudocode for ’s main components. The main loop creates a pool of asynchronous sketchers and raters and awaits the creation of a full (sorry-free) proof. Each prover subagent samples a parent sketch from the population using the P-UCB strategy and creates a stateful conversation session with an LLM (Gemini 3.1 Pro) instance. In this conversation, it receives instructions to perform tool calls from the LLM… view at source ↗

**Figure 4.** Figure 4: Sketcher agent prompt (condensed). Elided text is represented by [...]. Text in braces denotes template variables populated at runtime. For example, {code} is replaced by the current Lean file. 5 [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

**Figure 3.** Figure 3: Rater agent prompt (condensed). Elided text is represented by [...]. Text in braces denotes template variables populated at runtime. For example, {player blocks} is replaced by the sketches to be compared. 4 [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗

**Figure 2.** Figure 2: Rater agent prompt (condensed). Elided text is represented by [...]. Text in braces denotes template variables populated at runtime. For example, {player blocks} is replaced by the sketches to be compared. 3 [PITH_FULL_IMAGE:figures/full_fig_p025_2.png] view at source ↗

read the original abstract

Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erd\H{o}s problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erd\H{o}s successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper scales LLM agents to hundreds of open Erdős problems and OEIS conjectures with Lean verification, claiming 9 solutions at low cost, but the autonomy claims need per-problem confirmation on openness and proof completeness.

read the letter

The paper's main contribution is running LLM agents at scale on open Erdős problems and OEIS conjectures, claiming to resolve 9 of 353 Erdős ones in Lean at low per-problem cost. This is new because earlier LLM formalization stayed smaller or used already-solved statements. They do a solid job with the experimental setup. Selecting hundreds of externally defined open problems and verifying in Lean keeps things non-circular and reproducible. The comparison of basic versus capable agents shows where design choices matter, and the note on deployments in combinatorics and other fields adds some real-world context. The math and data presentation looks straightforward without obvious fitting issues. The soft spots are around the headline results. We need independent checks that those 9 problems were truly open and that the Lean outputs are complete formal proofs rather than partial ones needing human fixes. The abstract does not provide those per-problem details or proof-size stats, so the autonomy and cost claims could be overstated if any cases fall short. No error bars or failure mode breakdown is another gap, though not central. The stress-test concern holds up based on what's in the abstract. This paper is for folks working on AI for theorem proving or using formal tools in math research. A reader focused on empirical evaluations of LLMs in hard domains would get value from the scale and the verification approach. It shows honest engagement with the literature on LLM limitations by emphasizing formal search. I would recommend sending it for peer review. The empirical scope is substantial enough to deserve referee time, even with the need for added evidence on the 9 cases.

Referee Report

2 major / 2 minor

Summary. The manuscript reports the first large-scale evaluation of LLM-based agents for generating formal proofs in Lean to solve open mathematical problems. The central claims are that the most capable agent autonomously resolved 9 of 353 open Erdős problems at a per-problem cost of a few hundred dollars, proved 44 of 492 OEIS conjectures, and is now deployed across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics; a simpler alternating generation-verification agent is shown to replicate some successes but at higher cost on hard instances.

Significance. If the reported successes are independently confirmed, the work would demonstrate that current LLM-driven formal proof search can address genuinely open problems at modest cost, providing a concrete benchmark for AI-assisted mathematics. The use of Lean verification supplies a high standard of correctness that distinguishes this from informal LLM reasoning claims, and the multi-domain deployment suggests immediate research utility beyond the Erdős and OEIS benchmarks.

major comments (2)

[Abstract and results section] Abstract and results section: the headline claim that the agent 'autonomously resolved 9 of 353 open Erdős problems' is load-bearing for the autonomy and cost assertions, yet the manuscript supplies neither per-problem literature-search logs confirming the problems were previously unsolved nor proof-size statistics or artifact links demonstrating that the Lean outputs are complete, hole-free formal proofs rather than partial derivations requiring human post-editing. Without these, the success counts cannot be fully assessed.
[Methods and evaluation sections] Methods and evaluation sections: the reported success counts (9/353 Erdős, 44/492 OEIS) are presented without error bars, failure-mode breakdowns, or independent audit of the Lean artifacts. This omission directly affects the reliability of the comparative claim that the basic agent 'replicated the Erdős successes but proved costlier on the hardest problems.'

minor comments (2)

[§2] The description of agent architectures would benefit from a clearer tabular comparison of the 'most capable' versus 'basic' designs, including exact prompting strategies and Lean interaction loops.
[Figures and tables] Figure captions and table legends should explicitly state whether the reported costs include only inference or also human oversight time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the presentation of our results on LLM-based formal proof search. We address each major comment below and describe the revisions we will incorporate to improve transparency and verifiability.

read point-by-point responses

Referee: [Abstract and results section] Abstract and results section: the headline claim that the agent 'autonomously resolved 9 of 353 open Erdős problems' is load-bearing for the autonomy and cost assertions, yet the manuscript supplies neither per-problem literature-search logs confirming the problems were previously unsolved nor proof-size statistics or artifact links demonstrating that the Lean outputs are complete, hole-free formal proofs rather than partial derivations requiring human post-editing. Without these, the success counts cannot be fully assessed.

Authors: We agree that explicit documentation strengthens the autonomy and correctness claims. In the revised manuscript we will add a dedicated subsection (and supplementary table) that lists, for each of the nine solved Erdős problems, the pre-experiment literature references establishing that the problem remained open, together with links to the corresponding Lean proof files in a public repository. We will also report proof-size statistics (number of tactics, lines of code, and kernel-verification time) for every success. All reported proofs were produced and checked end-to-end by the agent with no human post-editing; the added artifacts will make this verifiable. revision: yes
Referee: [Methods and evaluation sections] Methods and evaluation sections: the reported success counts (9/353 Erdős, 44/492 OEIS) are presented without error bars, failure-mode breakdowns, or independent audit of the Lean artifacts. This omission directly affects the reliability of the comparative claim that the basic agent 'replicated the Erdős successes but proved costlier on the hardest problems.'

Authors: We concur that quantitative uncertainty measures and failure analysis improve interpretability. The revision will include binomial 95 % confidence intervals for both success rates and a breakdown of the dominant failure modes (timeout, tactic failure, search exhaustion) across the full set of attempts. We will also expand the agent-comparison section with per-problem cost figures for the hardest instances to support the cost-effectiveness claim. All Lean artifacts and run logs will be released publicly to enable independent audit; performing a third-party audit ourselves lies outside the scope of the present study. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are empirical evaluations on external benchmarks

full rationale

The paper reports counts of resolved open Erdős problems and OEIS conjectures obtained by running LLM-based agents and verifying outputs in Lean. These success metrics are generated by direct application to independently defined external problems rather than by fitting parameters to subsets of the target data or by self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that reduce the headline claims to the inputs by construction. The methodology relies on external verification and open problem lists, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that Lean correctly certifies the generated proofs and that the chosen problem sets were open before the experiments; no new mathematical axioms or entities are introduced.

axioms (1)

domain assumption The Lean theorem prover provides sound verification of formal proofs
All reported successes depend on Lean accepting the generated statements as valid proofs.

pith-pipeline@v0.9.0 · 5761 in / 1216 out tokens · 42052 ms · 2026-05-22T05:05:27.994522+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Code evolution for link prediction in complex networks
cs.SI 2026-06 unverdicted novelty 6.0

Code evolution produces link prediction algorithms with average AUC of 0.915 versus 0.783 for human-designed methods across 580 networks, with better scalability.
Recursions for Mock Theta Functions
math.NT 2026-06 unverdicted novelty 5.0

Derives weighted recursions for coefficients of mock theta functions f and ω via holomorphic projection on vector-valued Rankin-Cohen brackets, exploiting vanishing cusp form spaces.
Skills for the future software profession: beyond agentic AI!
cs.SE 2026-06 unverdicted novelty 2.0

Round-table discussions with researchers and practitioners indicate verification and validation skills will become central for software engineers in the agentic AI era.