Formalizing Mathematics at Scale

Ahmad Rammal; Amaury Hayat; Charles Arnal; Fabian Gloeckle; Julia Kempe; Niket Patel; Remi Munos; Vivien Cabannes

arxiv: 2605.29955 · v1 · pith:Z5CG7KAYnew · submitted 2026-05-28 · 💻 cs.AI

Formalizing Mathematics at Scale

Ahmad Rammal , Niket Patel , Fabian Gloeckle , Amaury Hayat , Julia Kempe , Remi Munos , Charles Arnal , Vivien Cabannes This is my paper

Pith reviewed 2026-06-29 07:32 UTC · model grok-4.3

classification 💻 cs.AI

keywords autoformalizationLean 4multi-agent systemsformal mathematicstheorem verificationtextbook formalizationAtlas libraryLLM agents

0 comments

The pith

A multi-agent system translates 26 math textbooks into a verified Lean library of 45,000 declarations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoformBot, a system that coordinates thousands of LLM agents equipped with Lean verification tools to convert informal textbook prose into formal definitions and proofs. It applies this process to 26 open-access texts covering analysis, algebra, topology, combinatorics, and probability, yielding Atlas with over 45,000 Lean 4 declarations and 500,000 lines of code. The central demonstration is that the resulting library is machine-verified, which the authors take as evidence that autoformalization of core graduate mathematics can be performed at scale. This feasibility claim rests on the system's handling of dependencies, collaboration, and verification without requiring extensive manual correction.

Core claim

AutoformBot orchestrates LLM agents with formal verification, dependency-aware scheduling, and version control to translate textbook content into Lean 4. The process applied to 26 textbooks produces Atlas, a library containing more than 45,000 declarations that pass Lean's checker. The authors state that this scale of autoformalization for graduate-level mathematics is now economically and technically feasible and opens the door to automated verification of research-level mathematics whether generated by humans or machines.

What carries the argument

AutoformBot, a multi-agent orchestration framework that equips LLM agents with Lean verification tools, dependency scheduling, and collaborative version control to convert informal mathematics into verified formal code.

If this is right

Textbook content across standard graduate fields can be turned into machine-checked libraries without manual formalization of each step.
Verified libraries of this size become practical to build and maintain through automated agent coordination.
Automated verification extends to both human-written and machine-generated mathematics at the level of entire textbooks.
Dependency management and version control can be handled at the scale of hundreds of thousands of lines without central human oversight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same orchestration might be applied to research papers rather than only textbooks, though coverage of open problems would require additional mechanisms.
The released Atlas library could serve as training data to improve future formalization models beyond the current system.
Extending the approach to fields with heavier use of diagrams or non-textual arguments would test whether the text-to-formal pipeline generalizes.
If the cost per declaration continues to drop, formalization could become a standard preprocessing step for new mathematical results.

Load-bearing premise

The Lean verifier catches all errors in the generated formalizations, including any subtle mismatches between the original textbook intent and the translated statements.

What would settle it

An error in one of the Atlas declarations that passes Lean's type checker but contradicts the corresponding textbook statement.

read the original abstract

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoformBot scales autoformalization to 26 textbooks and 45k Lean declarations with a multi-agent setup, but provides no metrics on how faithfully the output matches the source texts.

read the letter

The main takeaway is that this paper shows a multi-agent LLM system can generate a verified Lean library from 26 open textbooks across several fields, reaching 45,000 declarations and half a million lines of code. They release both the AutoformBot framework and the Atlas library.

The engineering is the part that stands out. Coordinating thousands of agents with dependency-aware scheduling and version control to handle the full volume of graduate-level material is a concrete step beyond smaller prior efforts. Applying the same pipeline to analysis, algebra, topology, combinatorics, and probability demonstrates the approach is not tied to one narrow domain. Shipping the code and the resulting library gives others something they can actually run and inspect.

The clear limitation is the missing evidence on translation fidelity. Lean confirms that the generated code is internally consistent, but the paper reports no human audits, alignment scores, or failure rates on whether the formal statements and proofs preserve the original textbook content without omissions or shifts in assumptions. That leaves the claim of economic feasibility for graduate-level material resting on an unquantified premise.

This is for people working on AI-assisted formal mathematics or large-scale library construction. Readers who need practical details on orchestrating agents for formalization tasks will get usable information from the architecture description.

The work deserves peer review. The scale increase and the released artifacts are real contributions that referees can evaluate directly, even though additional fidelity measurements would make the results stronger.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoformBot, a multi-agent LLM orchestration system equipped with formal verification, dependency scheduling, and version control. It applies the system to 26 open-access textbooks across analysis, algebra, topology, combinatorics, and probability to produce Atlas, a Lean 4 library containing over 45,000 declarations and 500,000 lines of code. The authors release both AutoformBot and Atlas as open artifacts and conclude that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible.

Significance. If the generated formalizations accurately reflect the source textbooks, the work provides a concrete empirical demonstration of large-scale autoformalization together with fully released artifacts. The open release of both the multi-agent framework and the resulting library constitutes a clear strength that can support downstream research in verified mathematics.

major comments (2)

[Abstract] Abstract: the description of Atlas as a 'verified library' produced from textbook prose is load-bearing for the feasibility claim, yet the manuscript reports no fidelity metrics, human audits, or alignment statistics between the informal source statements and the generated Lean declarations. Lean verification alone confirms internal consistency of the output code but does not establish semantic fidelity or rule out systematic translation gaps.
[Results] Results section describing Atlas: no quantitative figures are supplied for coverage completeness (e.g., fraction of textbook theorems successfully formalized), post-verification error rates, or total human effort required, leaving the strength of evidence for the central scalability claim moderate.

minor comments (2)

[Methods] The multi-agent workflow diagram would benefit from explicit annotation of the dependency-aware scheduler and version-control integration steps.
[Results] A short table summarizing per-textbook declaration counts and verification success rates would improve readability of the scale claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on fidelity and quantitative evidence. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the description of Atlas as a 'verified library' produced from textbook prose is load-bearing for the feasibility claim, yet the manuscript reports no fidelity metrics, human audits, or alignment statistics between the informal source statements and the generated Lean declarations. Lean verification alone confirms internal consistency of the output code but does not establish semantic fidelity or rule out systematic translation gaps.

Authors: We agree that the manuscript does not report explicit fidelity metrics or human audits, and that Lean verification alone establishes internal consistency rather than semantic alignment with the source textbooks. In the revised version we will add a dedicated evaluation subsection reporting human audit results on a representative sample of declarations (including alignment accuracy and identified translation issues) together with any available system-level success statistics. We will also revise the abstract to distinguish between Lean verification and source fidelity. revision: yes
Referee: [Results] Results section describing Atlas: no quantitative figures are supplied for coverage completeness (e.g., fraction of textbook theorems successfully formalized), post-verification error rates, or total human effort required, leaving the strength of evidence for the central scalability claim moderate.

Authors: We acknowledge the absence of these quantitative figures in the current results section. The manuscript emphasizes aggregate scale rather than per-textbook coverage or error rates. In revision we will add coverage estimates (where textbook theorem counts can be determined), post-verification error rates extracted from AutoformBot logs, and an account of human oversight effort. These additions will strengthen the evidence presented for scalability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems demonstration with released artifacts

full rationale

The paper describes construction and application of AutoformBot to generate Atlas from 26 textbooks, yielding 45k Lean declarations. The central claim of feasibility is an empirical outcome from running the system, not a derivation that reduces to fitted parameters, self-definitions, or self-citation chains. Lean verification is an external checker independent of the paper's inputs. No equations, uniqueness theorems, or ansatzes are invoked that collapse to the paper's own construction. The result is self-contained against external benchmarks via released code and library.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The feasibility claim rests on the correctness of Lean 4 verification and the assumption that the agent system can translate textbook content without introducing unverifiable gaps. No free parameters or invented physical entities are involved.

axioms (1)

standard math Lean 4's type checker and kernel correctly implement the underlying logic and detect all proof errors.
The entire verification claim depends on Lean 4 being a sound proof assistant; invoked when stating that Atlas consists of 'verified' declarations.

invented entities (1)

AutoformBot multi-agent orchestration system no independent evidence
purpose: Coordinate LLMs, verification tools, and dependency scheduling for textbook translation
New system introduced by the authors to achieve the reported scale.

pith-pipeline@v0.9.1-grok · 5709 in / 1283 out tokens · 27098 ms · 2026-06-29T07:32:59.772072+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 10 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
[2]

Blumberg, Martin Hairer, Joe Kileel, Tamara G

Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First P roof, 2026. https://arxiv.org/abs/2602.05192

work page arXiv 2026
[3]

Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Alex Best, Alberto Bietti, Mathis Federico, Sergei Gukov, Daniel Halpern-Leistner, et al. Aristotle: IMO -level automated theorem proving. arXiv preprint arXiv:2510.01346, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

NuminaMath-LEAN

AI-MO . NuminaMath-LEAN . https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN, 2025. Hugging Face dataset. 100K competition mathematics problems formalized in Lean 4. Accessed: 2026-04-29

2025
[5]

Model context protocol, 2024

Anthropic . Model context protocol, 2024. https://modelcontextprotocol.io

2024
[6]

Claude O pus 4.6 system card

Anthropic . Claude O pus 4.6 system card. https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, 2026

2026
[7]

Formalization of De Giorgi--Nash--Moser Theory in Lean

Scott Armstrong and Julia Kempe. Formalization of D e G iorgi-- N ash-- M oser T heory in L ean, 2026. https://arxiv.org/abs/2604.05984

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Takeaways from the F irst- P roof trenches

Scott Armstrong, Julia Kempe, and Remi Munos. Takeaways from the F irst- P roof trenches. https://kempejulia1.github.io/1stProof-Attempt/1stProofTakeaways.pdf, 2026. Accessed: 2026-05-06

2026
[9]

Ayers, Dragomir Radev, and Jeremy Avigad

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet : Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023

work page arXiv 2023
[10]

The role of the Mizar mathematical library for interactive proof development in Mizar

Grzegorz Bancerek, Czes aw Byli \'n ski, Adam Grabowski, Artur Korni owicz, Roman Matuszewski, Adam Naumowicz, and Karol P a k. The role of the Mizar mathematical library for interactive proof development in Mizar . Journal of Automated Reasoning, 61 0 (1--4): 0 9--32, 2018. doi:10.1007/s10817-017-9440-6

work page doi:10.1007/s10817-017-9440-6 2018
[11]

Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions

Yves Bertot and Pierre Cast \'e ran. Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. Springer, 2004. ISBN 978-3-540-20854-9. doi:10.1007/978-3-662-07964-5

work page doi:10.1007/978-3-662-07964-5 2004
[12]

Bors-ng: A merge bot for GitHub pull requests

bors-ng contributors . Bors-ng: A merge bot for GitHub pull requests. https://github.com/bors-ng/bors-ng, 2017. GitHub repository

2017
[13]

A brief overview of Agda -- a functional language with dependent types

Ana Bove, Peter Dybjer, and Ulf Norell. A brief overview of Agda -- a functional language with dependent types. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 73--78. Springer, 2009. doi:10.1007/978-3-642-03359-9_6

work page doi:10.1007/978-3-642-03359-9_6 2009
[14]

Loogle: Lean search engine, 2023

Joachim Breitner. Loogle: Lean search engine, 2023. https://loogle.lean-lang.org

2023
[15]

Fel's conjecture on syzygies of numerical semigroups

Evan Chen, Chris Cummins, Dejan Grubisic, Leopold Haller, Letong Hong, Andranik Kurghinyan, Kenny Lau, Hugh Leather, Seewoo Lee, Aram Markosyan, et al. Fel's conjecture on syzygies of numerical semigroups. arXiv preprint arXiv:2602.03716, 2026

work page arXiv 2026
[16]

Seed-Prover : Deep and broad reasoning for automated theorem proving

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, et al. Seed-Prover : Deep and broad reasoning for automated theorem proving. arXiv preprint arXiv:2507.23726, 2025

work page arXiv 2025
[17]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

The Lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In International Conference on Automated Deduction (CADE), 2021

2021
[19]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

DeepSeek-AI . DeepSeek-Prover-V2 : Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

lean-lsp-mcp: MCP server for the Lean language server, 2025

Oliver Dressler. lean-lsp-mcp: MCP server for the Lean language server, 2025. https://github.com/oOo0oOo/lean-lsp-mcp

2025
[21]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...

work page arXiv 2026
[22]

Abel: Sample efficient online reinforcement learning for neural theorem proving

Fabian Gloeckle, Jannis Limperg, Gabriel Synnaeve, and Amaury Hayat. Abel: Sample efficient online reinforcement learning for neural theorem proving. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24, 2024

2024
[23]

Automatic Textbook Formalization

Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, and Amaury Hayat. Automatic textbook formalization. arXiv preprint arXiv:2604.03071, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Gemini 3.1 pro model card

Google DeepMind . Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026

2026
[25]

Mizar in a nutshell

Adam Grabowski, Artur Korni owicz, and Adam Naumowicz. Mizar in a nutshell. Journal of Formalized Reasoning, 3 0 (2): 0 153--245, 2010. doi:10.6092/issn.1972-5787/1980

work page doi:10.6092/issn.1972-5787/1980 2010
[26]

Progress in Formalizing Sphere Packing in Dimension 8

Sidharth Hariharan, Christopher Birkbeck, Seewoo Lee, Ho Kiu Gareth Ma, Bhavik Mehta, Auguste Poiroux, and Maryna Viazovska. A milestone in formalization: The sphere packing problem in dimension 8. arXiv preprint, 2026. https://arxiv.org/abs/2604.23468

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

HOL Light : An overview

John Harrison. HOL Light : An overview. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 60--66. Springer, 2009. doi:10.1007/978-3-642-03359-9_4

work page doi:10.1007/978-3-642-03359-9_4 2009
[28]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[29]

MetaGPT : Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, et al. MetaGPT : Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024

2024
[30]

Olympiad-level formal mathematical reasoning with reinforcement learning

Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z. Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 651: 0 607--613, 2025. doi:10.1038/s41586-025-09833-y

work page doi:10.1038/s41586-025-09833-y 2025
[31]

AI for Mathematics: Progress, Challenges, and Prospects

Haocheng Ju and Bin Dong. AI for mathematics: Progress, challenges, and prospects. arXiv preprint arXiv:2601.13209, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Hypertree proof search for neural theorem proving

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. Advances in neural information processing systems, 35: 0 26337--26349, 2022

2022
[33]

The Network Structure of Mathlib

Xinze Li, Nanyun Peng, Simone Severini, and Patrick Shafto. The network structure of M athlib. arXiv preprint arXiv:2604.24797, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Goedel-Prover : A frontier model for open-source automated theorem proving

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover : A frontier model for open-source automated theorem proving. arXiv preprint arXiv:2502.07640, 2025

work page arXiv 2025
[35]

Liu et al.Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-lean-agent: An open and general agentic reasoning system for formal mathematics. arXiv preprint arXiv:2601.14027, 2026

work page arXiv 2026
[36]

Strong pnt

Math Inc. Strong pnt. https://math-inc.github.io/strongpnt/, 2026. AI-generated Lean formalization of the strong Prime Number Theorem

2026
[37]

Megill and David A

Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Mathematical Proofs. Lulu Press, Morrisville, North Carolina, 2019. ISBN 978-0-359-70223-7. https://us.metamath.org/downloads/metamath.pdf

2019
[38]

Lean REPL , 2024

Scott Morrison. Lean REPL , 2024. https://github.com/leanprover-community/repl

2024
[39]

Paulson, and Markus Wenzel

Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002. ISBN 978-3-540-43376-7. doi:10.1007/3-540-45949-9

work page doi:10.1007/3-540-45949-9 2002
[40]

GPT-5.4 thinking system card

OpenAI . GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026

2026
[41]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[42]

Formal mathematics statement curriculum learning

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022

work page arXiv 2022
[43]

The Lean mathematical library

The Mathlib Community . The Lean mathematical library. arXiv preprint arXiv:1910.09336, 2020

work page arXiv 1910
[44]

Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition

George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems, 37: 0 11545--11569, 2024

2024
[45]

130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

2026
[46]

Mathlib conventions

Vasily Vilin. Mathlib conventions. https://github.com/Vilin97/mathlib-conventions, 2026. Community-maintained collection of Mathlib coding conventions

2026
[47]

Kaiyu Yang, Aidan M

Haiming Wang, Huajian Xin, Chuanyang Zheng, Li Lin, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jianwei Yin, Zhenguo Li, and Heng Liao. LEGO-Prover : Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2024

work page arXiv 2024
[48]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, et al. Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

M2F : Automated formalization of mathematical literature at scale, 2026

Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, and Zaiwen Wen. M2F : Automated formalization of mathematical literature at scale, 2026

2026
[50]

Jiang, Wenda Li, Markus N

Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

2022
[51]

DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data. In International Conference on Machine Learning (ICML), 2024

2024
[52]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE -agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[53]

Lean workbook: A large-scale Lean problem set formalized from natural language math problems

Huaiyuan Ying, Zijian Shi, Zhengying He, Bohan Gao, Chenyang Chen, Zhicheng Yu, Songrun Song, Qicheng Fan, Yinya Li, Lin Li, et al. Lean workbook: A large-scale Lean problem set formalized from natural language math problems. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[54]

mini F2F : A cross-system benchmark for formal O lympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. mini F2F : A cross-system benchmark for formal O lympiad-level mathematics. In International Conference on Learning Representations (ICLR), 2022

2022
[55]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

[2] [2]

Blumberg, Martin Hairer, Joe Kileel, Tamara G

Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First P roof, 2026. https://arxiv.org/abs/2602.05192

work page arXiv 2026

[3] [3]

Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Alex Best, Alberto Bietti, Mathis Federico, Sergei Gukov, Daniel Halpern-Leistner, et al. Aristotle: IMO -level automated theorem proving. arXiv preprint arXiv:2510.01346, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

NuminaMath-LEAN

AI-MO . NuminaMath-LEAN . https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN, 2025. Hugging Face dataset. 100K competition mathematics problems formalized in Lean 4. Accessed: 2026-04-29

2025

[5] [5]

Model context protocol, 2024

Anthropic . Model context protocol, 2024. https://modelcontextprotocol.io

2024

[6] [6]

Claude O pus 4.6 system card

Anthropic . Claude O pus 4.6 system card. https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, 2026

2026

[7] [7]

Formalization of De Giorgi--Nash--Moser Theory in Lean

Scott Armstrong and Julia Kempe. Formalization of D e G iorgi-- N ash-- M oser T heory in L ean, 2026. https://arxiv.org/abs/2604.05984

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

Takeaways from the F irst- P roof trenches

Scott Armstrong, Julia Kempe, and Remi Munos. Takeaways from the F irst- P roof trenches. https://kempejulia1.github.io/1stProof-Attempt/1stProofTakeaways.pdf, 2026. Accessed: 2026-05-06

2026

[9] [9]

Ayers, Dragomir Radev, and Jeremy Avigad

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet : Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023

work page arXiv 2023

[10] [10]

The role of the Mizar mathematical library for interactive proof development in Mizar

Grzegorz Bancerek, Czes aw Byli \'n ski, Adam Grabowski, Artur Korni owicz, Roman Matuszewski, Adam Naumowicz, and Karol P a k. The role of the Mizar mathematical library for interactive proof development in Mizar . Journal of Automated Reasoning, 61 0 (1--4): 0 9--32, 2018. doi:10.1007/s10817-017-9440-6

work page doi:10.1007/s10817-017-9440-6 2018

[11] [11]

Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions

Yves Bertot and Pierre Cast \'e ran. Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. Springer, 2004. ISBN 978-3-540-20854-9. doi:10.1007/978-3-662-07964-5

work page doi:10.1007/978-3-662-07964-5 2004

[12] [12]

Bors-ng: A merge bot for GitHub pull requests

bors-ng contributors . Bors-ng: A merge bot for GitHub pull requests. https://github.com/bors-ng/bors-ng, 2017. GitHub repository

2017

[13] [13]

A brief overview of Agda -- a functional language with dependent types

Ana Bove, Peter Dybjer, and Ulf Norell. A brief overview of Agda -- a functional language with dependent types. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 73--78. Springer, 2009. doi:10.1007/978-3-642-03359-9_6

work page doi:10.1007/978-3-642-03359-9_6 2009

[14] [14]

Loogle: Lean search engine, 2023

Joachim Breitner. Loogle: Lean search engine, 2023. https://loogle.lean-lang.org

2023

[15] [15]

Fel's conjecture on syzygies of numerical semigroups

Evan Chen, Chris Cummins, Dejan Grubisic, Leopold Haller, Letong Hong, Andranik Kurghinyan, Kenny Lau, Hugh Leather, Seewoo Lee, Aram Markosyan, et al. Fel's conjecture on syzygies of numerical semigroups. arXiv preprint arXiv:2602.03716, 2026

work page arXiv 2026

[16] [16]

Seed-Prover : Deep and broad reasoning for automated theorem proving

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, et al. Seed-Prover : Deep and broad reasoning for automated theorem proving. arXiv preprint arXiv:2507.23726, 2025

work page arXiv 2025

[17] [17]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[18] [18]

The Lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In International Conference on Automated Deduction (CADE), 2021

2021

[19] [19]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

DeepSeek-AI . DeepSeek-Prover-V2 : Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

lean-lsp-mcp: MCP server for the Lean language server, 2025

Oliver Dressler. lean-lsp-mcp: MCP server for the Lean language server, 2025. https://github.com/oOo0oOo/lean-lsp-mcp

2025

[21] [21]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...

work page arXiv 2026

[22] [22]

Abel: Sample efficient online reinforcement learning for neural theorem proving

Fabian Gloeckle, Jannis Limperg, Gabriel Synnaeve, and Amaury Hayat. Abel: Sample efficient online reinforcement learning for neural theorem proving. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24, 2024

2024

[23] [23]

Automatic Textbook Formalization

Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, and Amaury Hayat. Automatic textbook formalization. arXiv preprint arXiv:2604.03071, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[24] [24]

Gemini 3.1 pro model card

Google DeepMind . Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026

2026

[25] [25]

Mizar in a nutshell

Adam Grabowski, Artur Korni owicz, and Adam Naumowicz. Mizar in a nutshell. Journal of Formalized Reasoning, 3 0 (2): 0 153--245, 2010. doi:10.6092/issn.1972-5787/1980

work page doi:10.6092/issn.1972-5787/1980 2010

[26] [26]

Progress in Formalizing Sphere Packing in Dimension 8

Sidharth Hariharan, Christopher Birkbeck, Seewoo Lee, Ho Kiu Gareth Ma, Bhavik Mehta, Auguste Poiroux, and Maryna Viazovska. A milestone in formalization: The sphere packing problem in dimension 8. arXiv preprint, 2026. https://arxiv.org/abs/2604.23468

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

HOL Light : An overview

John Harrison. HOL Light : An overview. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 60--66. Springer, 2009. doi:10.1007/978-3-642-03359-9_4

work page doi:10.1007/978-3-642-03359-9_4 2009

[28] [28]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[29] [29]

MetaGPT : Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, et al. MetaGPT : Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024

2024

[30] [30]

Olympiad-level formal mathematical reasoning with reinforcement learning

Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z. Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 651: 0 607--613, 2025. doi:10.1038/s41586-025-09833-y

work page doi:10.1038/s41586-025-09833-y 2025

[31] [31]

AI for Mathematics: Progress, Challenges, and Prospects

Haocheng Ju and Bin Dong. AI for mathematics: Progress, challenges, and prospects. arXiv preprint arXiv:2601.13209, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

Hypertree proof search for neural theorem proving

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. Advances in neural information processing systems, 35: 0 26337--26349, 2022

2022

[33] [33]

The Network Structure of Mathlib

Xinze Li, Nanyun Peng, Simone Severini, and Patrick Shafto. The network structure of M athlib. arXiv preprint arXiv:2604.24797, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Goedel-Prover : A frontier model for open-source automated theorem proving

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover : A frontier model for open-source automated theorem proving. arXiv preprint arXiv:2502.07640, 2025

work page arXiv 2025

[35] [35]

Liu et al.Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-lean-agent: An open and general agentic reasoning system for formal mathematics. arXiv preprint arXiv:2601.14027, 2026

work page arXiv 2026

[36] [36]

Strong pnt

Math Inc. Strong pnt. https://math-inc.github.io/strongpnt/, 2026. AI-generated Lean formalization of the strong Prime Number Theorem

2026

[37] [37]

Megill and David A

Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Mathematical Proofs. Lulu Press, Morrisville, North Carolina, 2019. ISBN 978-0-359-70223-7. https://us.metamath.org/downloads/metamath.pdf

2019

[38] [38]

Lean REPL , 2024

Scott Morrison. Lean REPL , 2024. https://github.com/leanprover-community/repl

2024

[39] [39]

Paulson, and Markus Wenzel

Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002. ISBN 978-3-540-43376-7. doi:10.1007/3-540-45949-9

work page doi:10.1007/3-540-45949-9 2002

[40] [40]

GPT-5.4 thinking system card

OpenAI . GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026

2026

[41] [41]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[42] [42]

Formal mathematics statement curriculum learning

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022

work page arXiv 2022

[43] [43]

The Lean mathematical library

The Mathlib Community . The Lean mathematical library. arXiv preprint arXiv:1910.09336, 2020

work page arXiv 1910

[44] [44]

Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition

George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems, 37: 0 11545--11569, 2024

2024

[45] [45]

130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026

2026

[46] [46]

Mathlib conventions

Vasily Vilin. Mathlib conventions. https://github.com/Vilin97/mathlib-conventions, 2026. Community-maintained collection of Mathlib coding conventions

2026

[47] [47]

Kaiyu Yang, Aidan M

Haiming Wang, Huajian Xin, Chuanyang Zheng, Li Lin, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jianwei Yin, Zhenguo Li, and Heng Liao. LEGO-Prover : Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2024

work page arXiv 2024

[48] [48]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, et al. Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

M2F : Automated formalization of mathematical literature at scale, 2026

Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, and Zaiwen Wen. M2F : Automated formalization of mathematical literature at scale, 2026

2026

[50] [50]

Jiang, Wenda Li, Markus N

Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

2022

[51] [51]

DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data. In International Conference on Machine Learning (ICML), 2024

2024

[52] [52]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE -agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[53] [53]

Lean workbook: A large-scale Lean problem set formalized from natural language math problems

Huaiyuan Ying, Zijian Shi, Zhengying He, Bohan Gao, Chenyang Chen, Zhicheng Yu, Songrun Song, Qicheng Fan, Yinya Li, Lin Li, et al. Lean workbook: A large-scale Lean problem set formalized from natural language math problems. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[54] [54]

mini F2F : A cross-system benchmark for formal O lympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. mini F2F : A cross-system benchmark for formal O lympiad-level mathematics. In International Conference on Learning Representations (ICLR), 2022

2022

[55] [55]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023