Formalizing Mathematics at Scale
Pith reviewed 2026-06-29 07:32 UTC · model grok-4.3
The pith
A multi-agent system translates 26 math textbooks into a verified Lean library of 45,000 declarations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoformBot orchestrates LLM agents with formal verification, dependency-aware scheduling, and version control to translate textbook content into Lean 4. The process applied to 26 textbooks produces Atlas, a library containing more than 45,000 declarations that pass Lean's checker. The authors state that this scale of autoformalization for graduate-level mathematics is now economically and technically feasible and opens the door to automated verification of research-level mathematics whether generated by humans or machines.
What carries the argument
AutoformBot, a multi-agent orchestration framework that equips LLM agents with Lean verification tools, dependency scheduling, and collaborative version control to convert informal mathematics into verified formal code.
If this is right
- Textbook content across standard graduate fields can be turned into machine-checked libraries without manual formalization of each step.
- Verified libraries of this size become practical to build and maintain through automated agent coordination.
- Automated verification extends to both human-written and machine-generated mathematics at the level of entire textbooks.
- Dependency management and version control can be handled at the scale of hundreds of thousands of lines without central human oversight.
Where Pith is reading between the lines
- The same orchestration might be applied to research papers rather than only textbooks, though coverage of open problems would require additional mechanisms.
- The released Atlas library could serve as training data to improve future formalization models beyond the current system.
- Extending the approach to fields with heavier use of diagrams or non-textual arguments would test whether the text-to-formal pipeline generalizes.
- If the cost per declaration continues to drop, formalization could become a standard preprocessing step for new mathematical results.
Load-bearing premise
The Lean verifier catches all errors in the generated formalizations, including any subtle mismatches between the original textbook intent and the translated statements.
What would settle it
An error in one of the Atlas declarations that passes Lean's type checker but contradicts the corresponding textbook statement.
read the original abstract
We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AutoformBot, a multi-agent LLM orchestration system equipped with formal verification, dependency scheduling, and version control. It applies the system to 26 open-access textbooks across analysis, algebra, topology, combinatorics, and probability to produce Atlas, a Lean 4 library containing over 45,000 declarations and 500,000 lines of code. The authors release both AutoformBot and Atlas as open artifacts and conclude that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible.
Significance. If the generated formalizations accurately reflect the source textbooks, the work provides a concrete empirical demonstration of large-scale autoformalization together with fully released artifacts. The open release of both the multi-agent framework and the resulting library constitutes a clear strength that can support downstream research in verified mathematics.
major comments (2)
- [Abstract] Abstract: the description of Atlas as a 'verified library' produced from textbook prose is load-bearing for the feasibility claim, yet the manuscript reports no fidelity metrics, human audits, or alignment statistics between the informal source statements and the generated Lean declarations. Lean verification alone confirms internal consistency of the output code but does not establish semantic fidelity or rule out systematic translation gaps.
- [Results] Results section describing Atlas: no quantitative figures are supplied for coverage completeness (e.g., fraction of textbook theorems successfully formalized), post-verification error rates, or total human effort required, leaving the strength of evidence for the central scalability claim moderate.
minor comments (2)
- [Methods] The multi-agent workflow diagram would benefit from explicit annotation of the dependency-aware scheduler and version-control integration steps.
- [Results] A short table summarizing per-textbook declaration counts and verification success rates would improve readability of the scale claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on fidelity and quantitative evidence. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the description of Atlas as a 'verified library' produced from textbook prose is load-bearing for the feasibility claim, yet the manuscript reports no fidelity metrics, human audits, or alignment statistics between the informal source statements and the generated Lean declarations. Lean verification alone confirms internal consistency of the output code but does not establish semantic fidelity or rule out systematic translation gaps.
Authors: We agree that the manuscript does not report explicit fidelity metrics or human audits, and that Lean verification alone establishes internal consistency rather than semantic alignment with the source textbooks. In the revised version we will add a dedicated evaluation subsection reporting human audit results on a representative sample of declarations (including alignment accuracy and identified translation issues) together with any available system-level success statistics. We will also revise the abstract to distinguish between Lean verification and source fidelity. revision: yes
-
Referee: [Results] Results section describing Atlas: no quantitative figures are supplied for coverage completeness (e.g., fraction of textbook theorems successfully formalized), post-verification error rates, or total human effort required, leaving the strength of evidence for the central scalability claim moderate.
Authors: We acknowledge the absence of these quantitative figures in the current results section. The manuscript emphasizes aggregate scale rather than per-textbook coverage or error rates. In revision we will add coverage estimates (where textbook theorem counts can be determined), post-verification error rates extracted from AutoformBot logs, and an account of human oversight effort. These additions will strengthen the evidence presented for scalability. revision: yes
Circularity Check
No circularity: empirical systems demonstration with released artifacts
full rationale
The paper describes construction and application of AutoformBot to generate Atlas from 26 textbooks, yielding 45k Lean declarations. The central claim of feasibility is an empirical outcome from running the system, not a derivation that reduces to fitted parameters, self-definitions, or self-citation chains. Lean verification is an external checker independent of the paper's inputs. No equations, uniqueness theorems, or ansatzes are invoked that collapse to the paper's own construction. The result is self-contained against external benchmarks via released code and library.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Lean 4's type checker and kernel correctly implement the underlying logic and detect all proof errors.
invented entities (1)
-
AutoformBot multi-agent orchestration system
no independent evidence
Reference graph
Works this paper leans on
-
[1]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Blumberg, Martin Hairer, Joe Kileel, Tamara G
Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First P roof, 2026. https://arxiv.org/abs/2602.05192
-
[3]
Aristotle: IMO-level Automated Theorem Proving
Tudor Achim, Alex Best, Alberto Bietti, Mathis Federico, Sergei Gukov, Daniel Halpern-Leistner, et al. Aristotle: IMO -level automated theorem proving. arXiv preprint arXiv:2510.01346, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
NuminaMath-LEAN
AI-MO . NuminaMath-LEAN . https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN, 2025. Hugging Face dataset. 100K competition mathematics problems formalized in Lean 4. Accessed: 2026-04-29
2025
-
[5]
Model context protocol, 2024
Anthropic . Model context protocol, 2024. https://modelcontextprotocol.io
2024
-
[6]
Claude O pus 4.6 system card
Anthropic . Claude O pus 4.6 system card. https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, 2026
2026
-
[7]
Formalization of De Giorgi--Nash--Moser Theory in Lean
Scott Armstrong and Julia Kempe. Formalization of D e G iorgi-- N ash-- M oser T heory in L ean, 2026. https://arxiv.org/abs/2604.05984
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[8]
Takeaways from the F irst- P roof trenches
Scott Armstrong, Julia Kempe, and Remi Munos. Takeaways from the F irst- P roof trenches. https://kempejulia1.github.io/1stProof-Attempt/1stProofTakeaways.pdf, 2026. Accessed: 2026-05-06
2026
-
[9]
Ayers, Dragomir Radev, and Jeremy Avigad
Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet : Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023
-
[10]
The role of the Mizar mathematical library for interactive proof development in Mizar
Grzegorz Bancerek, Czes aw Byli \'n ski, Adam Grabowski, Artur Korni owicz, Roman Matuszewski, Adam Naumowicz, and Karol P a k. The role of the Mizar mathematical library for interactive proof development in Mizar . Journal of Automated Reasoning, 61 0 (1--4): 0 9--32, 2018. doi:10.1007/s10817-017-9440-6
-
[11]
Yves Bertot and Pierre Cast \'e ran. Interactive Theorem Proving and Program Development: Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. Springer, 2004. ISBN 978-3-540-20854-9. doi:10.1007/978-3-662-07964-5
-
[12]
Bors-ng: A merge bot for GitHub pull requests
bors-ng contributors . Bors-ng: A merge bot for GitHub pull requests. https://github.com/bors-ng/bors-ng, 2017. GitHub repository
2017
-
[13]
A brief overview of Agda -- a functional language with dependent types
Ana Bove, Peter Dybjer, and Ulf Norell. A brief overview of Agda -- a functional language with dependent types. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 73--78. Springer, 2009. doi:10.1007/978-3-642-03359-9_6
-
[14]
Loogle: Lean search engine, 2023
Joachim Breitner. Loogle: Lean search engine, 2023. https://loogle.lean-lang.org
2023
-
[15]
Fel's conjecture on syzygies of numerical semigroups
Evan Chen, Chris Cummins, Dejan Grubisic, Leopold Haller, Letong Hong, Andranik Kurghinyan, Kenny Lau, Hugh Leather, Seewoo Lee, Aram Markosyan, et al. Fel's conjecture on syzygies of numerical semigroups. arXiv preprint arXiv:2602.03716, 2026
-
[16]
Seed-Prover : Deep and broad reasoning for automated theorem proving
Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, et al. Seed-Prover : Deep and broad reasoning for automated theorem proving. arXiv preprint arXiv:2507.23726, 2025
-
[17]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
The Lean 4 theorem prover and programming language
Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In International Conference on Automated Deduction (CADE), 2021
2021
-
[19]
DeepSeek-AI . DeepSeek-Prover-V2 : Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
lean-lsp-mcp: MCP server for the Lean language server, 2025
Oliver Dressler. lean-lsp-mcp: MCP server for the Lean language server, 2025. https://github.com/oOo0oOo/lean-lsp-mcp
2025
-
[21]
Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis H...
-
[22]
Abel: Sample efficient online reinforcement learning for neural theorem proving
Fabian Gloeckle, Jannis Limperg, Gabriel Synnaeve, and Amaury Hayat. Abel: Sample efficient online reinforcement learning for neural theorem proving. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24, 2024
2024
-
[23]
Automatic Textbook Formalization
Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, and Amaury Hayat. Automatic textbook formalization. arXiv preprint arXiv:2604.03071, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
Gemini 3.1 pro model card
Google DeepMind . Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026
2026
-
[25]
Adam Grabowski, Artur Korni owicz, and Adam Naumowicz. Mizar in a nutshell. Journal of Formalized Reasoning, 3 0 (2): 0 153--245, 2010. doi:10.6092/issn.1972-5787/1980
-
[26]
Progress in Formalizing Sphere Packing in Dimension 8
Sidharth Hariharan, Christopher Birkbeck, Seewoo Lee, Ho Kiu Gareth Ma, Bhavik Mehta, Auguste Poiroux, and Maryna Viazovska. A milestone in formalization: The sphere packing problem in dimension 8. arXiv preprint, 2026. https://arxiv.org/abs/2604.23468
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[27]
John Harrison. HOL Light : An overview. In Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, Theorem Proving in Higher Order Logics, volume 5674 of Lecture Notes in Computer Science, pages 60--66. Springer, 2009. doi:10.1007/978-3-642-03359-9_4
-
[28]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS), 2021
2021
-
[29]
MetaGPT : Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, et al. MetaGPT : Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024
2024
-
[30]
Olympiad-level formal mathematical reasoning with reinforcement learning
Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z. Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 651: 0 607--613, 2025. doi:10.1038/s41586-025-09833-y
-
[31]
AI for Mathematics: Progress, Challenges, and Prospects
Haocheng Ju and Bin Dong. AI for mathematics: Progress, challenges, and prospects. arXiv preprint arXiv:2601.13209, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
Hypertree proof search for neural theorem proving
Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving. Advances in neural information processing systems, 35: 0 26337--26349, 2022
2022
-
[33]
The Network Structure of Mathlib
Xinze Li, Nanyun Peng, Simone Severini, and Patrick Shafto. The network structure of M athlib. arXiv preprint arXiv:2604.24797, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Goedel-Prover : A frontier model for open-source automated theorem proving
Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover : A frontier model for open-source automated theorem proving. arXiv preprint arXiv:2502.07640, 2025
-
[35]
Liu et al.Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics
Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-lean-agent: An open and general agentic reasoning system for formal mathematics. arXiv preprint arXiv:2601.14027, 2026
-
[36]
Strong pnt
Math Inc. Strong pnt. https://math-inc.github.io/strongpnt/, 2026. AI-generated Lean formalization of the strong Prime Number Theorem
2026
-
[37]
Megill and David A
Norman D. Megill and David A. Wheeler. Metamath: A Computer Language for Mathematical Proofs. Lulu Press, Morrisville, North Carolina, 2019. ISBN 978-0-359-70223-7. https://us.metamath.org/downloads/metamath.pdf
2019
-
[38]
Lean REPL , 2024
Scott Morrison. Lean REPL , 2024. https://github.com/leanprover-community/repl
2024
-
[39]
Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002. ISBN 978-3-540-43376-7. doi:10.1007/3-540-45949-9
-
[40]
GPT-5.4 thinking system card
OpenAI . GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026
2026
-
[41]
Generative Language Modeling for Automated Theorem Proving
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[42]
Formal mathematics statement curriculum learning
Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022
-
[43]
The Mathlib Community . The Lean mathematical library. arXiv preprint arXiv:1910.09336, 2020
-
[44]
Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition
George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems, 37: 0 11545--11569, 2024
2024
-
[45]
130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026
Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?, 2026
2026
-
[46]
Mathlib conventions
Vasily Vilin. Mathlib conventions. https://github.com/Vilin97/mathlib-conventions, 2026. Community-maintained collection of Mathlib coding conventions
2026
-
[47]
Haiming Wang, Huajian Xin, Chuanyang Zheng, Li Lin, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jianwei Yin, Zhenguo Li, and Heng Liao. LEGO-Prover : Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2024
-
[48]
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, et al. Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
M2F : Automated formalization of mathematical literature at scale, 2026
Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, and Zaiwen Wen. M2F : Automated formalization of mathematical literature at scale, 2026
2026
-
[50]
Jiang, Wenda Li, Markus N
Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[51]
DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data
Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover : Advancing theorem proving in LLMs through large-scale synthetic data. In International Conference on Machine Learning (ICML), 2024
2024
-
[52]
Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE -agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[53]
Lean workbook: A large-scale Lean problem set formalized from natural language math problems
Huaiyuan Ying, Zijian Shi, Zhengying He, Bohan Gao, Chenyang Chen, Zhicheng Yu, Songrun Song, Qicheng Fan, Yinya Li, Lin Li, et al. Lean workbook: A large-scale Lean problem set formalized from natural language math problems. In Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[54]
mini F2F : A cross-system benchmark for formal O lympiad-level mathematics
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. mini F2F : A cross-system benchmark for formal O lympiad-level mathematics. In International Conference on Learning Representations (ICLR), 2022
2022
-
[55]
Xing, Hao Zhang, Joseph E
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM -as-a-judge with MT-Bench and Chatbot Arena . In Advances in Neural Information Processing Systems (NeurIPS), 2023
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.