arxiv: 2604.03071 · v1 · submitted 2026-04-03 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Automatic Textbook Formalization

Fabian Gloeckle , Ahmad Rammal , Charles Arnal , Remi Munos , Vivien Cabannes , Gabriel Synnaeve , Amaury Hayat

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:03 UTC · model grok-4.3

classification 💻 cs.AI

keywords formalizationLeanalgebraic combinatoricsmulti-agent systemsautomated theorem provingtextbook formalizationAI agentsformal mathematics

0 comments

The pith

An AI system has automatically formalized a full 500-page graduate algebraic combinatorics textbook into Lean code in one week.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports a case study in which thousands of AI agents worked together to translate an entire graduate textbook on algebraic combinatorics into the Lean proof assistant. The output is a standalone library of 130,000 lines containing 5,900 declarations, produced with only minimal human oversight after the initial setup. This effort advances beyond earlier small-scale formalizations of undergraduate material to a complete graduate text completed in days rather than years. The authors note that the inference cost is comparable to what a team of human experts would require, and they release the code, the Lean library, and a side-by-side comparison site for public use.

Core claim

A multi-agent AI system successfully produced a complete, standalone formalization of a graduate-level algebraic combinatorics textbook exceeding 500 pages, resulting in 130K lines of Lean code and 5,900 declarations. The work was carried out in one week by 30K agents collaborating in parallel on a shared version-controlled codebase, establishing a new scale for automatic textbook formalization with only minimal ongoing human input.

What carries the argument

A parallel multi-agent system in which 30,000 instances of an AI model collaborate on a shared version-controlled codebase to generate, verify, and integrate Lean formalizations of textbook content.

If this is right

Graduate-level mathematics textbooks can be turned into machine-checked libraries on a timescale of days.
The monetary cost of such formalizations can match or fall below the salary cost of equivalent human expert teams.
Further efficiencies in the agent process are expected even without improvements to the underlying model.
The resulting open-source Lean library and comparison website enable community inspection and extension of the formalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parallel-agent method could be tested on textbooks from other areas of mathematics to check how domain-specific the success is.
If the generated libraries prove reliable, they could serve as starting points for connecting previously separate formal developments in Lean.
Rapid generation of large verified texts raises the question of how to maintain and update them when new results appear in the source material.

Load-bearing premise

The multi-agent AI process produces output that is correct, complete, and faithful to the textbook content with only minimal human oversight after initial setup.

What would settle it

A human mathematician auditing the Lean library and identifying a substantial number of theorems whose formal statements or proofs deviate from or omit key content in the original textbook.

read the original abstract

We present a case study where an automatic AI system formalizes a textbook with more than 500 pages of graduate-level algebraic combinatorics to Lean. The resulting formalization represents a new milestone in textbook formalization scale and proficiency, moving from early results in undergraduate topology and restructuring of existing library content to a full standalone formalization of a graduate textbook. The formalization comprises 130K lines of code and 5900 Lean declarations and was conducted within one week by a total of 30K Claude 4.5 Opus agents collaborating in parallel on a shared code base via version control, simultaneously setting a record in multi-agent software engineering with usable results. The inference cost matches or undercuts what we estimate as the salaries required for a team of human experts, and we expect there is still the potential for large efficiencies to be made without the need for better models. We make our code, the resulting Lean code base and a side-by-side blueprint website available open-source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They got a multi-agent setup to turn a full 500-page grad combinatorics textbook into 130k lines of Lean in a week, which is new scale, but the amount of human cleanup is not quantified.

read the letter

The headline result is a working formalization of an entire graduate textbook on algebraic combinatorics, done mostly by 30k Claude agents in parallel over a week. That moves the practical frontier past the undergrad examples and library-repackaging work that came before. They also open-sourced the Lean code, the agent setup, and a side-by-side blueprint site, which lets others check the claims directly. Cost is reported as comparable to what a human team would charge, and the numbers (5900 declarations) are concrete enough to be useful benchmarks for the field right now. What the paper does well is show that current models plus version-control coordination can produce something large and usable without starting from scratch each time. The soft spot is the missing audit on human intervention. The abstract and stress-test note both leave open how many lines were edited by hand, how often agents needed guidance on definitions or sorry statements, and whether the final code was spot-checked against the textbook statements. Without those counts, the “automatic” and “minimal oversight” claims stay hard to evaluate. The soundness therefore rests on the authors’ internal review plus whatever the community finds once the repo is public. This paper is for people already working on AI-assisted formalization or large-scale Lean projects. Readers who want to see what is feasible today with off-the-shelf models will get value from the released artifacts and the scale numbers. It is not a theoretical advance, but the engineering milestone is real enough that a serious editor should send it to referees who can look at the code and ask for the missing intervention logs. I would recommend peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper presents a case study of an automatic multi-agent AI system that uses 30,000 Claude 4.5 Opus agents collaborating via version control to formalize a graduate-level algebraic combinatorics textbook exceeding 500 pages into Lean. The resulting artifact comprises 130K lines of code and 5900 declarations, completed in one week, with open-sourcing of the code base, Lean formalization, and a side-by-side blueprint website; the authors position this as a new milestone in textbook formalization scale and multi-agent software engineering, with inference costs competitive with human expert salaries.

Significance. If the formalization is largely automatic, faithful, and complete, the work marks a substantial advance in automated formal mathematics by scaling from prior undergraduate or library-restructuring efforts to a full standalone graduate textbook. The open-sourcing of the 130K-line Lean base and blueprint website, together with the demonstrated parallel multi-agent workflow, supplies concrete artifacts and a replicable engineering pattern that could accelerate future large-scale formalization projects.

major comments (2)

[Abstract and results section] Abstract and results section: the central claim that the formalization was conducted 'with only minimal human oversight after initial setup' lacks any quantitative audit of human intervention volume (e.g., number of human commits, lines edited post-generation, or frequency of 'sorry' resolutions). This metric is load-bearing for the 'automatic' descriptor and the cost-comparison argument.
[Results and evaluation] Results and evaluation: no quantitative breakdown is supplied of coverage (fraction of textbook statements formalized), error rates in generated code, or independent verification that the 5900 Lean declarations faithfully reproduce the textbook content. Soundness therefore rests entirely on the authors' internal assessment rather than reported measurements.

minor comments (2)

[Methods] The description of the shared codebase and version-control workflow would benefit from additional detail on how merge conflicts and parallel edits were managed across 30K agents.
Consider adding a summary table of key metrics (agent count, LOC, declarations, wall-clock time, estimated cost) for quick reference.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their positive assessment of our work and for the constructive comments that help clarify the presentation of our results. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and results section] Abstract and results section: the central claim that the formalization was conducted 'with only minimal human oversight after initial setup' lacks any quantitative audit of human intervention volume (e.g., number of human commits, lines edited post-generation, or frequency of 'sorry' resolutions). This metric is load-bearing for the 'automatic' descriptor and the cost-comparison argument.

Authors: We agree that a quantitative audit of human intervention would provide stronger evidence for the automatic character of the process. The multi-agent system was explicitly designed to operate with only initial setup and minimal subsequent oversight, relying on version control for agent collaboration. While detailed per-intervention logs were not collected in advance, our records show that human commits formed a small minority of total activity and that agents resolved the large majority of 'sorry' statements autonomously. We will add a new paragraph to the Results section with these available metrics and a description of the oversight process to support the claim. revision: yes
Referee: [Results and evaluation] Results and evaluation: no quantitative breakdown is supplied of coverage (fraction of textbook statements formalized), error rates in generated code, or independent verification that the 5900 Lean declarations faithfully reproduce the textbook content. Soundness therefore rests entirely on the authors' internal assessment rather than reported measurements.

Authors: We appreciate the call for more explicit evaluation metrics. The 5900 declarations provide complete coverage of the textbook's definitions, theorems, and proofs, and the final Lean code base compiles successfully with no remaining 'sorry' placeholders, indicating an effective error rate of zero in the delivered artifact. We will revise the evaluation section to state these points explicitly and include a high-level mapping of chapters to formalized content. Independent verification by unaffiliated parties was not performed in this case study; we will note this limitation and point to the open-sourced repository as enabling future community review. revision: partial

standing simulated objections not resolved

Independent verification of faithfulness to the textbook content by parties external to the project

Circularity Check

0 steps flagged

No circularity: empirical case study with no derivation chain

full rationale

The paper reports an engineering outcome: a multi-agent AI system produced a 130K-line Lean formalization of a 500-page graduate textbook in one week. No mathematical derivations, predictions of fitted quantities, or self-referential definitions are present. Claims rest on reported metrics, open-sourced code, and cost comparisons rather than any chain that reduces to its own inputs by construction. No load-bearing self-citations or ansatzes are invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The effort depends on the soundness of the Lean theorem prover and the capabilities of the underlying language model; no new mathematical axioms or entities are introduced.

axioms (1)

standard math The Lean theorem prover correctly validates the formalized statements.
All claims of successful formalization rest on Lean's soundness.

pith-pipeline@v0.9.0 · 5477 in / 1179 out tokens · 44958 ms · 2026-05-13T20:03:48.034396+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The formalization comprises 130K lines of code and 5900 Lean declarations and was conducted within one week by a total of 30K Claude 4.5 Opus agents collaborating in parallel on a shared code base via version control
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery theorem unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We use git for version control and adopt a trunk-based development model where agents work on short-lived feature branches

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

[1]

https://arxiv.org/abs/2510.01346. Lars Becker, María Inés de Frutos-Fernández, Leo Diedering, Floris van Doorn, Sébastien Gouëzel, Asgar Jamneshan, Evgenia Karunus, Edward van de Meent, Pietro Monticone, Jasper Mulder-Sohn, Jim Portegies, Joris Roos, Michael Rothgang, Rajula Srivastava, James Sundstrom, Jeremy Tan, and Christoph Thiele. A blueprint for th...

work page internal anchor Pith review arXiv 2025
[2]

The equational theories project: Advancing collaborative mathematical research at scale.arXiv preprint arXiv:2512.07087,

Matthew Bolan, Joachim Breitner, Jose Brox, Nicholas Carlini, Mario Carneiro, Floris van Doorn, Martin Dvorak, Andrés Goens, Aaron Hill, Harald Husum, et al. The equational theories project: Advancing collaborative mathematical research at scale.arXiv preprint arXiv:2512.07087,

work page arXiv
[3]

The future of mathematics? Talk slides, Pittsburgh, January 2020.https://www.andrew.cmu.edu/ user/avigad/meetings/fomm2020/slides/fomm_buzzard.pdf

Kevin Buzzard. The future of mathematics? Talk slides, Pittsburgh, January 2020.https://www.andrew.cmu.edu/ user/avigad/meetings/fomm2020/slides/fomm_buzzard.pdf. Slides from a talk given in Pittsburgh (Jan 2020). Nicholas Carlini. Building a c compiler with a team of parallel claudes, February 2026.https://www.anthropic.com/ engineering/building-c-compil...

work page 2020
[4]

Fel’s conjecture on syzygies of numerical semigroups, 2026.https://arxiv.org/abs/2602.03716

Evan Chen, Chris Cummins, GSM, Dejan Grubisic, Leopold Haller, Letong Hong, Andranik Kurghinyan, Kenny Lau, Hugh Leather, Seewoo Lee, Aram Markosyan, Ken Ono, Manooshree Patel, Gaurang Pendharkar, Vedant Rathi, Alex Schneidman, Volker Seeker, Shubho Sengupta, Ishan Sinha, Jimmy Xin, and Jujian Zhang. Fel’s conjecture on syzygies of numerical semigroups, 2...

work page arXiv 2026
[5]

Ben Cottier, Ben Snodin, David Owen, and Tom Adamczewski

https://arxiv.org/abs/2309.14870. Ben Cottier, Ben Snodin, David Owen, and Tom Adamczewski. Llm inference prices have fallen rapidly but unequally across tasks, 2025.https://epoch.ai/data-insights/llm-inference-price-trends. Accessed: 2026-03-02. Kefan Dong and Tengyu Ma. Stp: Self-play llm theorem provers with iterative conjecturing and proving,

work page arXiv 2025
[6]

Ege Erdil and Tamay Besiroglu

https://arxiv.org/abs/2502.00212. Ege Erdil and Tamay Besiroglu. Algorithmic progress in computer vision, 2023.https://arxiv.org/abs/2212.05153. David Gerard. Cursor lies about vibe-coding a web browser with ai, January 2026.https://pivot-to-ai.com/2026/01/ 27/cursor-lies-about-vibe-coding-a-web-browser-with-ai/. Blog post. Georges Gonthier. A computer-ch...

work page arXiv 2023
[7]

doi: 10.1007/978-3-642-39634-2\_14.https://inria.hal.science/hal-00816699

Springer. doi: 10.1007/978-3-642-39634-2\_14.https://inria.hal.science/hal-00816699. Darij Grinberg. An introduction to algebraic combinatorics.arXiv preprint arXiv:2506.00738,

work page doi:10.1007/978-3-642-39634-2
[8]

The price of progress: Algorithmic efficiency and the falling cost of AI inference,

Hans Gundlach, Jayson Lynch, Matthias Mertens, and Neil Thompson. The price of progress: Algorithmic efficiency and the falling cost of ai inference, 2025.https://arxiv.org/abs/2511.23455. Emmanuel Gunther, Miguel Pagano, Pedro Sánchez Terraf, and Matías Steinberg. The independence of the continuum hypothesis in isabelle/zf.Archive of Formal Proofs, March

work page arXiv 2025
[9]

Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z

doi: 10.1017/fmp.2017.1. Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, Paul Lezeau, Salvatore Mercuri, Calle Sönne, B...

work page doi:10.1017/fmp.2017.1 2017
[10]

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

https://arxiv.org/abs/2210.12283. Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline,

work page arXiv
[11]

Hongjian Li

https://github.com/KellerJordan/modded-nanogpt. Hongjian Li. Extension of the fundamental theorem of algebra to polynomial matrix equations overq-circulant matrices. arXiv preprint arXiv:2601.13775,

work page arXiv
[12]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Wilson Lin. Scaling long-running autonomous coding, January 2026.https://cursor.com/blog/scaling-agents. Blog post. Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint...

work page arXiv 2026
[13]

Numina-lean-agent: An open and general agentic reasoning system for formal mathematics.arXiv preprint arXiv:2601.14027,

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-lean-agent: An open and general agentic reasoning system for formal mathematics. arXiv preprint arXiv:2601.14027,

work page arXiv
[14]

At which training stage does code data help llms reasoning?, 2023.https://arxiv.org/abs/2309.16298

Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. At which training stage does code data help llms reasoning?, 2023.https://arxiv.org/abs/2309.16298. Math, Inc. Introducing Gauss, an agent for autoformalization.https://www.math.inc/gauss, September

work page arXiv 2023
[15]

Math, Inc

Accessed: 2026-03-12. Math, Inc. Completing the formal proof of higher-dimensional sphere packing, March 2026.https://www.math.inc/ sphere-packing. Announcing the formalization of Maryna Viazovska’s E8 and Leech lattice proofs using the Gauss autoformalization agent. 16 The mathlib Community. The lean mathematical library. InProceedings of the 9th ACM SIG...

work page 2026
[16]

mathlib Community, T

doi: 10.1145/3372885. 3373824.http://dx.doi.org/10.1145/3372885.3373824. Kim Morrison and the leanprover community. repl: A read-eval-print-loop for Lean

work page doi:10.1145/3372885
[17]

ISBN 978-0-13- 468951-7. Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposi...

work page internal anchor Pith review arXiv 2025
[18]

Watershed moment for ai–human collaboration in math.IEEE Spectrum, March 2026.https: //spectrum.ieee.org/ai-proof-verification

Benjamin Skuse. Watershed moment for ai–human collaboration in math.IEEE Spectrum, March 2026.https: //spectrum.ieee.org/ai-proof-verification. Josef Urban. 130k lines of formal topology in two weeks: Simple and cheap autoformalization for everyone?,

work page 2026
[19]

Floris van Doorn, Patrick Massot, and Oliver Nash

https://arxiv.org/abs/2601.03298. Floris van Doorn, Patrick Massot, and Oliver Nash. Formalising the h-principle and sphere eversion. InProceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2023, page 121–134, New York, NY, USA,

work page arXiv 2023
[20]

ISBN 9798400700262

Association for Computing Machinery. ISBN 9798400700262. doi: 10.1145/3573105.3575688. https://doi.org/10.1145/3573105.3575688. Steven J. Vaughan-Nichols. Ok, so anthropic’s AI built a C compiler. that don’t impress me much, February

work page doi:10.1145/3573105.3575688
[21]

Opinion piece

https://www.theregister.com/2026/02/13/anthropic_c_compiler/. Opinion piece. VladimirVoevodsky. Theoriginsandmotivationsofunivalentfoundations: Apersonalmissiontodevelopcomputerproof verification to avoid mathematical mistakes, 2014.https://www.ias.edu/ideas/2014/voevodsky-origins. Published inThe Institute Letter(Summer 2014), Institute for Advanced Stud...

work page 2026
[22]

Video: https://www.ias.edu/video/voevodsky14/. Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, ...

work page arXiv 2026
[23]

Deepseek-prover-v1

Huajian Xin, ZZ Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1. 5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search.arXiv preprint arXiv:2408.08152,

work page arXiv
[24]

Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms, 2025.https://arxiv.org/abs/2502.19411

17 Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, and Julian McAuley. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms, 2025.https://arxiv.org/abs/2502.19411. John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Li...

work page arXiv 2025
[25]

Unveiling the impact of coding data instruction fine-tuning on large language models reasoning, 2024.https: //arxiv.org/abs/2405.20535

Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, and Linda Ruth Petzold. Unveiling the impact of coding data instruction fine-tuning on large language models reasoning, 2024.https: //arxiv.org/abs/2405.20535. 18 Appendix A Estimating Token Caching Statistics Unfortunately, our logs only contain the total number of input ...

work page arXiv 2024