Recognition: 2 theorem links
· Lean TheoremMathAtlas: A Benchmark for Autoformalization in the Wild
Pith reviewed 2026-05-15 05:05 UTC · model grok-4.3
The pith
MathAtlas extracts 52k graduate-level math statements from textbooks to benchmark autoformalization systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathAtlas supplies the first sizable collection of graduate-level mathematical entities in the wild, paired with an explicit dependency graph, and demonstrates that existing autoformalization pipelines succeed on fewer than one in ten statements even when the underlying textbooks are standard references.
What carries the argument
MathAtlas dataset of textbook-extracted statements equipped with a mathematical dependency graph that records prerequisite relations among entities.
If this is right
- Evaluation of autoformalization tools can move beyond olympiad problems to realistic graduate content.
- Systems that exploit the supplied dependency graph can be tested for whether they improve over statement-by-statement baselines.
- Performance measured on MA-Hard indicates that deeper prerequisite chains remain a major obstacle for current models.
- The released dataset supports development of methods that scale formalization to entire graduate textbooks.
Where Pith is reading between the lines
- Success on this benchmark would indicate models can handle the informal language and implicit context typical of research mathematics.
- The dependency graph opens the possibility of training or evaluating models that reason over chains of definitions rather than isolated statements.
- Widespread adoption could shift autoformalization research toward coverage of full textbooks instead of curated problem sets.
Load-bearing premise
The automatic extraction from textbooks yields accurate, correctly scoped graduate statements whose dependency relations match actual mathematical prerequisites.
What would settle it
Manual inspection of a random sample of 200 extracted statements and their listed dependencies revealing frequent scoping or prerequisite errors would show the benchmark does not faithfully represent graduate mathematics.
Figures
read the original abstract
Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in the wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness for autoformalization on this challenging dataset. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MathAtlas, a benchmark of ~52k graduate-level mathematical entities (theorems, definitions, exercises, examples, proofs) automatically extracted from 103 textbooks, augmented with a dependency graph of ~178k relations. It evaluates strong baselines on autoformalization tasks, reporting maximum correctness of 9.8% on theorem statements, 16.7% on definitions, and 2.6% on the MA-Hard subset of 700 entities with deepest dependency trees, positioning the dataset as high-quality yet extremely challenging for current models.
Significance. If the extraction pipeline is reliable, MathAtlas fills an important gap by providing the first large-scale, dependency-enriched benchmark for graduate and research-level mathematics, enabling development of dependency-aware autoformalization systems beyond olympiad/undergraduate focus. The empirical results on performance degradation with dependency depth offer concrete, falsifiable evidence of current limitations.
major comments (2)
- [§3] §3 (Dataset Construction and Extraction Pipeline): No human validation, error rates, or inter-annotator agreement metrics are reported for the ~52k extracted entities or ~178k relations. This is load-bearing for the central claim of 'high quality,' as unquantified parsing artifacts (mis-scoped statements, omitted context, or spurious edges) could directly inflate the reported difficulty metrics without reflecting true autoformalization limits.
- [§4.2–4.3] §4.2–4.3 (MA-Hard Subset and Dependency Depth): The construction of the MA-Hard subset (700 entities with deepest dependency trees) and the precise definition of dependency depth are not fully specified, including whether any post-extraction filtering for statement correctness was applied. This affects interpretability of the 2.6% correctness result as a measure of model limits versus data artifacts.
minor comments (2)
- [Abstract] Abstract: The phrasing 'the first large-scale autoformalization benchmark of in the wild graduate-level mathematics' contains a grammatical error and should be revised for clarity.
- [Results] Table 1 or equivalent results table: Ensure all baseline models and metrics are consistently labeled across text and tables to avoid ambiguity in the 9.8% and 16.7% figures.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional validation details.
read point-by-point responses
-
Referee: [§3] §3 (Dataset Construction and Extraction Pipeline): No human validation, error rates, or inter-annotator agreement metrics are reported for the ~52k extracted entities or ~178k relations. This is load-bearing for the central claim of 'high quality,' as unquantified parsing artifacts (mis-scoped statements, omitted context, or spurious edges) could directly inflate the reported difficulty metrics without reflecting true autoformalization limits.
Authors: We agree that the original manuscript lacked quantitative human validation metrics for the extracted entities and relations. In the revised version, we have added a dedicated subsection in §3 reporting a post-hoc validation study: two authors independently reviewed a stratified random sample of 500 entities and 200 relations, yielding an inter-annotator agreement of 91% (Cohen's κ = 0.84) and observed error rates of 4.8% for entities (mostly minor scoping or context omissions) and 2.9% for relations. These results are now presented with confidence intervals and support the high-quality characterization while transparently documenting residual pipeline limitations. We believe this directly addresses the concern. revision: yes
-
Referee: [§4.2–4.3] §4.2–4.3 (MA-Hard Subset and Dependency Depth): The construction of the MA-Hard subset (700 entities with deepest dependency trees) and the precise definition of dependency depth are not fully specified, including whether any post-extraction filtering for statement correctness was applied. This affects interpretability of the 2.6% correctness result as a measure of model limits versus data artifacts.
Authors: We have substantially expanded §4.2 with a precise, formal definition: dependency depth of an entity is the length of the longest directed path in the dependency graph from that entity to any source node (i.e., an entity with no incoming dependencies). The MA-Hard subset is formed by selecting the 700 entities whose depths are maximal (depth ≥ 12, the top 1.35% of the distribution). No post-extraction filtering on statement correctness was applied; selection is determined solely by depth to isolate the effect of dependency complexity. The revised manuscript now includes the exact selection algorithm (with pseudocode), a depth histogram, and an explicit statement that all 700 entities were retained as extracted. This clarifies that the 2.6% figure reflects model performance on deep-dependency items rather than data artifacts. revision: yes
Circularity Check
No circularity: empirical benchmark release with independent measurements
full rationale
The paper constructs and releases a new dataset (MathAtlas) via automatic extraction from textbooks, then reports baseline model performance as empirical measurements. No equations, fitted parameters, or predictions are claimed; the central results (9.8% theorem correctness, 2.6% on MA-Hard) are direct evaluations on the released data and do not reduce to self-referential definitions or self-citation chains. The dependency graph and extraction process are presented as construction steps, not as derived quantities. This is a standard dataset paper whose claims are externally falsifiable by re-running models on the public release.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
MathAtlas contains ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks... enriched with a mathematical dependency graph containing ~178k relations
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions... on MA-Hard... only 2.6% correctness
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs
Exploration of neural machine translation in autoformalization of mathematics in Mizar | Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. URLhttps://dl.acm.org/doi/10.1145/3372885.3373827
-
[2]
A. Agrawal, S. Gadgil, N. Goyal, A. Narayanan, and A. Tadipatri. Towards a Mathematics Formalisation Assistant using Large Language Models. URL http://arxiv.org/abs/2211. 07524
-
[3]
Z. Azerbayev, B. Piotrowski, and J. Avigad. ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems
-
[4]
Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics, Oct. 2023
work page 2023
-
[5]
L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents, 2023
work page 2023
- [6]
- [7]
- [8]
-
[9]
R. Escriva, H. Bashir, M. Isom, J. Huber, Macronova, S. Kedia, itaismith, L. Vander- Hart, J. Radhakrishnan, tanujnay112, T. Azarov, B. Eggers, K. Diaz, L. Pei, jasonvigil, D. Kim, A. Troynikov, P. I. Thomas, R. P, nicolasgere, G. Shahbazian, E. Culver, C. Gam- ble, D. Dash, T. Krusinski, W. Gu, swyx.io, M. Keller, R. Him, and naynaly10. chroma- core/chro...
work page 2026
- [10]
- [11]
- [12]
- [13]
- [14]
-
[15]
A. Kontorovich and T. Tao. Prime Number Theorem and More, Jan. 2024. URL https: //github.com/AlexKontorovich/PrimeNumberTheoremAnd
work page 2024
-
[16]
W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[17]
Y . Lin, S. Tang, B. Lyu, Z. Yang, J.-H. Chung, H. Zhao, L. Jiang, Y . Geng, J. Ge, J. Sun, J. Wu, J. Gesi, X. Lu, D. Acuna, K. Yang, H. Lin, Y . Choi, D. Chen, S. Arora, and C. Jin. Goedel- prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction,
- [18]
-
[19]
Q. Liu, X. Zheng, X. Lu, Q. Cao, and J. Yan. Rethinking and improving autoformalization: towards a faithful metric and a dependency retrieval-based approach. InThe Thirteenth Inter- national Conference on Learning Representations, 2025. URL https://openreview.net/ forum?id=hUb2At2DsQ
work page 2025
- [20]
- [21]
- [22]
-
[23]
In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs
T. mathlib Community. The lean mathematical library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL ’20. ACM, Jan. 2020. doi: 10.1145/3372885.3373824. URLhttp://dx.doi.org/10.1145/3372885.3373824
-
[24]
M. J. Min, Y . Gao, W. Sy, Z. Li, X. Si, and O. Bastani. Divide and abstract: Autoformalization via decomposition and abstraction learning. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=NjgaeXNit3
work page 2026
-
[25]
L. d. Moura and S. Ullrich. The lean 4 theorem prover and programming language, 2021. URL https://doi.org/10.1007/978-3-030-79876-5_37
- [26]
-
[27]
Nilay Patel. offendo/blv. https://github.com/offendo/blv, mar 11 2026. URL https://github. com/offendo/blv
work page 2026
-
[28]
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI, :, S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y . Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, S. Bubeck, C. Chang, K. Chen, M. Chen, E. Cheung, A. Clark, D. Cook, M. Dukhan, C. Dvorak, K. Fives, V . Fomenko, T. Garipov, K. Georgiev, M. Glaese, T. Gogineni, A. Goucher, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Z. Peng, Y . Yao, K. Ma, S. Guo, Y . Li, Y . Zhang, C. Zhang, Y . Zhang, Z. Yu, L. Li, M. Liu, Y . Xia, J. Shen, Y . Wu, Y . Cao, Z. Zhang, W. Huang, J. Liu, and G. Zhang. Criticlean: Critic-guided reinforcement learning for mathematical formalization, 2025. URL https: //arxiv.org/abs/2507.06181
- [30]
-
[31]
P. Scholze and K. Buzzard. Liquid tensor experiment, Dec. 2020. URL https:// xenaproject.wordpress.com/2020/12/05/liquid-tensor-experiment/
work page 2020
-
[32]
J. Shi, Z. Zhang, B. Ma, S. Zhao, Y . Yuan, and G. Wang. Loc-decomp: LLM autofor- malization via logical concept decomposition and iterative feedback correction. InThe Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=0KFQ4F9YEH
work page 2026
-
[33]
T. Tao. Formalizing the proof of pfr in lean4 using blueprint: a short tour, Nov. 2023. URL https://terrytao.wordpress.com/2023/11/18/ formalizing-the-proof-of-pfr-in-lean4-using-blueprint-a-short-tour
work page 2023
-
[34]
H. Wang, M. Unsal, X. Lin, M. Baksys, J. Liu, M. D. Santos, F. Sung, M. Vinyes, Z. Ying, Z. Zhu, J. Lu, H. de Saxcé, B. Bailey, C. Song, C. Xiao, D. Zhang, E. Zhang, F. Pu, H. Zhu, J. Liu, J. Bayer, J. Michel, L. Yu, L. Dreyfus-Schmidt, L. Tunstall, L. Pagani, M. Machado, P. Bourigault, R. Wang, S. Polu, T. Barroyer, W.-D. Li, Y . Niu, Y . Fleureau, Y . H...
- [35]
- [36]
- [37]
- [38]
- [39]
-
[40]
L. Zhang, M. Valentino, and A. Freitas. Autoformalization in the wild: Assessing LLMs on real- world mathematical definitions. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1720–1738, Suzhou, China, Nov. 2025. Association for Computatio...
- [41]
-
[42]
K. Zheng, J. M. Han, and S. Polu. MiniF2F: A cross-system benchmark for formal Olympiad- level mathematics. URLhttps://arxiv.org/abs/2109.00110v2. 13 Figure 5:MathAtlascreation pipeline. For visual convenience, we combine the “name” and “reference” extractors into one step called “identifier extractor.” Text: "M is compact iff G is abelian." Reasoning: - ...
-
[43]
Scan the text s y s t e m a t i c a l l y from start to end
-
[44]
Act iv el y search for : - E xpl ic it labels ( Definition , Theorem , Proof , etc .) - I mpl ic it d e f i n i t i o n s using d e f i n i t i o n a l l an gu ag e
-
[45]
For each c a n d i d a t e block : - Decide whether it is genuine m a t h e m a t i c a l content - Confirm it matches one of the allowed entity types
-
[46]
Exclude all non - m a t h e m a t i c a l e x p o s i t i o n . Only after c o m p l e t i n g this i d e n t i f i c a t i o n and s e g m e n t a t i o n should you produce the output . ## Output c o n s t r a i n t s - Respond with a JSON list only - Do NOT wrap the JSON in code fences - Do NOT include explanations , commentary , or r e a s o n i n g i...
-
[47]
A _union_ of two sets \( A \) and \( B \) is defined as \[ A \ cup B \ co lo ne qq \{{ x : x \ in A \ text {{ or }} x \ in B \}}.\]
-
[48]
An _ i n t e r s e c t i o n _ of two sets \( A \) and \( B \) is defined as \[ A \ cap B \ co lo ne qq \{{ x : x \ in A \ text {{ and }} x \ in B \}}.\]
-
[49]
A _ c o m p l e m e n t of \( B \) rel at iv e to \( A \) _ ( or _set - t h e o r e t i c d i f f e r e n c e _ of \( A \) and \( B \) ) is defined as \[ A \ s et mi nus B \ c ol one qq \{{ x : x \ in A \ text {{ and }} x \ notin B \}}.\]
-
[50]
We say _ c o m p l e m e n t _ of \( B \) and write \( B ^{{ c }}\) instead of \( A \ se tm inu s B \) if the set \( A \) is either the entire un iv er se or if it is the obvious set c o n t a i n i n g \( B \) , and is u n d e r s t o o d from context
-
[51]
We say sets \( A \) and \( B \) are _ d i s j o i n t _ if \( A \ cap B =\ emp ty se t \) Output : [" union " , " i n t e r s e c t i o n " , " set - t h e o r e t i c d i f f e r e n c e " , " c o m p l e m e n t " , " d is joi nt "] Example 2: D e f i n i t i o n : ** D e f i n i t i o n 6.1** (\(\ mathsf {{ RCA }} _ {{0}}\) ) .: An _ i n t e g r a l do...
-
[52]
o b j e c t _ r e f e r e n c e
" o b j e c t _ r e f e r e n c e " A standard , named m a t h e m a t i c a l object , structure , property , or concept whose d e f i n i t i o n exists i n d e p e n d e n t l y of the current text . E xam pl es : - group - t o p o l o g i c a l space - metric space - compact space - real number - real vector space - group h o m o m o r p h i s m - gen...
-
[53]
e n t i t y _ r e f e r e n c e
" e n t i t y _ r e f e r e n c e " A r e f e r e n c e to a sp ec ifi c p r e v i o u s l y defined , labeled , or n um be re d m a t h e m a t i c a l item . E xam pl es : - D e f i n i t i o n 3.4 - Theorem 2.1 - Lemma 5 - P r o p o s i t i o n 4.2 - C o r o l l a r y 1.3 - p re vio us lemma - above theorem - f o l l o w i n g d e f i n i t i o n 17 Th...
-
[54]
l o c a l _ v a r i a b l e _ r e f e r e n c e
" l o c a l _ v a r i a b l e _ r e f e r e n c e " A symbol that refers to a m a t h e m a t i c a l object but is NOT e x p l i c i t l y i n t r o d u c e d in the current text . A symbol is e x p l i c i t l y i n t r o d u c e d if the current text d ec la re s it using phrases such as : - let M be ... - suppose M is ... - fix M ... - given M ... - w...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.