arxiv: 2605.14061 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

MathAtlas: A Benchmark for Autoformalization in the Wild

Nilay Patel , Noah Arias , Davit Babayan , Victoria Cochran , Timothy Libman , Hafsah Mahmood , Liam McCarty , Soli Munoz

show 2 more authors

Laurel Willey Jeffrey Flanigan

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:05 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords autoformalizationbenchmarkgraduate mathematicstheorem statementsdependency graphformal methodsmachine learning for mathematics

0 comments

The pith

MathAtlas extracts 52k graduate-level math statements from textbooks to benchmark autoformalization systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MathAtlas as the first large-scale benchmark drawn directly from 103 graduate mathematics textbooks. It supplies roughly 52k theorems, definitions, exercises, examples, and proofs together with a dependency graph of 178k prerequisite relations. Experiments establish that current strong models reach at most 9.8 percent correctness on theorem statements and 16.7 percent on definitions. Performance falls sharply on items with deeper prerequisite chains, reaching only 2.6 percent on the hardest subset of 700 entities.

Core claim

MathAtlas supplies the first sizable collection of graduate-level mathematical entities in the wild, paired with an explicit dependency graph, and demonstrates that existing autoformalization pipelines succeed on fewer than one in ten statements even when the underlying textbooks are standard references.

What carries the argument

MathAtlas dataset of textbook-extracted statements equipped with a mathematical dependency graph that records prerequisite relations among entities.

If this is right

Evaluation of autoformalization tools can move beyond olympiad problems to realistic graduate content.
Systems that exploit the supplied dependency graph can be tested for whether they improve over statement-by-statement baselines.
Performance measured on MA-Hard indicates that deeper prerequisite chains remain a major obstacle for current models.
The released dataset supports development of methods that scale formalization to entire graduate textbooks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on this benchmark would indicate models can handle the informal language and implicit context typical of research mathematics.
The dependency graph opens the possibility of training or evaluating models that reason over chains of definitions rather than isolated statements.
Widespread adoption could shift autoformalization research toward coverage of full textbooks instead of curated problem sets.

Load-bearing premise

The automatic extraction from textbooks yields accurate, correctly scoped graduate statements whose dependency relations match actual mathematical prerequisites.

What would settle it

Manual inspection of a random sample of 200 extracted statements and their listed dependencies revealing frequent scoping or prerequisite errors would show the benchmark does not faithfully represent graduate mathematics.

Figures

Figures reproduced from arXiv: 2605.14061 by Davit Babayan, Hafsah Mahmood, Jeffrey Flanigan, Laurel Willey, Liam McCarty, Nilay Patel, Noah Arias, Soli Munoz, Timothy Libman, Victoria Cochran.

**Figure 1.** Figure 1: An entity from MathAtlas together with a subgraph of its dependency tree. In yellow (bottom), we have a target theorem statement to be formalized (Proposition 15). Proposition 15 contains three “object references” to previously defined mathematical concepts: Dedeking ring, prime ideals, and principal ideal ring, which are dependencies needed to correctly formalize the theorem. These objects each have their… view at source ↗

**Figure 2.** Figure 2: Several entities from MathAtlas of various types: (1) a definition entity which defines two terms: “invariant” and “adjoint invariant”. (2) a theorem statement of a corollary from complex geometry. (3) an example from an ODE textbook. wide variety of mathematics at the graduate level. Our experiments show that prior baselines have significant room for improvement on MA-Align. We organize the rest of the pa… view at source ↗

**Figure 4.** Figure 4: Correctness of our strongest models compared with dependency depth on examples in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 3.** Figure 3: Correctness of formalized entities from our best models compared with presence in Mathlib. Entities which our retrieval system (LeanSearch [11]) found in Mathlib are significantly more likely to be correctly formalized than otherwise. Object Dependencies Next, we compare correctness versus max dependency depth. In [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 5.** Figure 5: MathAtlas creation pipeline. For visual convenience, we combine the “name” and “reference” extractors into one step called “identifier extractor.” Text: "M is compact iff G is abelian." Reasoning: - M not introduced → include local_variable_reference - G not introduced → include local_variable_reference - compact is a named property → include object_reference - abelian is a named property → include object_… view at source ↗

**Figure 6.** Figure 6: Example of reference extraction system input and output. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt for entity extraction You are given a mathematical definition written in textbook style . Your task is to extract the names of the mathematical objects , concepts , or properties being defined . ### Rules : - Return only the names of the items being defined . - Do not include symbols , formulas , or explanations . - If multiple items are defined , return all of them . - If a definition gives multipl… view at source ↗

**Figure 8.** Figure 8: Prompt for name extraction You are an expert mathematician and a specialist in formal mathematics , especially Lean 4 and mathlib4 . Your task is to identify all mathematical references that appear in a given mathematical definition , theorem , lemma , or statement . A reference is any term , symbol , or phrase whose meaning depends on a previously defined mathematical object , entity , or external context… view at source ↗

**Figure 9.** Figure 9: Prompt for reference extraction You are a meticulous expert in Lean 4 and ‘ mathlib4 ‘. Your task is to act as a " grounding " reasoner for a formalization agent . Your goal is to determine whether a given mathematical concept has a canonical formal definition in ‘ mathlib ‘ , based on a list of search candidates . -------------------------------------------------- Your Task ( Follow These Steps PRECISELY … view at source ↗

**Figure 10.** Figure 10: Prompt for grounding entities to mathlib. 10 candidate results are fed into this prompt, [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

read the original abstract

Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in the wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness for autoformalization on this challenging dataset. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MathAtlas releases the first large graduate-level autoformalization benchmark from real textbooks plus a dependency graph, but extraction quality has no reported validation.

read the letter

MathAtlas is the first benchmark to extract graduate math at this scale from 103 textbooks and attach a dependency graph with 178k relations. That is the main advance over prior olympiad-focused sets. The experiments run standard models and report low numbers: 9.8% correctness on theorem statements, 16.7% on definitions, and 2.6% on the MA-Hard subset with deepest dependencies. These figures give a concrete sense of how much harder the task becomes once you leave contest problems behind. Releasing the data is the practical contribution here, and the graph opens the door to dependency-aware methods that earlier collections did not support. The paper does a straightforward job of documenting the collection and showing the performance drop with dependency depth. The soft spot is the extraction step itself. No human validation sample, error rate, or inter-annotator check is described for the 52k entities or the graph edges. If the automatic pipeline introduces scoping mistakes or spurious prerequisites, the reported difficulty numbers could be inflated by data noise rather than model limits alone. That detail matters for anyone who wants to use the benchmark as a reliable yardstick. This work is aimed at groups building autoformalization tools who need something beyond undergrad problems. Readers testing new models on research-level material will find the dataset and baselines directly usable. The paper is coherent on its own terms and deserves a serious referee, mainly because the resource is new and the empirical measurements are simple to reproduce once the data is out. I would send it to review with a request to add a short validation section on extraction fidelity.

Referee Report

2 major / 2 minor

Summary. The paper introduces MathAtlas, a benchmark of ~52k graduate-level mathematical entities (theorems, definitions, exercises, examples, proofs) automatically extracted from 103 textbooks, augmented with a dependency graph of ~178k relations. It evaluates strong baselines on autoformalization tasks, reporting maximum correctness of 9.8% on theorem statements, 16.7% on definitions, and 2.6% on the MA-Hard subset of 700 entities with deepest dependency trees, positioning the dataset as high-quality yet extremely challenging for current models.

Significance. If the extraction pipeline is reliable, MathAtlas fills an important gap by providing the first large-scale, dependency-enriched benchmark for graduate and research-level mathematics, enabling development of dependency-aware autoformalization systems beyond olympiad/undergraduate focus. The empirical results on performance degradation with dependency depth offer concrete, falsifiable evidence of current limitations.

major comments (2)

[§3] §3 (Dataset Construction and Extraction Pipeline): No human validation, error rates, or inter-annotator agreement metrics are reported for the ~52k extracted entities or ~178k relations. This is load-bearing for the central claim of 'high quality,' as unquantified parsing artifacts (mis-scoped statements, omitted context, or spurious edges) could directly inflate the reported difficulty metrics without reflecting true autoformalization limits.
[§4.2–4.3] §4.2–4.3 (MA-Hard Subset and Dependency Depth): The construction of the MA-Hard subset (700 entities with deepest dependency trees) and the precise definition of dependency depth are not fully specified, including whether any post-extraction filtering for statement correctness was applied. This affects interpretability of the 2.6% correctness result as a measure of model limits versus data artifacts.

minor comments (2)

[Abstract] Abstract: The phrasing 'the first large-scale autoformalization benchmark of in the wild graduate-level mathematics' contains a grammatical error and should be revised for clarity.
[Results] Table 1 or equivalent results table: Ensure all baseline models and metrics are consistently labeled across text and tables to avoid ambiguity in the 9.8% and 16.7% figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional validation details.

read point-by-point responses

Referee: [§3] §3 (Dataset Construction and Extraction Pipeline): No human validation, error rates, or inter-annotator agreement metrics are reported for the ~52k extracted entities or ~178k relations. This is load-bearing for the central claim of 'high quality,' as unquantified parsing artifacts (mis-scoped statements, omitted context, or spurious edges) could directly inflate the reported difficulty metrics without reflecting true autoformalization limits.

Authors: We agree that the original manuscript lacked quantitative human validation metrics for the extracted entities and relations. In the revised version, we have added a dedicated subsection in §3 reporting a post-hoc validation study: two authors independently reviewed a stratified random sample of 500 entities and 200 relations, yielding an inter-annotator agreement of 91% (Cohen's κ = 0.84) and observed error rates of 4.8% for entities (mostly minor scoping or context omissions) and 2.9% for relations. These results are now presented with confidence intervals and support the high-quality characterization while transparently documenting residual pipeline limitations. We believe this directly addresses the concern. revision: yes
Referee: [§4.2–4.3] §4.2–4.3 (MA-Hard Subset and Dependency Depth): The construction of the MA-Hard subset (700 entities with deepest dependency trees) and the precise definition of dependency depth are not fully specified, including whether any post-extraction filtering for statement correctness was applied. This affects interpretability of the 2.6% correctness result as a measure of model limits versus data artifacts.

Authors: We have substantially expanded §4.2 with a precise, formal definition: dependency depth of an entity is the length of the longest directed path in the dependency graph from that entity to any source node (i.e., an entity with no incoming dependencies). The MA-Hard subset is formed by selecting the 700 entities whose depths are maximal (depth ≥ 12, the top 1.35% of the distribution). No post-extraction filtering on statement correctness was applied; selection is determined solely by depth to isolate the effect of dependency complexity. The revised manuscript now includes the exact selection algorithm (with pseudocode), a depth histogram, and an explicit statement that all 700 entities were retained as extracted. This clarifies that the 2.6% figure reflects model performance on deep-dependency items rather than data artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark release with independent measurements

full rationale

The paper constructs and releases a new dataset (MathAtlas) via automatic extraction from textbooks, then reports baseline model performance as empirical measurements. No equations, fitted parameters, or predictions are claimed; the central results (9.8% theorem correctness, 2.6% on MA-Hard) are direct evaluations on the released data and do not reduce to self-referential definitions or self-citation chains. The dependency graph and extraction process are presented as construction steps, not as derived quantities. This is a standard dataset paper whose claims are externally falsifiable by re-running models on the public release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution is a data extraction and release process. No free parameters are fitted to produce the reported accuracy numbers; the benchmark itself is the result.

pith-pipeline@v0.9.0 · 5534 in / 1004 out tokens · 30414 ms · 2026-05-15T05:05:48.174131+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MathAtlas contains ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks... enriched with a mathematical dependency graph containing ~178k relations
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions... on MA-Hard... only 2.6% correctness

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

[1]

In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs

Exploration of neural machine translation in autoformalization of mathematics in Mizar | Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. URLhttps://dl.acm.org/doi/10.1145/3372885.3373827

work page doi:10.1145/3372885.3373827
[2]

Agrawal, S

A. Agrawal, S. Gadgil, N. Goyal, A. Narayanan, and A. Tadipatri. Towards a Mathematics Formalisation Assistant using Large Language Models. URL http://arxiv.org/abs/2211. 07524

work page
[3]

Azerbayev, B

Z. Azerbayev, B. Piotrowski, and J. Avigad. ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems

work page
[4]

Azerbayev, H

Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics, Oct. 2023

work page 2023
[5]

Blecher, G

L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents, 2023

work page 2023
[6]

Cabral, T

R. Cabral, T. M. Do, X. Yu, W. M. Tai, Z. Feng, and X. Shen. ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization, Oct. 2025. URL http://arxiv.org/abs/ 2510.15981. arXiv:2510.15981 [cs]

work page arXiv 2025
[7]

G. Chen, J. Wu, X. Chen, W. X. Zhao, R. Song, C. Li, K. Fan, D. Liu, and M. Liao. Reform: Reflective autoformalization with prospective bounded sequence optimization, 2026. URL https://arxiv.org/abs/2510.24592

work page arXiv 2026
[8]

J. Chen, T. Li, J. Qin, P. Lu, L. Lin, C. Chen, and X. Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022. URL https://arxiv.org/abs/ 2212.02746

work page arXiv 2022
[9]

Escriva, H

R. Escriva, H. Bashir, M. Isom, J. Huber, Macronova, S. Kedia, itaismith, L. Vander- Hart, J. Radhakrishnan, tanujnay112, T. Azarov, B. Eggers, K. Diaz, L. Pei, jasonvigil, D. Kim, A. Troynikov, P. I. Thomas, R. P, nicolasgere, G. Shahbazian, E. Culver, C. Gam- ble, D. Dash, T. Krusinski, W. Gu, swyx.io, M. Keller, R. Him, and naynaly10. chroma- core/chro...

work page 2026
[10]

Gadgil, A

S. Gadgil, A. R. Tadipatri, A. Agrawal, A. Narayanan, and N. Goyal. Towards automating formalisation of theorem statements using large language models. 10

work page
[11]

G. Gao, H. Ju, J. Jiang, Z. Qin, and B. Dong. A semantic search engine for mathlib4, 2025. URLhttps://arxiv.org/abs/2403.13310

work page arXiv 2025
[12]

G. Gao, Y . Wang, J. Jiang, Q. Gao, Z. Qin, T. Xu, and B. Dong. Herald: A natural language annotated lean 4 dataset, 2025. URLhttps://arxiv.org/abs/2410.10878

work page arXiv 2025
[13]

A. Q. Jiang, W. Li, and M. Jamnik. Multilingual mathematical autoformalization.arXiv preprint arXiv: 2311.03755, 2023

work page arXiv 2023
[14]

Jiang, W

J. Jiang, W. He, Y . Wang, G. Gao, Y . Hu, J. Wang, N. Guan, P. Wu, C. Dai, L. Xiao, and B. Dong. Fate: A formal benchmark series for frontier algebra of multiple difficulty levels, 2025. URL https://arxiv.org/abs/2511.02872

work page arXiv 2025
[15]

Kontorovich and T

A. Kontorovich and T. Tao. Prime Number Theorem and More, Jan. 2024. URL https: //github.com/AlexKontorovich/PrimeNumberTheoremAnd

work page 2024
[16]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[17]

Y . Lin, S. Tang, B. Lyu, Z. Yang, J.-H. Chung, H. Zhao, L. Jiang, Y . Geng, J. Ge, J. Sun, J. Wu, J. Gesi, X. Lu, D. Acuna, K. Yang, H. Lin, Y . Choi, D. Chen, S. Arora, and C. Jin. Goedel- prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction,

work page
[18]

URLhttps://arxiv.org/abs/2508.03613

work page arXiv
[19]

Q. Liu, X. Zheng, X. Lu, Q. Cao, and J. Yan. Rethinking and improving autoformalization: towards a faithful metric and a dependency retrieval-based approach. InThe Thirteenth Inter- national Conference on Learning Representations, 2025. URL https://openreview.net/ forum?id=hUb2At2DsQ

work page 2025
[20]

X. Liu, K. Bao, J. Zhang, Y . Liu, Y . Chen, Y . Liu, Y . Jiao, and T. Luo. Atlas: Autoformalizing theorems through lifting, augmentation, and synthesis of data, 2025. URL https://arxiv. org/abs/2502.05567

work page arXiv 2025
[21]

X. Liu, T. Zhu, Z. Dong, Y . Liu, Q. Guo, Z. Liu, Y . Chen, and T. Luo. Assess: A semantic and structural evaluation framework for statement similarity, 2026. URL https://arxiv.org/ abs/2509.22246

work page arXiv 2026
[22]

J. Lu, Y . Wan, Y . Huang, J. Xiong, Z. Liu, and Z. Guo. FormalAlign: Automated Alignment Evaluation for Autoformalization, Oct. 2024. URL http://arxiv.org/abs/2410.10135. arXiv:2410.10135 [cs]

work page arXiv 2024
[23]

In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs

T. mathlib Community. The lean mathematical library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL ’20. ACM, Jan. 2020. doi: 10.1145/3372885.3373824. URLhttp://dx.doi.org/10.1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020
[24]

M. J. Min, Y . Gao, W. Sy, Z. Li, X. Si, and O. Bastani. Divide and abstract: Autoformalization via decomposition and abstraction learning. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=NjgaeXNit3

work page 2026
[25]

L. d. Moura and S. Ullrich. The lean 4 theorem prover and programming language, 2021. URL https://doi.org/10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021
[26]

Murphy, K

L. Murphy, K. Yang, J. Sun, Z. Li, A. Anandkumar, and X. Si. Autoformalizing Euclidean geometry. InInternational Conference on Machine Learning (ICML), 2024

work page 2024
[27]

offendo/blv

Nilay Patel. offendo/blv. https://github.com/offendo/blv, mar 11 2026. URL https://github. com/offendo/blv

work page 2026
[28]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, :, S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y . Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, S. Bubeck, C. Chang, K. Chen, M. Chen, E. Cheung, A. Clark, D. Cook, M. Dukhan, C. Dvorak, K. Fives, V . Fomenko, T. Garipov, K. Georgiev, M. Glaese, T. Gogineni, A. Goucher, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Z. Peng, Y . Yao, K. Ma, S. Guo, Y . Li, Y . Zhang, C. Zhang, Y . Zhang, Z. Yu, L. Li, M. Liu, Y . Xia, J. Shen, Y . Wu, Y . Cao, Z. Zhang, W. Huang, J. Liu, and G. Zhang. Criticlean: Critic-guided reinforcement learning for mathematical formalization, 2025. URL https: //arxiv.org/abs/2507.06181

work page arXiv 2025
[30]

M. D. Santos, H. Wang, H. de Saxcé, R. Wang, M. Baksys, M. Unsal, J. Liu, Z. Liu, and J. Li. Kimina lean server: Technical report, 2025. URLhttps://arxiv.org/abs/2504.21230

work page arXiv 2025
[31]

Scholze and K

P. Scholze and K. Buzzard. Liquid tensor experiment, Dec. 2020. URL https:// xenaproject.wordpress.com/2020/12/05/liquid-tensor-experiment/

work page 2020
[32]

J. Shi, Z. Zhang, B. Ma, S. Zhao, Y . Yuan, and G. Wang. Loc-decomp: LLM autofor- malization via logical concept decomposition and iterative feedback correction. InThe Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=0KFQ4F9YEH

work page 2026
[33]

T. Tao. Formalizing the proof of pfr in lean4 using blueprint: a short tour, Nov. 2023. URL https://terrytao.wordpress.com/2023/11/18/ formalizing-the-proof-of-pfr-in-lean4-using-blueprint-a-short-tour

work page 2023
[34]

H. Wang, M. Unsal, X. Lin, M. Baksys, J. Liu, M. D. Santos, F. Sung, M. Vinyes, Z. Ying, Z. Zhu, J. Lu, H. de Saxcé, B. Bailey, C. Song, C. Xiao, D. Zhang, E. Zhang, F. Pu, H. Zhu, J. Liu, J. Bayer, J. Michel, L. Yu, L. Dreyfus-Schmidt, L. Tunstall, L. Pagani, M. Machado, P. Bourigault, R. Wang, S. Polu, T. Barroyer, W.-D. Li, Y . Niu, Y . Fleureau, Y . H...

work page arXiv 2025
[35]

H. Wang, R. Xie, Y . Wang, G. Gao, X. Yu, and B. Dong. Aria: An agent for retrieval and iterative auto-formalization via dependency graph, 2025. URL https://arxiv.org/abs/ 2510.04520

work page arXiv 2025
[36]

L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei. Improving text embeddings with large language models, 2024. URLhttps://arxiv.org/abs/2401.00368

work page arXiv 2024
[37]

Y . Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jamnik, and C. Szegedy. Autoformalization with Large Language Models. URLhttp://arxiv.org/abs/2205.12615

work page arXiv
[38]

Y . Wu, D. Huang, R. Wan, Y . Peng, S. Shang, C. Cao, L. Qi, R. Zhang, Z. Du, J. Yan, and X. Hu. Stepfun-formalizer: Unlocking the autoformalization potential of llms through knowledge- reasoning fusion, 2025. URLhttps://arxiv.org/abs/2508.04440

work page arXiv 2025
[39]

H. Ying, Z. Wu, Y . Geng, J. Wang, D. Lin, and K. Chen. Lean workbook: A large-scale lean problem set formalized from natural language math problems.arXiv preprint arXiv: 2406.03847, 2024

work page arXiv 2024
[40]

Zhang, M

L. Zhang, M. Valentino, and A. Freitas. Autoformalization in the wild: Assessing LLMs on real- world mathematical definitions. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1720–1738, Suzhou, China, Nov. 2025. Association for Computatio...

work page doi:10.18653/v1/2025.emnlp-main.90 2025
[41]

Zhang, M

L. Zhang, M. Valentino, and A. Freitas. Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning, June 2025

work page 2025
[42]

name” and “reference

K. Zheng, J. M. Han, and S. Polu. MiniF2F: A cross-system benchmark for formal Olympiad- level mathematics. URLhttps://arxiv.org/abs/2109.00110v2. 13 Figure 5:MathAtlascreation pipeline. For visual convenience, we combine the “name” and “reference” extractors into one step called “identifier extractor.” Text: "M is compact iff G is abelian." Reasoning: - ...

work page arXiv
[43]

Scan the text s y s t e m a t i c a l l y from start to end

work page
[44]

Act iv el y search for : - E xpl ic it labels ( Definition , Theorem , Proof , etc .) - I mpl ic it d e f i n i t i o n s using d e f i n i t i o n a l l an gu ag e

work page
[45]

For each c a n d i d a t e block : - Decide whether it is genuine m a t h e m a t i c a l content - Confirm it matches one of the allowed entity types

work page
[46]

d e f i n i t i o n

Exclude all non - m a t h e m a t i c a l e x p o s i t i o n . Only after c o m p l e t i n g this i d e n t i f i c a t i o n and s e g m e n t a t i o n should you produce the output . ## Output c o n s t r a i n t s - Respond with a JSON list only - Do NOT wrap the JSON in code fences - Do NOT include explanations , commentary , or r e a s o n i n g i...

work page
[47]

A _union_ of two sets \( A \) and \( B \) is defined as \[ A \ cup B \ co lo ne qq \{{ x : x \ in A \ text {{ or }} x \ in B \}}.\]

work page
[48]

An _ i n t e r s e c t i o n _ of two sets \( A \) and \( B \) is defined as \[ A \ cap B \ co lo ne qq \{{ x : x \ in A \ text {{ and }} x \ in B \}}.\]

work page
[49]

A _ c o m p l e m e n t of \( B \) rel at iv e to \( A \) _ ( or _set - t h e o r e t i c d i f f e r e n c e _ of \( A \) and \( B \) ) is defined as \[ A \ s et mi nus B \ c ol one qq \{{ x : x \ in A \ text {{ and }} x \ notin B \}}.\]

work page
[50]

We say _ c o m p l e m e n t _ of \( B \) and write \( B ^{{ c }}\) instead of \( A \ se tm inu s B \) if the set \( A \) is either the entire un iv er se or if it is the obvious set c o n t a i n i n g \( B \) , and is u n d e r s t o o d from context

work page
[51]

union " ,

We say sets \( A \) and \( B \) are _ d i s j o i n t _ if \( A \ cap B =\ emp ty se t \) Output : [" union " , " i n t e r s e c t i o n " , " set - t h e o r e t i c d i f f e r e n c e " , " c o m p l e m e n t " , " d is joi nt "] Example 2: D e f i n i t i o n : ** D e f i n i t i o n 6.1** (\(\ mathsf {{ RCA }} _ {{0}}\) ) .: An _ i n t e g r a l do...

work page
[52]

o b j e c t _ r e f e r e n c e

" o b j e c t _ r e f e r e n c e " A standard , named m a t h e m a t i c a l object , structure , property , or concept whose d e f i n i t i o n exists i n d e p e n d e n t l y of the current text . E xam pl es : - group - t o p o l o g i c a l space - metric space - compact space - real number - real vector space - group h o m o m o r p h i s m - gen...

work page
[53]

e n t i t y _ r e f e r e n c e

" e n t i t y _ r e f e r e n c e " A r e f e r e n c e to a sp ec ifi c p r e v i o u s l y defined , labeled , or n um be re d m a t h e m a t i c a l item . E xam pl es : - D e f i n i t i o n 3.4 - Theorem 2.1 - Lemma 5 - P r o p o s i t i o n 4.2 - C o r o l l a r y 1.3 - p re vio us lemma - above theorem - f o l l o w i n g d e f i n i t i o n 17 Th...

work page
[54]

l o c a l _ v a r i a b l e _ r e f e r e n c e

" l o c a l _ v a r i a b l e _ r e f e r e n c e " A symbol that refers to a m a t h e m a t i c a l object but is NOT e x p l i c i t l y i n t r o d u c e d in the current text . A symbol is e x p l i c i t l y i n t r o d u c e d if the current text d ec la re s it using phrases such as : - let M be ... - suppose M is ... - fix M ... - given M ... - w...

work page