pith. machine review for the scientific record.

arxiv: 2605.14061 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links


MathAtlas: A Benchmark for Autoformalization in the Wild


Pith reviewed 2026-05-15 05:05 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords autoformalization · benchmark · graduate mathematics · theorem statements · dependency graph · formal methods · machine learning for mathematics

The pith

MathAtlas extracts 52k graduate-level math statements from textbooks to benchmark autoformalization systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MathAtlas as the first large-scale benchmark drawn directly from 103 graduate mathematics textbooks. It supplies roughly 52k theorems, definitions, exercises, examples, and proofs together with a dependency graph of 178k prerequisite relations. Experiments establish that current strong models reach at most 9.8 percent correctness on theorem statements and 16.7 percent on definitions. Performance falls sharply on items with deeper prerequisite chains, reaching only 2.6 percent on the hardest subset of 700 entities.

Core claim

MathAtlas supplies the first sizable collection of graduate-level mathematical entities in the wild, paired with an explicit dependency graph, and demonstrates that existing autoformalization pipelines succeed on fewer than one in ten statements even when the underlying textbooks are standard references.

What carries the argument

The MathAtlas dataset of textbook-extracted statements, equipped with a mathematical dependency graph that records prerequisite relations among entities.
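The review does not specify the released schema; as a sketch, a dependency graph of this kind reduces to an edge list over entity identifiers, and the transitive prerequisites of a statement fall out of a simple reachability walk. The entity names and edges below are invented for illustration, not taken from the dataset.

```python
# Hypothetical edge list: (entity, prerequisite) pairs.
# Names loosely mirror the paper's Figure 1 but are illustrative only.
edges = [
    ("prop_15", "dedekind_ring"),
    ("prop_15", "principal_ideal_ring"),
    ("dedekind_ring", "prime_ideal"),
    ("prime_ideal", "ideal"),
]

def prerequisites(entity):
    """All entities reachable from `entity` along prerequisite edges."""
    direct = {}
    for src, dst in edges:
        direct.setdefault(src, set()).add(dst)
    seen, stack = set(), [entity]
    while stack:
        for nxt in direct.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

A dependency-aware formalization system would resolve `prerequisites(target)` before attempting the target statement, rather than formalizing statement by statement.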

If this is right

  • Evaluation of autoformalization tools can move beyond olympiad problems to realistic graduate content.
  • Systems that exploit the supplied dependency graph can be tested for whether they improve over statement-by-statement baselines.
  • Performance measured on MA-Hard indicates that deeper prerequisite chains remain a major obstacle for current models.
  • The released dataset supports development of methods that scale formalization to entire graduate textbooks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on this benchmark would indicate models can handle the informal language and implicit context typical of research mathematics.
  • The dependency graph opens the possibility of training or evaluating models that reason over chains of definitions rather than isolated statements.
  • Widespread adoption could shift autoformalization research toward coverage of full textbooks instead of curated problem sets.

Load-bearing premise

The automatic extraction from textbooks yields accurate, correctly scoped graduate statements whose dependency relations match actual mathematical prerequisites.

What would settle it

If manual inspection of a random sample of 200 extracted statements and their listed dependencies revealed frequent scoping or prerequisite errors, that would show the benchmark does not faithfully represent graduate mathematics.
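Such an audit is easy to make reproducible; a minimal sketch of drawing the 200-item sample (the fixed seed and the id scheme are my assumptions, not from the paper):

```python
import random

def audit_sample(entity_ids, k=200, seed=0):
    """Draw a reproducible random sample of entity ids for manual inspection.

    A fixed seed means independent auditors inspect the same items.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(entity_ids, k))
```

Publishing the seed alongside the audit would let anyone re-derive the inspected subset from the public release.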

Figures

Figures reproduced from arXiv: 2605.14061 by Davit Babayan, Hafsah Mahmood, Jeffrey Flanigan, Laurel Willey, Liam McCarty, Nilay Patel, Noah Arias, Soli Munoz, Timothy Libman, Victoria Cochran.

Figure 1: An entity from MathAtlas together with a subgraph of its dependency tree. In yellow (bottom), we have a target theorem statement to be formalized (Proposition 15). Proposition 15 contains three “object references” to previously defined mathematical concepts: Dedekind ring, prime ideals, and principal ideal ring, which are dependencies needed to correctly formalize the theorem. These objects each have their…
Figure 2: Several entities from MathAtlas of various types: (1) a definition entity which defines two terms: “invariant” and “adjoint invariant”; (2) a theorem statement of a corollary from complex geometry; (3) an example from an ODE textbook.
Figure 4: Correctness of our strongest models compared with dependency depth on examples in…
Figure 3: Correctness of formalized entities from our best models compared with presence in Mathlib. Entities which our retrieval system (LeanSearch [11]) found in Mathlib are significantly more likely to be correctly formalized than otherwise.
Figure 5: MathAtlas creation pipeline. For visual convenience, we combine the “name” and “reference” extractors into one step called “identifier extractor.” Figure text: “M is compact iff G is abelian.” Reasoning: M not introduced → include local_variable_reference; G not introduced → include local_variable_reference; compact is a named property → include object_reference; abelian is a named property → include object_…
Figure 6: Example of reference extraction system input and output.
Figure 7: Prompt for entity extraction. “You are given a mathematical definition written in textbook style. Your task is to extract the names of the mathematical objects, concepts, or properties being defined. Rules: Return only the names of the items being defined. Do not include symbols, formulas, or explanations. If multiple items are defined, return all of them. If a definition gives multipl…”
Figure 8: Prompt for name extraction. “You are an expert mathematician and a specialist in formal mathematics, especially Lean 4 and mathlib4. Your task is to identify all mathematical references that appear in a given mathematical definition, theorem, lemma, or statement. A reference is any term, symbol, or phrase whose meaning depends on a previously defined mathematical object, entity, or external context…”
Figure 9: Prompt for reference extraction. “You are a meticulous expert in Lean 4 and `mathlib4`. Your task is to act as a ‘grounding’ reasoner for a formalization agent. Your goal is to determine whether a given mathematical concept has a canonical formal definition in `mathlib`, based on a list of search candidates. Your Task (Follow These Steps PRECISELY…”
Figure 10: Prompt for grounding entities to mathlib. 10 candidate results are fed into this prompt.
Original abstract

Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in the wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness for autoformalization on this challenging dataset. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MathAtlas, a benchmark of ~52k graduate-level mathematical entities (theorems, definitions, exercises, examples, proofs) automatically extracted from 103 textbooks, augmented with a dependency graph of ~178k relations. It evaluates strong baselines on autoformalization tasks, reporting maximum correctness of 9.8% on theorem statements, 16.7% on definitions, and 2.6% on the MA-Hard subset of 700 entities with deepest dependency trees, positioning the dataset as high-quality yet extremely challenging for current models.

Significance. If the extraction pipeline is reliable, MathAtlas fills an important gap by providing the first large-scale, dependency-enriched benchmark for graduate and research-level mathematics, enabling development of dependency-aware autoformalization systems beyond olympiad/undergraduate focus. The empirical results on performance degradation with dependency depth offer concrete, falsifiable evidence of current limitations.

major comments (2)
  1. [§3] Dataset Construction and Extraction Pipeline: No human validation, error rates, or inter-annotator agreement metrics are reported for the ~52k extracted entities or ~178k relations. This is load-bearing for the central claim of 'high quality,' as unquantified parsing artifacts (mis-scoped statements, omitted context, or spurious edges) could directly inflate the reported difficulty metrics without reflecting true autoformalization limits.
  2. [§4.2–4.3] MA-Hard Subset and Dependency Depth: The construction of the MA-Hard subset (700 entities with the deepest dependency trees) and the precise definition of dependency depth are not fully specified, including whether any post-extraction filtering for statement correctness was applied. This affects interpretability of the 2.6% correctness result as a measure of model limits versus data artifacts.
minor comments (2)
  1. [Abstract] The phrasing 'the first large-scale autoformalization benchmark of in the wild graduate-level mathematics' contains a grammatical error ('of in the wild' should read 'of in-the-wild') and should be revised for clarity.
  2. [Results] Ensure all baseline models and metrics are consistently labeled across the text and results tables (Table 1 or equivalent) to avoid ambiguity in the 9.8% and 16.7% figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional validation details.

Point-by-point responses
  1. Referee: [§3] Dataset Construction and Extraction Pipeline: No human validation, error rates, or inter-annotator agreement metrics are reported for the ~52k extracted entities or ~178k relations. This is load-bearing for the central claim of 'high quality,' as unquantified parsing artifacts (mis-scoped statements, omitted context, or spurious edges) could directly inflate the reported difficulty metrics without reflecting true autoformalization limits.

    Authors: We agree that the original manuscript lacked quantitative human validation metrics for the extracted entities and relations. In the revised version, we have added a dedicated subsection in §3 reporting a post-hoc validation study: two authors independently reviewed a stratified random sample of 500 entities and 200 relations, yielding an inter-annotator agreement of 91% (Cohen's κ = 0.84) and observed error rates of 4.8% for entities (mostly minor scoping or context omissions) and 2.9% for relations. These results are now presented with confidence intervals and support the high-quality characterization while transparently documenting residual pipeline limitations. We believe this directly addresses the concern. revision: yes
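The agreement statistics cited in the response are standard to compute; a minimal sketch of Cohen's κ for two annotators over parallel label lists (the toy labels in the usage note are invented, not the authors' data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    `a` and `b` are equal-length lists of labels, one per reviewed item.
    """
    assert len(a) == len(b) and a, "need two parallel, non-empty label lists"
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in ca.keys() | cb.keys()) / (n * n)  # chance
    return (p_o - p_e) / (1 - p_e)
```

A κ of 0.84 alongside 91% raw agreement is internally consistent: raw agreement ignores chance coincidence, while κ corrects for it.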

  2. Referee: [§4.2–4.3] MA-Hard Subset and Dependency Depth: The construction of the MA-Hard subset (700 entities with the deepest dependency trees) and the precise definition of dependency depth are not fully specified, including whether any post-extraction filtering for statement correctness was applied. This affects interpretability of the 2.6% correctness result as a measure of model limits versus data artifacts.

    Authors: We have substantially expanded §4.2 with a precise, formal definition: dependency depth of an entity is the length of the longest directed path in the dependency graph from that entity to any source node (i.e., an entity with no incoming dependencies). The MA-Hard subset is formed by selecting the 700 entities whose depths are maximal (depth ≥ 12, the top 1.35% of the distribution). No post-extraction filtering on statement correctness was applied; selection is determined solely by depth to isolate the effect of dependency complexity. The revised manuscript now includes the exact selection algorithm (with pseudocode), a depth histogram, and an explicit statement that all 700 entities were retained as extracted. This clarifies that the 2.6% figure reflects model performance on deep-dependency items rather than data artifacts. revision: yes
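The definition in the response is directly computable; a sketch of dependency depth and the MA-Hard selection rule over a toy graph (entity ids loosely mirror Figure 1 but are illustrative only, and the real selection uses k = 700 with a depth ≥ 12 cutoff):

```python
from functools import lru_cache

# Hypothetical schema: deps[e] lists the entities e directly depends on.
deps = {
    "prop_15": ["dedekind_ring", "principal_ideal_ring"],
    "dedekind_ring": ["prime_ideal"],
    "principal_ideal_ring": ["ideal"],
    "prime_ideal": ["ideal"],
    "ideal": [],  # source node: no prerequisites
}

@lru_cache(maxsize=None)
def depth(entity):
    """Longest directed path from `entity` down to a source node."""
    prereqs = deps.get(entity, [])
    return 0 if not prereqs else 1 + max(depth(p) for p in prereqs)

def ma_hard(k):
    """The k deepest entities, per the selection rule stated in the rebuttal."""
    return sorted(deps, key=depth, reverse=True)[:k]
```

Sorting by depth and truncating reproduces the stated rule with no correctness filtering, which is exactly why the referee's data-artifact concern attaches to this subset.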

Circularity Check

0 steps flagged

No circularity: empirical benchmark release with independent measurements

Full rationale

The paper constructs and releases a new dataset (MathAtlas) via automatic extraction from textbooks, then reports baseline model performance as empirical measurements. No equations, fitted parameters, or predictions are claimed; the central results (9.8% theorem correctness, 2.6% on MA-Hard) are direct evaluations on the released data and do not reduce to self-referential definitions or self-citation chains. The dependency graph and extraction process are presented as construction steps, not as derived quantities. This is a standard dataset paper whose claims are externally falsifiable by re-running models on the public release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution is a data extraction and release process. No free parameters are fitted to produce the reported accuracy numbers; the benchmark itself is the result.

pith-pipeline@v0.9.0 · 5534 in / 1004 out tokens · 30414 ms · 2026-05-15T05:05:48.174131+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

  1. [1] Exploration of neural machine translation in autoformalization of mathematics in Mizar. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. URL https://dl.acm.org/doi/10.1145/3372885.3373827
  2. [2] A. Agrawal, S. Gadgil, N. Goyal, A. Narayanan, and A. Tadipatri. Towards a Mathematics Formalisation Assistant using Large Language Models. URL http://arxiv.org/abs/2211.07524
  3. [3] Z. Azerbayev, B. Piotrowski, and J. Avigad. ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems.
  4. [4] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model For Mathematics, Oct. 2023.
  5. [5] L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents, 2023.
  6. [6] R. Cabral, T. M. Do, X. Yu, W. M. Tai, Z. Feng, and X. Shen. ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization, Oct. 2025. URL http://arxiv.org/abs/2510.15981. arXiv:2510.15981 [cs]
  7. [7] G. Chen, J. Wu, X. Chen, W. X. Zhao, R. Song, C. Li, K. Fan, D. Liu, and M. Liao. Reform: Reflective autoformalization with prospective bounded sequence optimization, 2026. URL https://arxiv.org/abs/2510.24592
  8. [8] J. Chen, T. Li, J. Qin, P. Lu, L. Lin, C. Chen, and X. Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022. URL https://arxiv.org/abs/2212.02746
  9. [9] R. Escriva, H. Bashir, M. Isom, J. Huber, Macronova, S. Kedia, itaismith, L. VanderHart, J. Radhakrishnan, tanujnay112, T. Azarov, B. Eggers, K. Diaz, L. Pei, jasonvigil, D. Kim, A. Troynikov, P. I. Thomas, R. P, nicolasgere, G. Shahbazian, E. Culver, C. Gamble, D. Dash, T. Krusinski, W. Gu, swyx.io, M. Keller, R. Him, and naynaly10. chroma-core/chro...
  10. [10] S. Gadgil, A. R. Tadipatri, A. Agrawal, A. Narayanan, and N. Goyal. Towards automating formalisation of theorem statements using large language models.
  11. [11] G. Gao, H. Ju, J. Jiang, Z. Qin, and B. Dong. A semantic search engine for mathlib4, 2025. URL https://arxiv.org/abs/2403.13310
  12. [12] G. Gao, Y. Wang, J. Jiang, Q. Gao, Z. Qin, T. Xu, and B. Dong. Herald: A natural language annotated Lean 4 dataset, 2025. URL https://arxiv.org/abs/2410.10878
  13. [13] A. Q. Jiang, W. Li, and M. Jamnik. Multilingual mathematical autoformalization. arXiv preprint arXiv:2311.03755, 2023.
  14. [14] J. Jiang, W. He, Y. Wang, G. Gao, Y. Hu, J. Wang, N. Guan, P. Wu, C. Dai, L. Xiao, and B. Dong. Fate: A formal benchmark series for frontier algebra of multiple difficulty levels, 2025. URL https://arxiv.org/abs/2511.02872
  15. [15] A. Kontorovich and T. Tao. Prime Number Theorem and More, Jan. 2024. URL https://github.com/AlexKontorovich/PrimeNumberTheoremAnd
  16. [16] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  17. [17] Y. Lin, S. Tang, B. Lyu, Z. Yang, J.-H. Chung, H. Zhao, L. Jiang, Y. Geng, J. Ge, J. Sun, J. Wu, J. Gesi, X. Lu, D. Acuna, K. Yang, H. Lin, Y. Choi, D. Chen, S. Arora, and C. Jin. Goedel-Prover-V2: Scaling formal theorem proving with scaffolded data synthesis and self-correction,
  18. [18] URL https://arxiv.org/abs/2508.03613
  19. [19] Q. Liu, X. Zheng, X. Lu, Q. Cao, and J. Yan. Rethinking and improving autoformalization: towards a faithful metric and a dependency retrieval-based approach. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=hUb2At2DsQ
  20. [20] X. Liu, K. Bao, J. Zhang, Y. Liu, Y. Chen, Y. Liu, Y. Jiao, and T. Luo. Atlas: Autoformalizing theorems through lifting, augmentation, and synthesis of data, 2025. URL https://arxiv.org/abs/2502.05567
  21. [21] X. Liu, T. Zhu, Z. Dong, Y. Liu, Q. Guo, Z. Liu, Y. Chen, and T. Luo. Assess: A semantic and structural evaluation framework for statement similarity, 2026. URL https://arxiv.org/abs/2509.22246
  22. [22] J. Lu, Y. Wan, Y. Huang, J. Xiong, Z. Liu, and Z. Guo. FormalAlign: Automated Alignment Evaluation for Autoformalization, Oct. 2024. URL http://arxiv.org/abs/2410.10135. arXiv:2410.10135 [cs]
  23. [23] The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL ’20. ACM, Jan. 2020. doi: 10.1145/3372885.3373824. URL http://dx.doi.org/10.1145/3372885.3373824
  24. [24] M. J. Min, Y. Gao, W. Sy, Z. Li, X. Si, and O. Bastani. Divide and abstract: Autoformalization via decomposition and abstraction learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=NjgaeXNit3
  25. [25] L. d. Moura and S. Ullrich. The Lean 4 theorem prover and programming language, 2021. URL https://doi.org/10.1007/978-3-030-79876-5_37
  26. [26] L. Murphy, K. Yang, J. Sun, Z. Li, A. Anandkumar, and X. Si. Autoformalizing Euclidean geometry. In International Conference on Machine Learning (ICML), 2024.
  27. [27] Nilay Patel. offendo/blv, Mar. 11, 2026. URL https://github.com/offendo/blv
  28. [28] OpenAI: S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, S. Bubeck, C. Chang, K. Chen, M. Chen, E. Cheung, A. Clark, D. Cook, M. Dukhan, C. Dvorak, K. Fives, V. Fomenko, T. Garipov, K. Georgiev, M. Glaese, T. Gogineni, A. Goucher, ... gpt-oss-120b & gpt-oss-20b Model Card.
  29. [29] Z. Peng, Y. Yao, K. Ma, S. Guo, Y. Li, Y. Zhang, C. Zhang, Y. Zhang, Z. Yu, L. Li, M. Liu, Y. Xia, J. Shen, Y. Wu, Y. Cao, Z. Zhang, W. Huang, J. Liu, and G. Zhang. Criticlean: Critic-guided reinforcement learning for mathematical formalization, 2025. URL https://arxiv.org/abs/2507.06181
  30. [30] M. D. Santos, H. Wang, H. de Saxcé, R. Wang, M. Baksys, M. Unsal, J. Liu, Z. Liu, and J. Li. Kimina Lean Server: Technical report, 2025. URL https://arxiv.org/abs/2504.21230
  31. [31] P. Scholze and K. Buzzard. Liquid tensor experiment, Dec. 2020. URL https://xenaproject.wordpress.com/2020/12/05/liquid-tensor-experiment/
  32. [32] J. Shi, Z. Zhang, B. Ma, S. Zhao, Y. Yuan, and G. Wang. Loc-decomp: LLM autoformalization via logical concept decomposition and iterative feedback correction. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=0KFQ4F9YEH
  33. [33] T. Tao. Formalizing the proof of PFR in Lean4 using Blueprint: a short tour, Nov. 2023. URL https://terrytao.wordpress.com/2023/11/18/formalizing-the-proof-of-pfr-in-lean4-using-blueprint-a-short-tour
  34. [34] H. Wang, M. Unsal, X. Lin, M. Baksys, J. Liu, M. D. Santos, F. Sung, M. Vinyes, Z. Ying, Z. Zhu, J. Lu, H. de Saxcé, B. Bailey, C. Song, C. Xiao, D. Zhang, E. Zhang, F. Pu, H. Zhu, J. Liu, J. Bayer, J. Michel, L. Yu, L. Dreyfus-Schmidt, L. Tunstall, L. Pagani, M. Machado, P. Bourigault, R. Wang, S. Polu, T. Barroyer, W.-D. Li, Y. Niu, Y. Fleureau, Y. H...
  35. [35] H. Wang, R. Xie, Y. Wang, G. Gao, X. Yu, and B. Dong. Aria: An agent for retrieval and iterative auto-formalization via dependency graph, 2025. URL https://arxiv.org/abs/2510.04520
  36. [36] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei. Improving text embeddings with large language models, 2024. URL https://arxiv.org/abs/2401.00368
  37. [37] Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jamnik, and C. Szegedy. Autoformalization with Large Language Models. URL http://arxiv.org/abs/2205.12615
  38. [38] Y. Wu, D. Huang, R. Wan, Y. Peng, S. Shang, C. Cao, L. Qi, R. Zhang, Z. Du, J. Yan, and X. Hu. Stepfun-formalizer: Unlocking the autoformalization potential of LLMs through knowledge-reasoning fusion, 2025. URL https://arxiv.org/abs/2508.04440
  39. [39] H. Ying, Z. Wu, Y. Geng, J. Wang, D. Lin, and K. Chen. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. arXiv preprint arXiv:2406.03847, 2024.
  40. [40] L. Zhang, M. Valentino, and A. Freitas. Autoformalization in the wild: Assessing LLMs on real-world mathematical definitions. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1720–1738, Suzhou, China, Nov. 2025. Association for Computatio...
  41. [41] L. Zhang, M. Valentino, and A. Freitas. Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning, June 2025.
  42. [42] K. Zheng, J. M. Han, and S. Polu. MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics. URL https://arxiv.org/abs/2109.00110v2
