TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

Andrey Galichin; Elvir Karimov; Ivan Oseledets; QuocViet Pham

arxiv: 2606.09450 · v1 · pith:GCZX6QCAnew · submitted 2026-06-08 · 💻 cs.AI

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

QuocViet Pham , Elvir Karimov , Andrey Galichin , Ivan Oseledets This is my paper

Pith reviewed 2026-06-27 16:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords theorem provingLean4LLM evaluationformal mathematicsbenchmarksubtheoremsproof efficiencyclassical theorems

0 comments

The pith

Explicit premises raise success rates for Lean4 theorem provers on classical theorems

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds TheoremBench from nearly one hundred classical theorems to test LLMs on formal proving in Lean4 outside contest settings. Each theorem appears in two forms: a plain version with only the target statement, and a premised version that adds automatically extracted supporting subtheorems to expose internal proof dependencies. Experiments show that Lean4-capable models achieve substantially higher success when given these explicit premises. New metrics for coverage and token efficiency reveal that models still prefer easy subtheorems and generate long tactic traces instead of compact plans. The design therefore supplies a finer-grained picture of how models handle connected mathematical reasoning.

Core claim

TheoremBench supplies classical theorems in a plain version and a premised version built from automatically extracted supporting subtheorems. Lean4 prover models record markedly higher proof rates on the premised versions. The same models remain biased toward easy subtheorems and solve tasks via long, inefficient tactic traces rather than compact proof plans. Theorem-level coverage and token-efficiency metrics make these qualitative differences measurable.

What carries the argument

TheoremBench premised version, which expands each target theorem into a structured family of the main statement plus its automatically extracted supporting subtheorems

If this is right

Models record higher proof success when given explicit supporting subtheorems.
Provers exhibit clear bias toward easy subtheorems over harder ones.
Proof generation favors long tactic traces rather than compact plans.
Coverage and token-efficiency metrics distinguish proof behavior beyond binary success or failure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks for formal theorem proving should include dependency information as a standard feature to expose structural reasoning limits.
Refining the automatic extraction process could produce even larger gains if extraction accuracy improves.
The observed bias toward easy subtheorems may appear in other formal systems when similar structured benchmarks are created.

Load-bearing premise

The automatically extracted supporting subtheorems correctly capture each theorem's internal proof dependencies without errors or irrelevant additions.

What would settle it

Manual inspection of the extracted subtheorems showing frequent inaccuracies or irrelevancies that directly explain the reported performance gains would falsify the claim that premises aid reasoning rather than supply shortcuts.

Figures

Figures reproduced from arXiv: 2606.09450 by Andrey Galichin, Elvir Karimov, Ivan Oseledets, QuocViet Pham.

**Figure 2.** Figure 2: Performance of models on the premised benchmark. DeepSeek-Prover-V2- [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: TheoremBench construction pipeline. Lean4 source files are parsed into [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of TheoremBench instances by mathematical field. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Theorem-level coverage survival curve on the premised benchmark. For [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Overall token-efficiency distribution on solved premised instances. Values [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Coverage by mathematical field and complexity bucket on the premised [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Token-efficiency conditioned on theorem difficulty. The strongest solved [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Ground-truth proof length versus average successful generated proof length. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Example of transformation from original to premised structure. The [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

read the original abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TheoremBench's main/premised split and new coverage/token metrics are a reasonable step past contest-only benchmarks, but the automatic subtheorem extraction has no validation shown, which undercuts the performance claims.

read the letter

The paper's core move is to take nearly 100 classical theorems in Lean4 and release them in two forms: a plain main version and a premised version that adds automatically extracted supporting subtheorems. This turns each theorem into a family of tasks so you can score partial progress and internal structure instead of just final success or failure. They also add theorem-level coverage and token-efficiency metrics. Those choices are not in the competition benchmarks they reference, and they do surface the reported pattern that models favor easy subtheorems and produce long tactic traces.

The extraction step is the clear weak point. The abstract only says the subtheorems are "automatically extracted" with no description of the method, no error analysis, and no comparison to the original human proofs or human-curated dependencies. Without that, the claimed gains from explicit premises and the bias observations rest on an unverified assumption. The experiments also give no details on model versions, theorem selection criteria, or statistical tests, so the quantitative results stay hard to interpret.

This is aimed at groups working on Lean4 provers who want evaluation that tracks dependency structure rather than isolated contest problems. The construction idea is worth referee time even if the current write-up needs the extraction procedure and more experimental transparency before the numbers can be taken at face value. I would send it to review.

Referee Report

3 major / 1 minor

Summary. The paper introduces TheoremBench, a Lean4 benchmark built from nearly 100 classical theorems, released in a plain main version and a premised version that augments each target with automatically extracted supporting subtheorems. Experiments claim that explicit premises substantially improve performance for Lean4-capable prover models; new theorem-level coverage and token-efficiency metrics reveal that current provers are biased toward easy subtheorems and produce long, inefficient tactic traces rather than compact proofs.

Significance. If the subtheorem extraction faithfully reflects internal proof dependencies, the benchmark supplies a structured, dependency-aware evaluation that goes beyond competition-style problems and exposes qualitative limitations in existing provers. The work is an empirical benchmark contribution with no machine-checked proofs or parameter-free derivations.

major comments (3)

[Abstract / benchmark construction] Abstract and benchmark-construction description: the claim that 'explicit premises substantially improve performance' and that provers are 'strongly biased toward easy subtheorems' rests on the premised version faithfully decomposing each theorem into its actual proof dependencies. The text states only that subtheorems are 'automatically extracted' and provides no procedure, error analysis, comparison to human-curated Lean proofs, or validation that irrelevant lemmas or missing dependencies have not been introduced.
[Experiments] Experiments and evaluation sections: no details are given on theorem selection criteria, the precise model versions or prover configurations tested, the number of runs, or statistical significance testing for the reported performance gains. These omissions leave the quantitative claims without verifiable support.
[Metrics] Metrics definition: the new 'theorem-level coverage' and 'token-efficiency' metrics are introduced to expose qualitative differences, yet the paper does not demonstrate that these metrics are robust to variations in the (unvalidated) subtheorem extraction or that they correlate with human judgments of proof quality.

minor comments (1)

[Abstract] The abstract and introduction would benefit from an explicit statement of the total number of theorems and the distribution of their dependency depths.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on TheoremBench. The comments identify important gaps in documentation and validation that we will address in revision. Below we respond point-by-point to the three major comments.

read point-by-point responses

Referee: [Abstract / benchmark construction] Abstract and benchmark-construction description: the claim that 'explicit premises substantially improve performance' and that provers are 'strongly biased toward easy subtheorems' rests on the premised version faithfully decomposing each theorem into its actual proof dependencies. The text states only that subtheorems are 'automatically extracted' and provides no procedure, error analysis, comparison to human-curated Lean proofs, or validation that irrelevant lemmas or missing dependencies have not been introduced.

Authors: We agree that the manuscript currently provides insufficient detail on the automatic extraction process. The revised version will add a dedicated subsection describing the dependency-analysis algorithm used to extract supporting subtheorems, an error analysis performed on a random sample of theorems, and a direct comparison against manually inspected Lean proofs from standard mathematical libraries. These additions will substantiate that the extracted premises correspond to genuine proof dependencies. revision: yes
Referee: [Experiments] Experiments and evaluation sections: no details are given on theorem selection criteria, the precise model versions or prover configurations tested, the number of runs, or statistical significance testing for the reported performance gains. These omissions leave the quantitative claims without verifiable support.

Authors: We accept that the experimental protocol requires fuller specification. The revision will explicitly state the criteria used to select the classical theorems, list the exact model checkpoints and prover configurations, report the number of independent runs, and include statistical significance tests (e.g., bootstrap confidence intervals or paired tests) for all performance differences. revision: yes
Referee: [Metrics] Metrics definition: the new 'theorem-level coverage' and 'token-efficiency' metrics are introduced to expose qualitative differences, yet the paper does not demonstrate that these metrics are robust to variations in the (unvalidated) subtheorem extraction or that they correlate with human judgments of proof quality.

Authors: The metrics are intended as exploratory tools to surface qualitative behaviors not captured by binary success. We will add a short robustness analysis examining how coverage and efficiency scores change under controlled perturbations of the extracted premise sets, together with illustrative examples that align the metric values with human-readable notions of proof compactness. A comprehensive human correlation study lies beyond the scope of the current revision but can be noted as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with direct observations

full rationale

The paper introduces TheoremBench as a new Lean4 dataset from classical theorems and reports performance metrics from running existing prover models on its plain and premised versions. No equations, fitted parameters, or derived predictions appear. The central claims (premises improve performance; bias toward easy subtheorems) are presented as direct empirical outcomes from the benchmark runs, not as outputs of any internal derivation or self-referential construction. Automatic subtheorem extraction is described at a high level but is not used to derive the reported results by construction; any validation gap is a correctness issue, not circularity. No self-citations are load-bearing for the main claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The paper contributes a new evaluation framework rather than new mathematical derivations; it rests on standard assumptions about Lean4 and the value of classical theorems for testing.

axioms (1)

domain assumption Lean4 is a sound and usable formal system for expressing and checking mathematical theorems
The benchmark is entirely built and evaluated inside Lean4.

invented entities (3)

TheoremBench (main and premised versions) no independent evidence
purpose: To evaluate theorem provers on structured classical theorems with partial progress
Newly created benchmark dataset and task families introduced by the paper.
theorem-level coverage metric no independent evidence
purpose: To quantify partial progress through internal proof structure
Newly proposed evaluation measure.
token-efficiency metric no independent evidence
purpose: To measure qualitative differences in proof behavior
Newly proposed evaluation measure.

pith-pipeline@v0.9.1-grok · 5761 in / 1427 out tokens · 31522 ms · 2026-06-27T16:17:31.787821+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 36 canonical work pages · 10 internal anchors

[1]

Avigad, J., De Moura, L., Kong, S.: Theorem proving in lean (2021)

2021
[2]

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics (2023),https://arxiv.org/abs/2302.12433 TheoremBench: LLMs in Formal Mathematics 13

work page arXiv 2023
[3]

Azerbayev, Z., Schoelkopf, H., Paster, K., Dos Santos, M., McAleer, S., Jiang, A.Q., Deng, J., Biderman, S., Welleck, S.: Llemma: An open language model for mathematics (2023),https://arxiv.org/abs/2310.10631

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Bansal, K., Loos, S.M., Rabe, M.N., Szegedy, C., Wilcox, S.: Holist: An environment for machine learning of higher-order theorem proving (2019),https://arxiv.org/ abs/1904.03241

work page arXiv 2019
[5]

Research Re- port RT-0203, INRIA (May 1997),https://inria.hal.science/inria-00069968, projet COQ

Barras, B., Boutin, S., Cornes, C., Courant, J., Filliâtre, J.C., Giménez, E., Herbelin, H., Huet, G., Muñoz, C., Murthy, C., Parent, C., Paulin-Mohring, C., Saïbi, A., Werner, B.: The Coq Proof Assistant Reference Manual : Version 6.1. Research Re- port RT-0203, INRIA (May 1997),https://inria.hal.science/inria-00069968, projet COQ

1997
[6]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chan...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

2020 , pages =

mathlib Community, T.: The lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. p. 367–381. CPP 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3372885.3373824, https://doi.org/10. 1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020
[8]

First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repair with large language models (2023),https://arxiv.org/abs/2303.04910

work page arXiv 2023
[9]

Gauthier, T., Kaliszyk, C., Urban, J.: Learning to reason with hol4 tactics (2018), https://arxiv.org/abs/1804.00595

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

Gu, A., Piotrowski, B., Gloeckle, F., Yang, K., Markosyan, A.H.: Proofoptimizer: Training language models to simplify proofs without human demonstrations (2025), https://arxiv.org/abs/2510.15700

work page arXiv 2025
[11]

Huang, D., Dhariwal, P., Song, D., Sutskever, I.: Gamepad: A learning environment for theorem proving (2018),https://arxiv.org/abs/1806.00608

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Jiang, A.Q., Li, W., Tworkowski, S., Czechowski, K., Odrzygozdz, T., Milos, P., Wu, Y., Jamnik, M.: Thor: Wielding hammers to integrate language models and automated theorem provers (2022),https://arxiv.org/abs/2205.10893

work page arXiv 2022
[13]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SMa9EAovKMC

Jiang, A.Q., Welleck, S., Zhou, J.P., Lacroix, T., Liu, J., Li, W., Jamnik, M., Lample, G., Wu, Y.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SMa9EAovKMC

2023
[14]

Lample, G., Lachaux, M.A., Lavril, T., Martinet, X., Hayat, A., Ebner, G., Ro- driguez, A., Lacroix, T.: Hypertree proof search for neural theorem proving (2022), https://arxiv.org/abs/2205.11491

work page arXiv 2022
[15]

Li, Y., Du, D., Song, L., Li, C., Wang, W., Yang, T., Mi, H.: Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving (2025),https://arxiv.org/abs/2412.20735

work page arXiv 2025
[16]

Pham et al

Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving (2024),https://arxiv.org/abs/2404.09939 14 Q. Pham et al

work page arXiv 2024
[17]

Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., Jin, C.: Goedel-prover: A frontier model for open-source automated theorem proving (2025),https://arxiv.org/abs/2502.07640

work page arXiv 2025
[18]

Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., Wu, J., Gesi, J., Lu, X., Acuna, D., Yang, K., Lin, H., Choi, Y., Chen, D., Arora, S., Jin, C.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction (2025),https://arxiv.org/abs/2508.03613

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Liu, C., Shen, J., Xin, H., Liu, Z., Yuan, Y., Wang, H., Ju, W., Zheng, C., Yin, Y., Li, L., Zhang, M., Liu, Q.: Fimo: A challenge formal dataset for automated theorem proving (2023)

2023
[20]

org/abs/2505.03171

Liu, J., Lin, X., Bayer, J., Dillies, Y., Jiang, W., Liang, X., Soletskyi, R., Wang, H., Xie, Y., Xiong, B., Yang, Z., Zhang, J., Zhi, L., Li, J., Liu, Z.: Combibench: Benchmarking llm capability for combinatorial mathematics (2025),https://arxiv. org/abs/2505.03171

work page arXiv 2025
[21]

Mikula, M., Tworkowski, S., Antoniak, S., Piotrowski, B., Jiang, A.Q., Zhou, J.P., Szegedy, C., Kucinski, L., Milos, P., Wu, Y.: Magnushammer: A transformer-based approach to premise selection (2023),https://arxiv.org/abs/2303.04488

work page arXiv 2023
[22]

de Moura, S

Moura, L.d., Ullrich, S.: The lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28: 28th International Conference on Automated De- duction, Virtual Event, July 12–15, 2021, Proceedings. p. 625–635. Springer-Verlag, Berlin, Heidelberg (2021). https://doi.org/10.1007/978-3-030-79876-5_37, https://doi.org/10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021
[23]

Ospanov, A., Yousefzadeh, R.: Apollo: Automated llm and lean collaboration for advanced formal reasoning (2025),https://arxiv.org/abs/2505.05758

work page arXiv 2025
[24]

Lecture Notes in Computer Science, Springer (1994),https://books.google.ru/books?id=RxlhqG0-cGwC

Paulson, L.: Isabelle: A Generic Theorem Prover. Lecture Notes in Computer Science, Springer (1994),https://books.google.ru/books?id=RxlhqG0-cGwC

1994
[25]

Polu, S., Sutskever, I.: Generative language modeling for automated theorem proving (2020),https://arxiv.org/abs/2009.03393

work page internal anchor Pith review Pith/arXiv arXiv 2020
[26]

Ren, Z.Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z.F., Gou, Z., Ma, S., Tang, H., Liu, Y., Gao, W., Guo, D., Ruan, C.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition (2025),https://arxiv.org/abs/2504.21801

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Shen, Z., Huang, N., Yang, F., Wang, Y., Gao, G., Xu, T., Jiang, J., He, W., Yang, P., Sun, M., Ju, H., Wu, P., Dai, B., Dong, B.: Real-prover: Retrieval augmented lean prover for mathematical reasoning (2025),https://arxiv.org/abs/2505.20613

work page arXiv 2025
[28]

Thakur, A., Tsoukalas, G., Wen, Y., Xin, J., Chaudhuri, S.: An in-context learning agent for formal theorem-proving (2023),https://arxiv.org/abs/2310.04353

work page arXiv 2023
[29]

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition (2024),https://arxiv.org/abs/2407.11214

work page arXiv 2024
[30]

Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., Saxcé, H.d., Bailey, B., Song, C., Xiao, C., Zhang, D., Zhang, E., Pu, F., Zhu, H., Liu, J., Bayer, J., Michel, J., Yu, L., Dreyfus-Schmidt, L., Tunstall, L., Pagani, L., Machado, M., Bourigault, P., Wang, R., Polu, S., Barroyer, T., Li, W.D.,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Wang, R., Zhang, J., Jia, Y., Pan, R., Diao, S., Pi, R., Zhang, T.: Theoremllama: Transforming general-purpose llms into lean4 experts (2024),https://arxiv.org/ abs/2407.03203 TheoremBench: LLMs in Formal Mathematics 15

work page arXiv 2024
[32]

Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., Zhang, T.: Autoformalization in the era of large language models: A survey (2025),https://arxiv.org/abs/2505. 23486

2025
[33]

Wiedijk, F.: 100 theorems in lean,https://leanprover-community.github.io/ 100.html
[34]

Wu, Y., Jiang, A.Q., Li, W., Rabe, M.N., Staats, C., Jamnik, M., Szegedy, C.: Autoformalization with large language models (2022),https://arxiv.org/abs/ 2205.12615

work page arXiv 2022
[35]

Wu, Z., Huang, S., Zhou, Z., Ying, H., Wang, J., Lin, D., Chen, K.: Internlm2.5- stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems (2024),https://arxiv.org/abs/2410.15700

work page arXiv 2024
[36]

Xin, H., Ren, Z.Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., Gao, W., Zhang, H., Zhu, Q., Yang, D., Wu, Z.F., Luo, F., Ruan, C.: Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search (2024),https://arxiv.org/abs/2408.08152

work page arXiv 2024
[37]

Xin, R., Xi, C., Yang, J., Chen, F., Wu, H., Xiao, X., Sun, Y., Zheng, S., Shen, K.: Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving (2025),https://arxiv.org/abs/2502.03438

work page arXiv 2025
[38]

Yang, K., Deng, J.: Learning to prove theorems via interacting with proof assistants (2019),https://arxiv.org/abs/1905.09381

work page internal anchor Pith review Pith/arXiv arXiv 2019
[39]

Yang, K., Swope, A.M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., Anandkumar, A.: Leandojo: Theoremproving with retrieval-augmentedlanguage models (2023),https://arxiv.org/abs/2306.15626

work page arXiv 2023
[40]

Yin, D., Gao, J.: Generating millions of lean theorems with proofs by exploring state transition graphs (2025),https://arxiv.org/abs/2503.04772

work page arXiv 2025
[41]

Ying, H., Wu, Z., Geng, Y., Yuan, Z., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems (2025), https://arxiv.org/abs/2406.03847

work page arXiv 2025
[42]

Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., Wen, Y., Zhang, G., Liu, W.: Formalmath: Benchmarking formal mathematical reasoning of large language models (2025),https://arxiv.org/abs/ 2505.02735

work page arXiv 2025
[43]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics (2022),https://arxiv.org/abs/2109.00110 16 Q. Pham et al. A Appendix A.1 Dataset Distribution In total, TheoremBench contains 1,142 Lean4 proving instances across the processed theorem groups. Table 2 summarizes the size of the extracted proof structure...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[1] [1]

Avigad, J., De Moura, L., Kong, S.: Theorem proving in lean (2021)

2021

[2] [2]

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics (2023),https://arxiv.org/abs/2302.12433 TheoremBench: LLMs in Formal Mathematics 13

work page arXiv 2023

[3] [3]

Azerbayev, Z., Schoelkopf, H., Paster, K., Dos Santos, M., McAleer, S., Jiang, A.Q., Deng, J., Biderman, S., Welleck, S.: Llemma: An open language model for mathematics (2023),https://arxiv.org/abs/2310.10631

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Bansal, K., Loos, S.M., Rabe, M.N., Szegedy, C., Wilcox, S.: Holist: An environment for machine learning of higher-order theorem proving (2019),https://arxiv.org/ abs/1904.03241

work page arXiv 2019

[5] [5]

Research Re- port RT-0203, INRIA (May 1997),https://inria.hal.science/inria-00069968, projet COQ

Barras, B., Boutin, S., Cornes, C., Courant, J., Filliâtre, J.C., Giménez, E., Herbelin, H., Huet, G., Muñoz, C., Murthy, C., Parent, C., Paulin-Mohring, C., Saïbi, A., Werner, B.: The Coq Proof Assistant Reference Manual : Version 6.1. Research Re- port RT-0203, INRIA (May 1997),https://inria.hal.science/inria-00069968, projet COQ

1997

[6] [6]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chan...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

2020 , pages =

mathlib Community, T.: The lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. p. 367–381. CPP 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3372885.3373824, https://doi.org/10. 1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020

[8] [8]

First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repair with large language models (2023),https://arxiv.org/abs/2303.04910

work page arXiv 2023

[9] [9]

Gauthier, T., Kaliszyk, C., Urban, J.: Learning to reason with hol4 tactics (2018), https://arxiv.org/abs/1804.00595

work page internal anchor Pith review Pith/arXiv arXiv 2018

[10] [10]

Gu, A., Piotrowski, B., Gloeckle, F., Yang, K., Markosyan, A.H.: Proofoptimizer: Training language models to simplify proofs without human demonstrations (2025), https://arxiv.org/abs/2510.15700

work page arXiv 2025

[11] [11]

Huang, D., Dhariwal, P., Song, D., Sutskever, I.: Gamepad: A learning environment for theorem proving (2018),https://arxiv.org/abs/1806.00608

work page internal anchor Pith review Pith/arXiv arXiv 2018

[12] [12]

Jiang, A.Q., Li, W., Tworkowski, S., Czechowski, K., Odrzygozdz, T., Milos, P., Wu, Y., Jamnik, M.: Thor: Wielding hammers to integrate language models and automated theorem provers (2022),https://arxiv.org/abs/2205.10893

work page arXiv 2022

[13] [13]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SMa9EAovKMC

Jiang, A.Q., Welleck, S., Zhou, J.P., Lacroix, T., Liu, J., Li, W., Jamnik, M., Lample, G., Wu, Y.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SMa9EAovKMC

2023

[14] [14]

Lample, G., Lachaux, M.A., Lavril, T., Martinet, X., Hayat, A., Ebner, G., Ro- driguez, A., Lacroix, T.: Hypertree proof search for neural theorem proving (2022), https://arxiv.org/abs/2205.11491

work page arXiv 2022

[15] [15]

Li, Y., Du, D., Song, L., Li, C., Wang, W., Yang, T., Mi, H.: Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving (2025),https://arxiv.org/abs/2412.20735

work page arXiv 2025

[16] [16]

Pham et al

Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving (2024),https://arxiv.org/abs/2404.09939 14 Q. Pham et al

work page arXiv 2024

[17] [17]

Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., Jin, C.: Goedel-prover: A frontier model for open-source automated theorem proving (2025),https://arxiv.org/abs/2502.07640

work page arXiv 2025

[18] [18]

Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., Wu, J., Gesi, J., Lu, X., Acuna, D., Yang, K., Lin, H., Choi, Y., Chen, D., Arora, S., Jin, C.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction (2025),https://arxiv.org/abs/2508.03613

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Liu, C., Shen, J., Xin, H., Liu, Z., Yuan, Y., Wang, H., Ju, W., Zheng, C., Yin, Y., Li, L., Zhang, M., Liu, Q.: Fimo: A challenge formal dataset for automated theorem proving (2023)

2023

[20] [20]

org/abs/2505.03171

Liu, J., Lin, X., Bayer, J., Dillies, Y., Jiang, W., Liang, X., Soletskyi, R., Wang, H., Xie, Y., Xiong, B., Yang, Z., Zhang, J., Zhi, L., Li, J., Liu, Z.: Combibench: Benchmarking llm capability for combinatorial mathematics (2025),https://arxiv. org/abs/2505.03171

work page arXiv 2025

[21] [21]

Mikula, M., Tworkowski, S., Antoniak, S., Piotrowski, B., Jiang, A.Q., Zhou, J.P., Szegedy, C., Kucinski, L., Milos, P., Wu, Y.: Magnushammer: A transformer-based approach to premise selection (2023),https://arxiv.org/abs/2303.04488

work page arXiv 2023

[22] [22]

de Moura, S

Moura, L.d., Ullrich, S.: The lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28: 28th International Conference on Automated De- duction, Virtual Event, July 12–15, 2021, Proceedings. p. 625–635. Springer-Verlag, Berlin, Heidelberg (2021). https://doi.org/10.1007/978-3-030-79876-5_37, https://doi.org/10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021

[23] [23]

Ospanov, A., Yousefzadeh, R.: Apollo: Automated llm and lean collaboration for advanced formal reasoning (2025),https://arxiv.org/abs/2505.05758

work page arXiv 2025

[24] [24]

Lecture Notes in Computer Science, Springer (1994),https://books.google.ru/books?id=RxlhqG0-cGwC

Paulson, L.: Isabelle: A Generic Theorem Prover. Lecture Notes in Computer Science, Springer (1994),https://books.google.ru/books?id=RxlhqG0-cGwC

1994

[25] [25]

Polu, S., Sutskever, I.: Generative language modeling for automated theorem proving (2020),https://arxiv.org/abs/2009.03393

work page internal anchor Pith review Pith/arXiv arXiv 2020

[26] [26]

Ren, Z.Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z.F., Gou, Z., Ma, S., Tang, H., Liu, Y., Gao, W., Guo, D., Ruan, C.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition (2025),https://arxiv.org/abs/2504.21801

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Shen, Z., Huang, N., Yang, F., Wang, Y., Gao, G., Xu, T., Jiang, J., He, W., Yang, P., Sun, M., Ju, H., Wu, P., Dai, B., Dong, B.: Real-prover: Retrieval augmented lean prover for mathematical reasoning (2025),https://arxiv.org/abs/2505.20613

work page arXiv 2025

[28] [28]

Thakur, A., Tsoukalas, G., Wen, Y., Xin, J., Chaudhuri, S.: An in-context learning agent for formal theorem-proving (2023),https://arxiv.org/abs/2310.04353

work page arXiv 2023

[29] [29]

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition (2024),https://arxiv.org/abs/2407.11214

work page arXiv 2024

[30] [30]

Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., Saxcé, H.d., Bailey, B., Song, C., Xiao, C., Zhang, D., Zhang, E., Pu, F., Zhu, H., Liu, J., Bayer, J., Michel, J., Yu, L., Dreyfus-Schmidt, L., Tunstall, L., Pagani, L., Machado, M., Bourigault, P., Wang, R., Polu, S., Barroyer, T., Li, W.D.,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Wang, R., Zhang, J., Jia, Y., Pan, R., Diao, S., Pi, R., Zhang, T.: Theoremllama: Transforming general-purpose llms into lean4 experts (2024),https://arxiv.org/ abs/2407.03203 TheoremBench: LLMs in Formal Mathematics 15

work page arXiv 2024

[32] [32]

Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., Zhang, T.: Autoformalization in the era of large language models: A survey (2025),https://arxiv.org/abs/2505. 23486

2025

[33] [33]

Wiedijk, F.: 100 theorems in lean,https://leanprover-community.github.io/ 100.html

[34] [34]

Wu, Y., Jiang, A.Q., Li, W., Rabe, M.N., Staats, C., Jamnik, M., Szegedy, C.: Autoformalization with large language models (2022),https://arxiv.org/abs/ 2205.12615

work page arXiv 2022

[35] [35]

Wu, Z., Huang, S., Zhou, Z., Ying, H., Wang, J., Lin, D., Chen, K.: Internlm2.5- stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems (2024),https://arxiv.org/abs/2410.15700

work page arXiv 2024

[36] [36]

Xin, H., Ren, Z.Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., Gao, W., Zhang, H., Zhu, Q., Yang, D., Wu, Z.F., Luo, F., Ruan, C.: Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search (2024),https://arxiv.org/abs/2408.08152

work page arXiv 2024

[37] [37]

Xin, R., Xi, C., Yang, J., Chen, F., Wu, H., Xiao, X., Sun, Y., Zheng, S., Shen, K.: Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving (2025),https://arxiv.org/abs/2502.03438

work page arXiv 2025

[38] [38]

Yang, K., Deng, J.: Learning to prove theorems via interacting with proof assistants (2019),https://arxiv.org/abs/1905.09381

work page internal anchor Pith review Pith/arXiv arXiv 2019

[39] [39]

Yang, K., Swope, A.M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., Anandkumar, A.: Leandojo: Theoremproving with retrieval-augmentedlanguage models (2023),https://arxiv.org/abs/2306.15626

work page arXiv 2023

[40] [40]

Yin, D., Gao, J.: Generating millions of lean theorems with proofs by exploring state transition graphs (2025),https://arxiv.org/abs/2503.04772

work page arXiv 2025

[41] [41]

Ying, H., Wu, Z., Geng, Y., Yuan, Z., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems (2025), https://arxiv.org/abs/2406.03847

work page arXiv 2025

[42] [42]

Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., Wen, Y., Zhang, G., Liu, W.: Formalmath: Benchmarking formal mathematical reasoning of large language models (2025),https://arxiv.org/abs/ 2505.02735

work page arXiv 2025

[43] [43]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics (2022),https://arxiv.org/abs/2109.00110 16 Q. Pham et al. A Appendix A.1 Dataset Distribution In total, TheoremBench contains 1,142 Lean4 proving instances across the processed theorem groups. Table 2 summarizes the size of the extracted proof structure...

work page internal anchor Pith review Pith/arXiv arXiv 2022