pith. sign in

arxiv: 2606.09450 · v1 · pith:GCZX6QCAnew · submitted 2026-06-08 · 💻 cs.AI

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

Pith reviewed 2026-06-27 16:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords theorem provingLean4LLM evaluationformal mathematicsbenchmarksubtheoremsproof efficiencyclassical theorems
0
0 comments X

The pith

Explicit premises raise success rates for Lean4 theorem provers on classical theorems

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds TheoremBench from nearly one hundred classical theorems to test LLMs on formal proving in Lean4 outside contest settings. Each theorem appears in two forms: a plain version with only the target statement, and a premised version that adds automatically extracted supporting subtheorems to expose internal proof dependencies. Experiments show that Lean4-capable models achieve substantially higher success when given these explicit premises. New metrics for coverage and token efficiency reveal that models still prefer easy subtheorems and generate long tactic traces instead of compact plans. The design therefore supplies a finer-grained picture of how models handle connected mathematical reasoning.

Core claim

TheoremBench supplies classical theorems in a plain version and a premised version built from automatically extracted supporting subtheorems. Lean4 prover models record markedly higher proof rates on the premised versions. The same models remain biased toward easy subtheorems and solve tasks via long, inefficient tactic traces rather than compact proof plans. Theorem-level coverage and token-efficiency metrics make these qualitative differences measurable.

What carries the argument

TheoremBench premised version, which expands each target theorem into a structured family of the main statement plus its automatically extracted supporting subtheorems

If this is right

  • Models record higher proof success when given explicit supporting subtheorems.
  • Provers exhibit clear bias toward easy subtheorems over harder ones.
  • Proof generation favors long tactic traces rather than compact plans.
  • Coverage and token-efficiency metrics distinguish proof behavior beyond binary success or failure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks for formal theorem proving should include dependency information as a standard feature to expose structural reasoning limits.
  • Refining the automatic extraction process could produce even larger gains if extraction accuracy improves.
  • The observed bias toward easy subtheorems may appear in other formal systems when similar structured benchmarks are created.

Load-bearing premise

The automatically extracted supporting subtheorems correctly capture each theorem's internal proof dependencies without errors or irrelevant additions.

What would settle it

Manual inspection of the extracted subtheorems showing frequent inaccuracies or irrelevancies that directly explain the reported performance gains would falsify the claim that premises aid reasoning rather than supply shortcuts.

Figures

Figures reproduced from arXiv: 2606.09450 by Andrey Galichin, Elvir Karimov, Ivan Oseledets, QuocViet Pham.

Figure 1
Figure 1. Figure 1: Performance comparison of Lean4-capable theorem provers in the plain [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance of models on the premised benchmark. DeepSeek-Prover-V2- [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: TheoremBench construction pipeline. Lean4 source files are parsed into [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of TheoremBench instances by mathematical field. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Theorem-level coverage survival curve on the premised benchmark. For [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overall token-efficiency distribution on solved premised instances. Values [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Coverage by mathematical field and complexity bucket on the premised [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Token-efficiency conditioned on theorem difficulty. The strongest solved [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ground-truth proof length versus average successful generated proof length. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Example of transformation from original to premised structure. The [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
read the original abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces TheoremBench, a Lean4 benchmark built from nearly 100 classical theorems, released in a plain main version and a premised version that augments each target with automatically extracted supporting subtheorems. Experiments claim that explicit premises substantially improve performance for Lean4-capable prover models; new theorem-level coverage and token-efficiency metrics reveal that current provers are biased toward easy subtheorems and produce long, inefficient tactic traces rather than compact proofs.

Significance. If the subtheorem extraction faithfully reflects internal proof dependencies, the benchmark supplies a structured, dependency-aware evaluation that goes beyond competition-style problems and exposes qualitative limitations in existing provers. The work is an empirical benchmark contribution with no machine-checked proofs or parameter-free derivations.

major comments (3)
  1. [Abstract / benchmark construction] Abstract and benchmark-construction description: the claim that 'explicit premises substantially improve performance' and that provers are 'strongly biased toward easy subtheorems' rests on the premised version faithfully decomposing each theorem into its actual proof dependencies. The text states only that subtheorems are 'automatically extracted' and provides no procedure, error analysis, comparison to human-curated Lean proofs, or validation that irrelevant lemmas or missing dependencies have not been introduced.
  2. [Experiments] Experiments and evaluation sections: no details are given on theorem selection criteria, the precise model versions or prover configurations tested, the number of runs, or statistical significance testing for the reported performance gains. These omissions leave the quantitative claims without verifiable support.
  3. [Metrics] Metrics definition: the new 'theorem-level coverage' and 'token-efficiency' metrics are introduced to expose qualitative differences, yet the paper does not demonstrate that these metrics are robust to variations in the (unvalidated) subtheorem extraction or that they correlate with human judgments of proof quality.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from an explicit statement of the total number of theorems and the distribution of their dependency depths.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on TheoremBench. The comments identify important gaps in documentation and validation that we will address in revision. Below we respond point-by-point to the three major comments.

read point-by-point responses
  1. Referee: [Abstract / benchmark construction] Abstract and benchmark-construction description: the claim that 'explicit premises substantially improve performance' and that provers are 'strongly biased toward easy subtheorems' rests on the premised version faithfully decomposing each theorem into its actual proof dependencies. The text states only that subtheorems are 'automatically extracted' and provides no procedure, error analysis, comparison to human-curated Lean proofs, or validation that irrelevant lemmas or missing dependencies have not been introduced.

    Authors: We agree that the manuscript currently provides insufficient detail on the automatic extraction process. The revised version will add a dedicated subsection describing the dependency-analysis algorithm used to extract supporting subtheorems, an error analysis performed on a random sample of theorems, and a direct comparison against manually inspected Lean proofs from standard mathematical libraries. These additions will substantiate that the extracted premises correspond to genuine proof dependencies. revision: yes

  2. Referee: [Experiments] Experiments and evaluation sections: no details are given on theorem selection criteria, the precise model versions or prover configurations tested, the number of runs, or statistical significance testing for the reported performance gains. These omissions leave the quantitative claims without verifiable support.

    Authors: We accept that the experimental protocol requires fuller specification. The revision will explicitly state the criteria used to select the classical theorems, list the exact model checkpoints and prover configurations, report the number of independent runs, and include statistical significance tests (e.g., bootstrap confidence intervals or paired tests) for all performance differences. revision: yes

  3. Referee: [Metrics] Metrics definition: the new 'theorem-level coverage' and 'token-efficiency' metrics are introduced to expose qualitative differences, yet the paper does not demonstrate that these metrics are robust to variations in the (unvalidated) subtheorem extraction or that they correlate with human judgments of proof quality.

    Authors: The metrics are intended as exploratory tools to surface qualitative behaviors not captured by binary success. We will add a short robustness analysis examining how coverage and efficiency scores change under controlled perturbations of the extracted premise sets, together with illustrative examples that align the metric values with human-readable notions of proof compactness. A comprehensive human correlation study lies beyond the scope of the current revision but can be noted as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with direct observations

full rationale

The paper introduces TheoremBench as a new Lean4 dataset from classical theorems and reports performance metrics from running existing prover models on its plain and premised versions. No equations, fitted parameters, or derived predictions appear. The central claims (premises improve performance; bias toward easy subtheorems) are presented as direct empirical outcomes from the benchmark runs, not as outputs of any internal derivation or self-referential construction. Automatic subtheorem extraction is described at a high level but is not used to derive the reported results by construction; any validation gap is a correctness issue, not circularity. No self-citations are load-bearing for the main claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The paper contributes a new evaluation framework rather than new mathematical derivations; it rests on standard assumptions about Lean4 and the value of classical theorems for testing.

axioms (1)
  • domain assumption Lean4 is a sound and usable formal system for expressing and checking mathematical theorems
    The benchmark is entirely built and evaluated inside Lean4.
invented entities (3)
  • TheoremBench (main and premised versions) no independent evidence
    purpose: To evaluate theorem provers on structured classical theorems with partial progress
    Newly created benchmark dataset and task families introduced by the paper.
  • theorem-level coverage metric no independent evidence
    purpose: To quantify partial progress through internal proof structure
    Newly proposed evaluation measure.
  • token-efficiency metric no independent evidence
    purpose: To measure qualitative differences in proof behavior
    Newly proposed evaluation measure.

pith-pipeline@v0.9.1-grok · 5761 in / 1427 out tokens · 31522 ms · 2026-06-27T16:17:31.787821+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 36 canonical work pages · 10 internal anchors

  1. [1]

    Avigad, J., De Moura, L., Kong, S.: Theorem proving in lean (2021)

  2. [2]

    Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics (2023),https://arxiv.org/abs/2302.12433 TheoremBench: LLMs in Formal Mathematics 13

  3. [3]

    Azerbayev, Z., Schoelkopf, H., Paster, K., Dos Santos, M., McAleer, S., Jiang, A.Q., Deng, J., Biderman, S., Welleck, S.: Llemma: An open language model for mathematics (2023),https://arxiv.org/abs/2310.10631

  4. [4]

    Bansal, K., Loos, S.M., Rabe, M.N., Szegedy, C., Wilcox, S.: Holist: An environment for machine learning of higher-order theorem proving (2019),https://arxiv.org/ abs/1904.03241

  5. [5]

    Research Re- port RT-0203, INRIA (May 1997),https://inria.hal.science/inria-00069968, projet COQ

    Barras, B., Boutin, S., Cornes, C., Courant, J., Filliâtre, J.C., Giménez, E., Herbelin, H., Huet, G., Muñoz, C., Murthy, C., Parent, C., Paulin-Mohring, C., Saïbi, A., Werner, B.: The Coq Proof Assistant Reference Manual : Version 6.1. Research Re- port RT-0203, INRIA (May 1997),https://inria.hal.science/inria-00069968, projet COQ

  6. [6]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chan...

  7. [7]

    2020 , pages =

    mathlib Community, T.: The lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. p. 367–381. CPP 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3372885.3373824, https://doi.org/10. 1145/3372885.3373824

  8. [8]

    First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repair with large language models (2023),https://arxiv.org/abs/2303.04910

  9. [9]

    Gauthier, T., Kaliszyk, C., Urban, J.: Learning to reason with hol4 tactics (2018), https://arxiv.org/abs/1804.00595

  10. [10]

    Gu, A., Piotrowski, B., Gloeckle, F., Yang, K., Markosyan, A.H.: Proofoptimizer: Training language models to simplify proofs without human demonstrations (2025), https://arxiv.org/abs/2510.15700

  11. [11]

    Huang, D., Dhariwal, P., Song, D., Sutskever, I.: Gamepad: A learning environment for theorem proving (2018),https://arxiv.org/abs/1806.00608

  12. [12]

    Jiang, A.Q., Li, W., Tworkowski, S., Czechowski, K., Odrzygozdz, T., Milos, P., Wu, Y., Jamnik, M.: Thor: Wielding hammers to integrate language models and automated theorem provers (2022),https://arxiv.org/abs/2205.10893

  13. [13]

    In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SMa9EAovKMC

    Jiang, A.Q., Welleck, S., Zhou, J.P., Lacroix, T., Liu, J., Li, W., Jamnik, M., Lample, G., Wu, Y.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SMa9EAovKMC

  14. [14]

    Lample, G., Lachaux, M.A., Lavril, T., Martinet, X., Hayat, A., Ebner, G., Ro- driguez, A., Lacroix, T.: Hypertree proof search for neural theorem proving (2022), https://arxiv.org/abs/2205.11491

  15. [15]

    Li, Y., Du, D., Song, L., Li, C., Wang, W., Yang, T., Mi, H.: Hunyuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving (2025),https://arxiv.org/abs/2412.20735

  16. [16]

    Pham et al

    Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving (2024),https://arxiv.org/abs/2404.09939 14 Q. Pham et al

  17. [17]

    Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., Jin, C.: Goedel-prover: A frontier model for open-source automated theorem proving (2025),https://arxiv.org/abs/2502.07640

  18. [18]

    Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., Wu, J., Gesi, J., Lu, X., Acuna, D., Yang, K., Lin, H., Choi, Y., Chen, D., Arora, S., Jin, C.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction (2025),https://arxiv.org/abs/2508.03613

  19. [19]

    Liu, C., Shen, J., Xin, H., Liu, Z., Yuan, Y., Wang, H., Ju, W., Zheng, C., Yin, Y., Li, L., Zhang, M., Liu, Q.: Fimo: A challenge formal dataset for automated theorem proving (2023)

  20. [20]

    org/abs/2505.03171

    Liu, J., Lin, X., Bayer, J., Dillies, Y., Jiang, W., Liang, X., Soletskyi, R., Wang, H., Xie, Y., Xiong, B., Yang, Z., Zhang, J., Zhi, L., Li, J., Liu, Z.: Combibench: Benchmarking llm capability for combinatorial mathematics (2025),https://arxiv. org/abs/2505.03171

  21. [21]

    Mikula, M., Tworkowski, S., Antoniak, S., Piotrowski, B., Jiang, A.Q., Zhou, J.P., Szegedy, C., Kucinski, L., Milos, P., Wu, Y.: Magnushammer: A transformer-based approach to premise selection (2023),https://arxiv.org/abs/2303.04488

  22. [22]

    de Moura, S

    Moura, L.d., Ullrich, S.: The lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28: 28th International Conference on Automated De- duction, Virtual Event, July 12–15, 2021, Proceedings. p. 625–635. Springer-Verlag, Berlin, Heidelberg (2021). https://doi.org/10.1007/978-3-030-79876-5_37, https://doi.org/10.1007/978-3-030-79876-5_37

  23. [23]

    Ospanov, A., Yousefzadeh, R.: Apollo: Automated llm and lean collaboration for advanced formal reasoning (2025),https://arxiv.org/abs/2505.05758

  24. [24]

    Lecture Notes in Computer Science, Springer (1994),https://books.google.ru/books?id=RxlhqG0-cGwC

    Paulson, L.: Isabelle: A Generic Theorem Prover. Lecture Notes in Computer Science, Springer (1994),https://books.google.ru/books?id=RxlhqG0-cGwC

  25. [25]

    Polu, S., Sutskever, I.: Generative language modeling for automated theorem proving (2020),https://arxiv.org/abs/2009.03393

  26. [26]

    Ren, Z.Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z.F., Gou, Z., Ma, S., Tang, H., Liu, Y., Gao, W., Guo, D., Ruan, C.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition (2025),https://arxiv.org/abs/2504.21801

  27. [27]

    Shen, Z., Huang, N., Yang, F., Wang, Y., Gao, G., Xu, T., Jiang, J., He, W., Yang, P., Sun, M., Ju, H., Wu, P., Dai, B., Dong, B.: Real-prover: Retrieval augmented lean prover for mathematical reasoning (2025),https://arxiv.org/abs/2505.20613

  28. [28]

    Thakur, A., Tsoukalas, G., Wen, Y., Xin, J., Chaudhuri, S.: An in-context learning agent for formal theorem-proving (2023),https://arxiv.org/abs/2310.04353

  29. [29]

    Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition (2024),https://arxiv.org/abs/2407.11214

  30. [30]

    Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., Saxcé, H.d., Bailey, B., Song, C., Xiao, C., Zhang, D., Zhang, E., Pu, F., Zhu, H., Liu, J., Bayer, J., Michel, J., Yu, L., Dreyfus-Schmidt, L., Tunstall, L., Pagani, L., Machado, M., Bourigault, P., Wang, R., Polu, S., Barroyer, T., Li, W.D.,...

  31. [31]

    Wang, R., Zhang, J., Jia, Y., Pan, R., Diao, S., Pi, R., Zhang, T.: Theoremllama: Transforming general-purpose llms into lean4 experts (2024),https://arxiv.org/ abs/2407.03203 TheoremBench: LLMs in Formal Mathematics 15

  32. [32]

    Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., Zhang, T.: Autoformalization in the era of large language models: A survey (2025),https://arxiv.org/abs/2505. 23486

  33. [33]

    Wiedijk, F.: 100 theorems in lean,https://leanprover-community.github.io/ 100.html

  34. [34]

    Wu, Y., Jiang, A.Q., Li, W., Rabe, M.N., Staats, C., Jamnik, M., Szegedy, C.: Autoformalization with large language models (2022),https://arxiv.org/abs/ 2205.12615

  35. [35]

    Wu, Z., Huang, S., Zhou, Z., Ying, H., Wang, J., Lin, D., Chen, K.: Internlm2.5- stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems (2024),https://arxiv.org/abs/2410.15700

  36. [36]

    Xin, H., Ren, Z.Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., Gao, W., Zhang, H., Zhu, Q., Yang, D., Wu, Z.F., Luo, F., Ruan, C.: Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search (2024),https://arxiv.org/abs/2408.08152

  37. [37]

    Xin, R., Xi, C., Yang, J., Chen, F., Wu, H., Xiao, X., Sun, Y., Zheng, S., Shen, K.: Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving (2025),https://arxiv.org/abs/2502.03438

  38. [38]

    Yang, K., Deng, J.: Learning to prove theorems via interacting with proof assistants (2019),https://arxiv.org/abs/1905.09381

  39. [39]

    Yang, K., Swope, A.M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., Anandkumar, A.: Leandojo: Theoremproving with retrieval-augmentedlanguage models (2023),https://arxiv.org/abs/2306.15626

  40. [40]

    Yin, D., Gao, J.: Generating millions of lean theorems with proofs by exploring state transition graphs (2025),https://arxiv.org/abs/2503.04772

  41. [41]

    Ying, H., Wu, Z., Geng, Y., Yuan, Z., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems (2025), https://arxiv.org/abs/2406.03847

  42. [42]

    Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., Wen, Y., Zhang, G., Liu, W.: Formalmath: Benchmarking formal mathematical reasoning of large language models (2025),https://arxiv.org/abs/ 2505.02735

  43. [43]

    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

    Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics (2022),https://arxiv.org/abs/2109.00110 16 Q. Pham et al. A Appendix A.1 Dataset Distribution In total, TheoremBench contains 1,142 Lean4 proving instances across the processed theorem groups. Table 2 summarizes the size of the extracted proof structure...