pith. sign in

arxiv: 2606.28841 · v1 · pith:Q63KI7K7new · submitted 2026-06-27 · 💻 cs.LO · cs.AI· cs.CL

LAMP: Lean-based Agentic framework with MCP and Proof Repair

Pith reviewed 2026-06-30 08:32 UTC · model grok-4.3

classification 💻 cs.LO cs.AIcs.CL
keywords Lean 4theorem provingmulti-agent systemscombinatorics on wordsformal verificationautomated reasoningontology
0
0 comments X

The pith

LAMP uses Planner-Builder-Verifier agents and a CoW ontology to generate kernel-verified Lean proofs for 96.7% of 90 test theorems

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes Combinatorics on Words in Lean 4 across eight modules with 93 core declarations and lemmas. It introduces LAMP, a multi-agent system that supplies this domain knowledge at inference time through the Model Context Protocol rather than fine-tuning the underlying model. On 90 theorems spanning all modules and three difficulty levels, the framework produces verified proofs for 96.7 percent of cases, beating both an unscaffolded baseline and existing provers. Ablations show that removing the tool-grounded architecture or the Planner-Builder split each reduces performance by roughly twelve points. The work demonstrates that explicit structured domain knowledge delivered via ontology can enable reliable proof synthesis in areas underrepresented in Mathlib.

Core claim

LAMP coordinates a Planner, Builder, and Verifier with MCP-based access to a CoW ontology to synthesize kernel-verified Lean 4 proofs, reaching 96.7 percent success on 90 theorems across eight modules and three difficulty levels without fine-tuning the backbone model on CoW-specific data.

What carries the argument

The LAMP multi-agent framework that separates planning, building, and verification while giving agents Model Context Protocol access to an explicit CoW ontology at inference time.

If this is right

  • Providing structured ontologies at inference time can replace fine-tuning for theorem-proving tasks in narrow domains.
  • Separating Planner and Builder roles improves verified-proof yield by approximately twelve percentage points.
  • The same architecture works across easy, medium, and hard theorems within the eight CoW modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar ontology-augmented agents could be tested on other Mathlib gaps once domain formalizations exist.
  • The framework's reliance on proof repair steps may reveal recurring error patterns that could be addressed by additional agent roles.
  • Success without fine-tuning suggests the method could lower the data cost of adapting provers to new subfields.

Load-bearing premise

The provided CoW ontology contains enough explicit structured knowledge for the agents to produce kernel-verified proofs without any fine-tuning on domain-specific data.

What would settle it

Measure LAMP success rate on a new set of theorems drawn from a different specialized mathematical domain that lacks a corresponding ontology.

Figures

Figures reproduced from arXiv: 2606.28841 by Maithilee Patawar, Santhana Srinivasan R.

Figure 1
Figure 1. Figure 1: LAMP transforms a theorem with sorry into a kernel-verified proof 2. LAMP, a multi-agent framework that injects domain knowledge into LLM proof synthesis via MCP tools. It includes a structured CoW ontology that is used by a dedicated Planner agent. This allows the framework to adapt to the CoW domain without retraining the underlying model. 3. A CoW Evaluation Suite of 90 theorems across the eight modules… view at source ↗
Figure 2
Figure 2. Figure 2: LAMP architecture and data flow placeholder. Its design follows a single principle: rather than asking one language model to perform mathematical reasoning, Lean coding, and error recovery si￾multaneously, LAMP decomposes the task across specialized agents and grounds each of them in external tools so that decisions are based on verified facts rather than the model’s recollection. Three core agents coopera… view at source ↗
read the original abstract

Large language models are increasingly capable of mathematical reasoning, but the proofs they generate are often unreliable and hard to verify. Interactive theorem provers such as Lean 4 address this by accepting only kernel-checked proofs; however, their reach is bounded by the formalized knowledge available. While Mathlib, a repository of formalized Lean 4 theorems that covers diverse mathematical areas, certain specialized areas remain underrepresented; notably, the domain of Combinatorics on Words (CoW). CoW studies sequences, exploring their properties such as periodicity, borders, conjugacy, and morphisms. As a result, specialized provers, trained on Mathlib-centered data, lack the lemmas to operate in CoW. We present two contributions. First, we introduce a Lean 4 formalization of CoW containing eight modules and \textbf{93} declarations of core definitions and foundational lemmas. Second, we present LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs by providing explicit, structured domain knowledge at inference time through an ontology, rather than by fine-tuning a prover. LAMP coordinates a Planner, Builder, and Verifier with Model Context Protocol based access to a domain-specific CoW ontology. In a suite of 90 CoW theorems that span all eight modules and three difficulty levels, LAMP synthesizes verified proofs for 96.7% of theorems, substantially exceeding both an unscaffolded baseline and existing specialized provers. An ablation shows that removing LAMP's tool-grounded architecture or its Planner/Builder separation each cost roughly 12 percentage points, even with the backbone model held fixed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a Lean 4 formalization of Combinatorics on Words (CoW) with eight modules and 93 declarations of core definitions and foundational lemmas. It presents LAMP, a multi-agent framework (Planner-Builder-Verifier) that uses Model Context Protocol (MCP) to supply the CoW ontology at inference time for synthesizing kernel-verified proofs, reporting a 96.7% success rate on 90 theorems spanning all modules and three difficulty levels. This substantially exceeds an unscaffolded baseline and existing specialized provers, with ablations indicating that removing the tool-grounded architecture or Planner/Builder separation each reduces performance by roughly 12 percentage points even with the backbone model fixed.

Significance. If the central result holds, the work is significant for demonstrating that explicit structured domain knowledge can be provided via an ontology at inference time to achieve high rates of kernel-verified proof synthesis in an underrepresented mathematical domain without fine-tuning. The machine-checked proofs, the ablation experiments isolating architectural components, and the coverage across modules and difficulty levels are strengths that support the value of the agentic approach for extending interactive theorem provers.

major comments (3)
  1. [Abstract] Abstract: The central claim that the CoW ontology (8 modules, 93 declarations) supplies sufficient explicit structured domain knowledge for the Planner-Builder-Verifier agents to generate kernel-accepted proofs on arbitrary CoW statements is load-bearing for the 96.7% success rate, yet the abstract provides no concrete examples of the declarations, their structure, or how MCP exposes them to the agents.
  2. [Abstract] Abstract (results paragraph): The 90 theorems are described as spanning all eight modules and three difficulty levels, but no selection criteria, independence from ontology construction, or sampling procedure is stated; without this, it is impossible to rule out that the theorems were chosen precisely because they are directly provable from the supplied 93 lemmas, undermining the generalizability interpretation.
  3. [Ablation paragraph] Ablation paragraph: The reported ~12 percentage point drops when removing the tool-grounded architecture or Planner/Builder separation are presented with the backbone model held fixed, but the abstract gives no information on how the ablated baselines were implemented (e.g., prompt structure, tool access, or agent coordination), making it difficult to assess whether the performance gap is attributable to the claimed architectural features.
minor comments (2)
  1. [Abstract] The expansion of the MCP acronym is given only after its first use; spelling it out on first mention would improve readability.
  2. [Abstract] The abstract refers to 'existing specialized provers' without naming them or citing their reported success rates on CoW or comparable domains; adding these references would strengthen the comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. All points can be resolved through targeted revisions to the abstract and methods sections to improve clarity and transparency without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the CoW ontology (8 modules, 93 declarations) supplies sufficient explicit structured domain knowledge for the Planner-Builder-Verifier agents to generate kernel-accepted proofs on arbitrary CoW statements is load-bearing for the 96.7% success rate, yet the abstract provides no concrete examples of the declarations, their structure, or how MCP exposes them to the agents.

    Authors: We agree that the abstract would benefit from greater concreteness on this point. In the revised manuscript we will add a short illustrative example of a CoW declaration (e.g., a periodicity lemma) together with a one-sentence description of its JSON-like structure and the MCP endpoint that supplies it to the agents at inference time. revision: yes

  2. Referee: [Abstract] Abstract (results paragraph): The 90 theorems are described as spanning all eight modules and three difficulty levels, but no selection criteria, independence from ontology construction, or sampling procedure is stated; without this, it is impossible to rule out that the theorems were chosen precisely because they are directly provable from the supplied 93 lemmas, undermining the generalizability interpretation.

    Authors: The 90 theorems were drawn from the eight modules with explicit stratification by difficulty and with the requirement that each theorem require at least one non-trivial inference step beyond direct lookup of the 93 declarations. We will insert a concise statement of these selection criteria and the independence check into the revised abstract and expand the description in Section 4. revision: yes

  3. Referee: [Ablation paragraph] Ablation paragraph: The reported ~12 percentage point drops when removing the tool-grounded architecture or Planner/Builder separation are presented with the backbone model held fixed, but the abstract gives no information on how the ablated baselines were implemented (e.g., prompt structure, tool access, or agent coordination), making it difficult to assess whether the performance gap is attributable to the claimed architectural features.

    Authors: We acknowledge the omission. The revised abstract will briefly characterize the two ablated conditions (single-agent prompt without MCP access, and Planner-Builder merged into one agent with identical tool access) while keeping the backbone model fixed; fuller implementation details already appear in Section 5 and will be cross-referenced. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

full rationale

The paper introduces a new CoW formalization (8 modules, 93 declarations) and evaluates LAMP's agentic proof synthesis on 90 theorems spanning those modules, reporting 96.7% success against unscaffolded baselines and prior provers. No equations, fitted parameters, or self-referential definitions appear; the central performance claim is an empirical measurement on an external benchmark set in a newly formalized domain rather than a quantity derived from the ontology by construction. The ontology is supplied explicitly at inference time, and success is not redefined in terms of the input lemmas themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claims rest on the soundness of the Lean kernel, the adequacy of the newly constructed CoW ontology for guiding agents, and the assumption that agent separation plus ontology access suffices for high success rates without model adaptation.

axioms (1)
  • standard math The Lean 4 kernel provides sound verification of submitted proofs
    All success claims depend on proofs being accepted only if they pass the kernel check.
invented entities (1)
  • CoW ontology no independent evidence
    purpose: To encode structured domain knowledge for Combinatorics on Words and supply it to agents at inference time
    The ontology is developed specifically for this framework to address the gap in Mathlib.

pith-pipeline@v0.9.1-grok · 5829 in / 1378 out tokens · 58594 ms · 2026-06-30T08:32:24.423364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 23 canonical work pages · 6 internal anchors

  1. [1]

    Anthropic: Model context protocol.https://modelcontextprotocol.io(2024), accessed 2026

  2. [2]

    Asher, J.: Leanexplore: A search engine for lean 4 declarations (2025),https: //arxiv.org/abs/2506.11085

  3. [3]

    arXiv preprint arXiv:2302.12433 (2023)

    Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433 (2023)

  4. [4]

    arXiv preprint arXiv:2506.19923 (2025)

    Baba, K., Liu, C., Kurita, S., Sannai, A.: Prover agent: An agent-based framework for formal mathematical proofs. arXiv preprint arXiv:2506.19923 (2025)

  5. [5]

    Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

    Breen, B., Del Tredici, M., McCarran, J., Mijares, J.A., Yin, W.W., Sulimany, K., Taylor, J.M., Koppens, F.H., Englund, D.: Ax-prover: A deep reasoning agen- tic framework for theorem proving in mathematics and quantum physics. arXiv preprint arXiv:2510.12787 (2025)

  6. [6]

    Theoretical Computer Science412(27), 2931–2941 (2011)

    Crochemore, M., Ilie, L., Tinta, L.: The “runs” conjecture. Theoretical Computer Science412(27), 2931–2941 (2011)

  7. [7]

    Dressler, O.: Lean LSP MCP: Tools for agentic interaction with the Lean theorem prover (3 2025),https://github.com/oOo0oOo/lean-lsp-mcp

  8. [8]

    Proceedings of the American Mathematical Society16(1), 109–114 (1965)

    Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society16(1), 109–114 (1965)

  9. [9]

    First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repairwithlargelanguagemodels.In:Proceedingsofthe31stACMJointEuropean Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 1229–1241 (2023)

  10. [10]

    Fraenkel, A.S., Simpson, J.: How many squares can a string contain? J. Comb. Theory A82(1), 112–120 (1998)

  11. [11]

    Han, J.M., Rute, J., Wu, Y., Ayers, E.W., Polu, S.: Proof artifact co-training for theorem proving with language models (2022),https://arxiv.org/abs/2102. 06203

  12. [12]

    In: 12th International Conference on Interactive Theorem Proving (ITP 2021)

    Holub, Š., Starosta, Š.: Formalization of basic combinatorics on words. In: 12th International Conference on Interactive Theorem Proving (ITP 2021). Leibniz In- ternational Proceedings in Informatics (LIPIcs), vol. 193, pp. 22:1–22:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021).https://doi.org/10.4230/ LIPIcs.ITP.2021.22

  13. [13]

    In: Develop- ments in Language Theory (DLT 2021)

    Holub, Š., Starosta, Š.: Lyndon words formalized in Isabelle/HOL. In: Develop- ments in Language Theory (DLT 2021). Lecture Notes in Computer Science, vol. 12811, pp. 217–228. Springer (2021)

  14. [14]

    In: International Conference on Learning Representations

    Hu, J., Zhu, T., Welleck, S.: minictx: Neural theorem proving with (long-) contexts. In: International Conference on Learning Representations. vol. 2025, pp. 10106– 10130 (2025)

  15. [15]

    Nature pp

    Hubert, T., Mehta, R., Sartran, L., Horváth, M.Z., Žužić, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y., Masoom, H., et al.: Olympiad-level formal mathematical reasoning with reinforcement learning. Nature pp. 1–3 (2025)

  16. [16]

    arXiv preprint arXiv:2210.12283 (2022)

    Jiang, A.Q., Welleck, S., Zhou, J.P., Li, W., Liu, J., Jamnik, M., Lacroix, T., Wu, Y., Lample, G.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283 (2022)

  17. [17]

    Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19

    Lample, G., Lacroix, T., Lachaux, M.A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., Martinet, X.: Hypertree proof search for neural theorem proving. Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19

  18. [18]

    In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)

    Leroy, X., Blazy, S., Kästner, D., Schommer, B., Pister, M., Ferdinand, C.: Compcert-a formally verified optimizing compiler. In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)

  19. [19]

    arXiv preprint arXiv:2404.09939 (2024)

    Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving. arXiv preprint arXiv:2404.09939 (2024)

  20. [20]

    In: International Conference on Learning Representations

    Lin, H., Sun, Z., Welleck, S., Yang, Y.: Lean-star: Learning to interleave thinking and proving. In: International Conference on Learning Representations. vol. 2025, pp. 66041–66062 (2025)

  21. [21]

    Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., Jin, C.: Goedel-prover: A frontier model for open-source automated theorem proving (2025),https://arxiv.org/abs/2502.07640

  22. [22]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., et al.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. arXiv preprint arXiv:2508.03613 (2025)

  23. [23]

    Cambridge Mathematical Library, Cambridge University Press, 2nd edn

    Lothaire, M.: Combinatorics on Words. Cambridge Mathematical Library, Cambridge University Press, 2nd edn. (1997).https://doi.org/10.1017/ CBO9780511566097

  24. [24]

    Cambridge University Press (2002)

    Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press (2002)

  25. [25]

    In: Automated Deduction – CADE 28

    de Moura, L., Ullrich, S.: The Lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28. Lecture Notes in Computer Science, vol. 12699, pp. 625–635. Springer (2021)

  26. [26]

    Springer (2002)

    Nipkow, T., Paulson, L.C., Wenzel, M.: Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer (2002)

  27. [27]

    Discrete Applied Mathematics341, 349–358 (2023)

    Patawar, M., Kapoor, K.: The square density of words having a sequence of fs- double squares. Discrete Applied Mathematics341, 349–358 (2023)

  28. [28]

    In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344

    Polu, S., Han, J.M., Zheng, K., Baksys, M., Babuschkin, I., Sutskever, I.: For- mal mathematics statement curriculum learning. In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344

  29. [29]

    Polu, S., Sutskever, I.: Generative language modeling for automated theorem prov- ing (2020),https://arxiv.org/abs/2009.03393

  30. [30]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    Ren, Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., et al.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801 (2025)

  31. [31]

    Advances in neural information processing systems36, 68539–68551 (2023)

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettle- moyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems36, 68539–68551 (2023)

  32. [32]

    Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450

    Simpson, J.: Palindromic periodicities. Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450. pdf

  33. [33]

    arXiv preprint arXiv:2508.14644 (2025)

    Song, C., Wang, Z., Pu, F., Wang, H., Lin, X., Liu, J., Li, J., Liu, Z.: Leangeo: Formalizing competitional geometry problems in lean. arXiv preprint arXiv:2508.14644 (2025)

  34. [34]

    arXiv preprint arXiv:2404.12534 (2024)

    Song, P., Yang, K., Anandkumar, A.: Lean copilot: Large language models as copilots for theorem proving in lean. arXiv preprint arXiv:2404.12534 (2024)

  35. [35]

    Štěpán Starosta: Infinite words and morphic languages formalized in isabelle/hol (2023),https://arxiv.org/abs/2303.11445 20 Santhana Srinivasan R and Maithilee Patawar

  36. [36]

    Taylor, A.K., Zhang, J., Ji, E., Sahai, V., Deng, H., Chen, Y., Yuan, Y., Wu, D., Gu, J.C., Chang, K.W., et al.: Taobench: Do automated theorem prover llms generalize beyond mathlib? arXiv preprint arXiv:2603.12744 (2026)

  37. [37]

    Thakur,A.,Wen,Y.,Chaudhuri,S.:Alanguage-agentapproachtoformaltheorem- proving (2024),https://openreview.net/forum?id=XCMbagV0No

  38. [38]

    The Coq Development Team: The Coq proof assistant.https://coq.inria.fr (2024), accessed: 2026-06-13

  39. [39]

    In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP)

    The mathlib Community: The Lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP). pp. 367–381. ACM (2020).https://doi.org/10.1145/3372885.3373824

  40. [40]

    Thierry, A.: A proof that a word of lengthnhas less than1.5ndistinct squares (2020),https://arxiv.org/abs/2001.02996

  41. [41]

    Advances in Neural Information Processing Systems 37, 11545–11569 (2024)

    Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems 37, 11545–11569 (2024)

  42. [42]

    Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., et al.: Kimina-prover preview: Towards large formal reasoning models with reinforcement learning (2025),https://arxiv.org/abs/2504.11354

  43. [43]

    In: International Conference on Learning Representations

    Wang, H., Xin, H., Zheng, C., Liu, Z., Cao, Q., Huang, Y., Xiong, J., Shi, H., Xie, E., Yin, J., et al.: Lego-prover: Neural theorem proving with growing libraries. In: International Conference on Learning Representations. vol. 2024, pp. 31566–31597 (2024)

  44. [44]

    arXiv e-prints pp

    Wang, R., Pan, R., Li, Y., Zhang, J., Jia, Y., Diao, S., Pi, R., Hu, J., Zhang, T.: Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints pp. arXiv–2503 (2025)

  45. [45]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Wu, Y., Jiang, A.Q., Li, W., Rabe, M., Staats, C., Jamnik, M., Szegedy, C.: Aut- oformalization with large language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 32353–32368 (2022)

  46. [46]

    Wu, Z., Huang, S., Zhou, Z., Ying, H., Yuan, Z., Zhang, W., Lin, D., Chen, K.: Internlm2.5-stepprover: Advancing automated theorem proving via critic-guided search (2025),https://arxiv.org/abs/2410.15700

  47. [47]

    Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., Liang, X.: Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data (2024),https://arxiv.org/abs/2405.14333

  48. [48]

    Xin, H., Ren, Z.Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., et al.: Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search (2024),https://arxiv.org/ abs/2408.08152

  49. [49]

    Advances in Neural Information Processing Systems36, 21573– 21612 (2023)

    Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented lan- guage models. Advances in Neural Information Processing Systems36, 21573– 21612 (2023)

  50. [50]

    In: International Conference on Learning Representations (ICLR) (2023)

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)

  51. [51]

    Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21

    Ying, H., Wu, Z., Geng, Y., Wang, J., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems. Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21

  52. [52]

    arXiv preprint arXiv:2505.02735 (2025)

    Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., et al.: Formalmath: Benchmarking formal mathematical reasoning of large language models. arXiv preprint arXiv:2505.02735 (2025)

  53. [53]

    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

    Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110 (2021) Appendix A The CoW Library: Additional Details Design rationale Words are represented asListαrather than a custom inductive type, so that all of Lean’s existing list lemmas and automation are available for free.Fa...