LAMP: Lean-based Agentic framework with MCP and Proof Repair

Maithilee Patawar; Santhana Srinivasan R

arxiv: 2606.28841 · v1 · pith:Q63KI7K7new · submitted 2026-06-27 · 💻 cs.LO · cs.AI· cs.CL

LAMP: Lean-based Agentic framework with MCP and Proof Repair

Santhana Srinivasan R , Maithilee Patawar This is my paper

Pith reviewed 2026-06-30 08:32 UTC · model grok-4.3

classification 💻 cs.LO cs.AIcs.CL

keywords Lean 4theorem provingmulti-agent systemscombinatorics on wordsformal verificationautomated reasoningontology

0 comments

The pith

LAMP uses Planner-Builder-Verifier agents and a CoW ontology to generate kernel-verified Lean proofs for 96.7% of 90 test theorems

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes Combinatorics on Words in Lean 4 across eight modules with 93 core declarations and lemmas. It introduces LAMP, a multi-agent system that supplies this domain knowledge at inference time through the Model Context Protocol rather than fine-tuning the underlying model. On 90 theorems spanning all modules and three difficulty levels, the framework produces verified proofs for 96.7 percent of cases, beating both an unscaffolded baseline and existing provers. Ablations show that removing the tool-grounded architecture or the Planner-Builder split each reduces performance by roughly twelve points. The work demonstrates that explicit structured domain knowledge delivered via ontology can enable reliable proof synthesis in areas underrepresented in Mathlib.

Core claim

LAMP coordinates a Planner, Builder, and Verifier with MCP-based access to a CoW ontology to synthesize kernel-verified Lean 4 proofs, reaching 96.7 percent success on 90 theorems across eight modules and three difficulty levels without fine-tuning the backbone model on CoW-specific data.

What carries the argument

The LAMP multi-agent framework that separates planning, building, and verification while giving agents Model Context Protocol access to an explicit CoW ontology at inference time.

If this is right

Providing structured ontologies at inference time can replace fine-tuning for theorem-proving tasks in narrow domains.
Separating Planner and Builder roles improves verified-proof yield by approximately twelve percentage points.
The same architecture works across easy, medium, and hard theorems within the eight CoW modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar ontology-augmented agents could be tested on other Mathlib gaps once domain formalizations exist.
The framework's reliance on proof repair steps may reveal recurring error patterns that could be addressed by additional agent roles.
Success without fine-tuning suggests the method could lower the data cost of adapting provers to new subfields.

Load-bearing premise

The provided CoW ontology contains enough explicit structured knowledge for the agents to produce kernel-verified proofs without any fine-tuning on domain-specific data.

What would settle it

Measure LAMP success rate on a new set of theorems drawn from a different specialized mathematical domain that lacks a corresponding ontology.

Figures

Figures reproduced from arXiv: 2606.28841 by Maithilee Patawar, Santhana Srinivasan R.

**Figure 1.** Figure 1: LAMP transforms a theorem with sorry into a kernel-verified proof 2. LAMP, a multi-agent framework that injects domain knowledge into LLM proof synthesis via MCP tools. It includes a structured CoW ontology that is used by a dedicated Planner agent. This allows the framework to adapt to the CoW domain without retraining the underlying model. 3. A CoW Evaluation Suite of 90 theorems across the eight modules… view at source ↗

**Figure 2.** Figure 2: LAMP architecture and data flow placeholder. Its design follows a single principle: rather than asking one language model to perform mathematical reasoning, Lean coding, and error recovery simultaneously, LAMP decomposes the task across specialized agents and grounds each of them in external tools so that decisions are based on verified facts rather than the model’s recollection. Three core agents coopera… view at source ↗

read the original abstract

Large language models are increasingly capable of mathematical reasoning, but the proofs they generate are often unreliable and hard to verify. Interactive theorem provers such as Lean 4 address this by accepting only kernel-checked proofs; however, their reach is bounded by the formalized knowledge available. While Mathlib, a repository of formalized Lean 4 theorems that covers diverse mathematical areas, certain specialized areas remain underrepresented; notably, the domain of Combinatorics on Words (CoW). CoW studies sequences, exploring their properties such as periodicity, borders, conjugacy, and morphisms. As a result, specialized provers, trained on Mathlib-centered data, lack the lemmas to operate in CoW. We present two contributions. First, we introduce a Lean 4 formalization of CoW containing eight modules and \textbf{93} declarations of core definitions and foundational lemmas. Second, we present LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs by providing explicit, structured domain knowledge at inference time through an ontology, rather than by fine-tuning a prover. LAMP coordinates a Planner, Builder, and Verifier with Model Context Protocol based access to a domain-specific CoW ontology. In a suite of 90 CoW theorems that span all eight modules and three difficulty levels, LAMP synthesizes verified proofs for 96.7% of theorems, substantially exceeding both an unscaffolded baseline and existing specialized provers. An ablation shows that removing LAMP's tool-grounded architecture or its Planner/Builder separation each cost roughly 12 percentage points, even with the backbone model held fixed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LAMP adds a usable CoW formalization in Lean and shows that runtime ontology access via agents can lift proof rates without fine-tuning, though the 96.7% result rests on an unexamined test-set choice.

read the letter

LAMP formalizes Combinatorics on Words in Lean with eight modules and 93 declarations, then runs a Planner-Builder-Verifier loop that pulls structured knowledge from that ontology at inference time through MCP. On their 90-theorem suite the system reaches 96.7% kernel-verified proofs, beating both an unscaffolded baseline and prior specialized provers. The ablations, run with the backbone model fixed, show that removing the tool access or the Planner/Builder split each drops performance by roughly twelve points.

The formalization itself is the most immediately reusable piece; anyone working in CoW now has a starting library. The agent architecture is also concrete: it keeps domain knowledge explicit and external rather than baked into weights, which matches the stated goal of avoiding fine-tuning.

The soft spot is the test set. The abstract gives no information on how the 90 theorems were chosen or whether they were locked in before the ontology was written. If the theorems were selected because they follow directly from the 93 supplied lemmas, the high success rate mainly shows that the agents can use the provided material, not that the method works on arbitrary statements in the domain. That distinction matters for the claim.

People building domain-specific extensions to Lean or testing agentic proof systems will get the most from this. The numbers and the ablation design are specific enough to evaluate in review, even if the selection question needs a clearer answer.

Send it to peer review.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a Lean 4 formalization of Combinatorics on Words (CoW) with eight modules and 93 declarations of core definitions and foundational lemmas. It presents LAMP, a multi-agent framework (Planner-Builder-Verifier) that uses Model Context Protocol (MCP) to supply the CoW ontology at inference time for synthesizing kernel-verified proofs, reporting a 96.7% success rate on 90 theorems spanning all modules and three difficulty levels. This substantially exceeds an unscaffolded baseline and existing specialized provers, with ablations indicating that removing the tool-grounded architecture or Planner/Builder separation each reduces performance by roughly 12 percentage points even with the backbone model fixed.

Significance. If the central result holds, the work is significant for demonstrating that explicit structured domain knowledge can be provided via an ontology at inference time to achieve high rates of kernel-verified proof synthesis in an underrepresented mathematical domain without fine-tuning. The machine-checked proofs, the ablation experiments isolating architectural components, and the coverage across modules and difficulty levels are strengths that support the value of the agentic approach for extending interactive theorem provers.

major comments (3)

[Abstract] Abstract: The central claim that the CoW ontology (8 modules, 93 declarations) supplies sufficient explicit structured domain knowledge for the Planner-Builder-Verifier agents to generate kernel-accepted proofs on arbitrary CoW statements is load-bearing for the 96.7% success rate, yet the abstract provides no concrete examples of the declarations, their structure, or how MCP exposes them to the agents.
[Abstract] Abstract (results paragraph): The 90 theorems are described as spanning all eight modules and three difficulty levels, but no selection criteria, independence from ontology construction, or sampling procedure is stated; without this, it is impossible to rule out that the theorems were chosen precisely because they are directly provable from the supplied 93 lemmas, undermining the generalizability interpretation.
[Ablation paragraph] Ablation paragraph: The reported ~12 percentage point drops when removing the tool-grounded architecture or Planner/Builder separation are presented with the backbone model held fixed, but the abstract gives no information on how the ablated baselines were implemented (e.g., prompt structure, tool access, or agent coordination), making it difficult to assess whether the performance gap is attributable to the claimed architectural features.

minor comments (2)

[Abstract] The expansion of the MCP acronym is given only after its first use; spelling it out on first mention would improve readability.
[Abstract] The abstract refers to 'existing specialized provers' without naming them or citing their reported success rates on CoW or comparable domains; adding these references would strengthen the comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. All points can be resolved through targeted revisions to the abstract and methods sections to improve clarity and transparency without altering the core claims or results.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the CoW ontology (8 modules, 93 declarations) supplies sufficient explicit structured domain knowledge for the Planner-Builder-Verifier agents to generate kernel-accepted proofs on arbitrary CoW statements is load-bearing for the 96.7% success rate, yet the abstract provides no concrete examples of the declarations, their structure, or how MCP exposes them to the agents.

Authors: We agree that the abstract would benefit from greater concreteness on this point. In the revised manuscript we will add a short illustrative example of a CoW declaration (e.g., a periodicity lemma) together with a one-sentence description of its JSON-like structure and the MCP endpoint that supplies it to the agents at inference time. revision: yes
Referee: [Abstract] Abstract (results paragraph): The 90 theorems are described as spanning all eight modules and three difficulty levels, but no selection criteria, independence from ontology construction, or sampling procedure is stated; without this, it is impossible to rule out that the theorems were chosen precisely because they are directly provable from the supplied 93 lemmas, undermining the generalizability interpretation.

Authors: The 90 theorems were drawn from the eight modules with explicit stratification by difficulty and with the requirement that each theorem require at least one non-trivial inference step beyond direct lookup of the 93 declarations. We will insert a concise statement of these selection criteria and the independence check into the revised abstract and expand the description in Section 4. revision: yes
Referee: [Ablation paragraph] Ablation paragraph: The reported ~12 percentage point drops when removing the tool-grounded architecture or Planner/Builder separation are presented with the backbone model held fixed, but the abstract gives no information on how the ablated baselines were implemented (e.g., prompt structure, tool access, or agent coordination), making it difficult to assess whether the performance gap is attributable to the claimed architectural features.

Authors: We acknowledge the omission. The revised abstract will briefly characterize the two ablated conditions (single-agent prompt without MCP access, and Planner-Builder merged into one agent with identical tool access) while keeping the backbone model fixed; fuller implementation details already appear in Section 5 and will be cross-referenced. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

full rationale

The paper introduces a new CoW formalization (8 modules, 93 declarations) and evaluates LAMP's agentic proof synthesis on 90 theorems spanning those modules, reporting 96.7% success against unscaffolded baselines and prior provers. No equations, fitted parameters, or self-referential definitions appear; the central performance claim is an empirical measurement on an external benchmark set in a newly formalized domain rather than a quantity derived from the ontology by construction. The ontology is supplied explicitly at inference time, and success is not redefined in terms of the input lemmas themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claims rest on the soundness of the Lean kernel, the adequacy of the newly constructed CoW ontology for guiding agents, and the assumption that agent separation plus ontology access suffices for high success rates without model adaptation.

axioms (1)

standard math The Lean 4 kernel provides sound verification of submitted proofs
All success claims depend on proofs being accepted only if they pass the kernel check.

invented entities (1)

CoW ontology no independent evidence
purpose: To encode structured domain knowledge for Combinatorics on Words and supply it to agents at inference time
The ontology is developed specifically for this framework to address the gap in Mathlib.

pith-pipeline@v0.9.1-grok · 5829 in / 1378 out tokens · 58594 ms · 2026-06-30T08:32:24.423364+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 23 canonical work pages · 6 internal anchors

[1]

Anthropic: Model context protocol.https://modelcontextprotocol.io(2024), accessed 2026

2024
[2]

Asher, J.: Leanexplore: A search engine for lean 4 declarations (2025),https: //arxiv.org/abs/2506.11085

work page arXiv 2025
[3]

arXiv preprint arXiv:2302.12433 (2023)

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433 (2023)

work page arXiv 2023
[4]

arXiv preprint arXiv:2506.19923 (2025)

Baba, K., Liu, C., Kurita, S., Sannai, A.: Prover agent: An agent-based framework for formal mathematical proofs. arXiv preprint arXiv:2506.19923 (2025)

work page arXiv 2025
[5]

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Breen, B., Del Tredici, M., McCarran, J., Mijares, J.A., Yin, W.W., Sulimany, K., Taylor, J.M., Koppens, F.H., Englund, D.: Ax-prover: A deep reasoning agen- tic framework for theorem proving in mathematics and quantum physics. arXiv preprint arXiv:2510.12787 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Theoretical Computer Science412(27), 2931–2941 (2011)

Crochemore, M., Ilie, L., Tinta, L.: The “runs” conjecture. Theoretical Computer Science412(27), 2931–2941 (2011)

2011
[7]

Dressler, O.: Lean LSP MCP: Tools for agentic interaction with the Lean theorem prover (3 2025),https://github.com/oOo0oOo/lean-lsp-mcp

2025
[8]

Proceedings of the American Mathematical Society16(1), 109–114 (1965)

Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society16(1), 109–114 (1965)

1965
[9]

First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repairwithlargelanguagemodels.In:Proceedingsofthe31stACMJointEuropean Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 1229–1241 (2023)

2023
[10]

Fraenkel, A.S., Simpson, J.: How many squares can a string contain? J. Comb. Theory A82(1), 112–120 (1998)

1998
[11]

Han, J.M., Rute, J., Wu, Y., Ayers, E.W., Polu, S.: Proof artifact co-training for theorem proving with language models (2022),https://arxiv.org/abs/2102. 06203

2022
[12]

In: 12th International Conference on Interactive Theorem Proving (ITP 2021)

Holub, Š., Starosta, Š.: Formalization of basic combinatorics on words. In: 12th International Conference on Interactive Theorem Proving (ITP 2021). Leibniz In- ternational Proceedings in Informatics (LIPIcs), vol. 193, pp. 22:1–22:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021).https://doi.org/10.4230/ LIPIcs.ITP.2021.22

2021
[13]

In: Develop- ments in Language Theory (DLT 2021)

Holub, Š., Starosta, Š.: Lyndon words formalized in Isabelle/HOL. In: Develop- ments in Language Theory (DLT 2021). Lecture Notes in Computer Science, vol. 12811, pp. 217–228. Springer (2021)

2021
[14]

In: International Conference on Learning Representations

Hu, J., Zhu, T., Welleck, S.: minictx: Neural theorem proving with (long-) contexts. In: International Conference on Learning Representations. vol. 2025, pp. 10106– 10130 (2025)

2025
[15]

Nature pp

Hubert, T., Mehta, R., Sartran, L., Horváth, M.Z., Žužić, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y., Masoom, H., et al.: Olympiad-level formal mathematical reasoning with reinforcement learning. Nature pp. 1–3 (2025)

2025
[16]

arXiv preprint arXiv:2210.12283 (2022)

Jiang, A.Q., Welleck, S., Zhou, J.P., Li, W., Liu, J., Jamnik, M., Lacroix, T., Wu, Y., Lample, G.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283 (2022)

work page arXiv 2022
[17]

Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19

Lample, G., Lacroix, T., Lachaux, M.A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., Martinet, X.: Hypertree proof search for neural theorem proving. Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19

2022
[18]

In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)

Leroy, X., Blazy, S., Kästner, D., Schommer, B., Pister, M., Ferdinand, C.: Compcert-a formally verified optimizing compiler. In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)

2016
[19]

arXiv preprint arXiv:2404.09939 (2024)

Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving. arXiv preprint arXiv:2404.09939 (2024)

work page arXiv 2024
[20]

In: International Conference on Learning Representations

Lin, H., Sun, Z., Welleck, S., Yang, Y.: Lean-star: Learning to interleave thinking and proving. In: International Conference on Learning Representations. vol. 2025, pp. 66041–66062 (2025)

2025
[21]

Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., Jin, C.: Goedel-prover: A frontier model for open-source automated theorem proving (2025),https://arxiv.org/abs/2502.07640

work page arXiv 2025
[22]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., et al.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. arXiv preprint arXiv:2508.03613 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Cambridge Mathematical Library, Cambridge University Press, 2nd edn

Lothaire, M.: Combinatorics on Words. Cambridge Mathematical Library, Cambridge University Press, 2nd edn. (1997).https://doi.org/10.1017/ CBO9780511566097

1997
[24]

Cambridge University Press (2002)

Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press (2002)

2002
[25]

In: Automated Deduction – CADE 28

de Moura, L., Ullrich, S.: The Lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28. Lecture Notes in Computer Science, vol. 12699, pp. 625–635. Springer (2021)

2021
[26]

Springer (2002)

Nipkow, T., Paulson, L.C., Wenzel, M.: Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer (2002)

2002
[27]

Discrete Applied Mathematics341, 349–358 (2023)

Patawar, M., Kapoor, K.: The square density of words having a sequence of fs- double squares. Discrete Applied Mathematics341, 349–358 (2023)

2023
[28]

In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344

Polu, S., Han, J.M., Zheng, K., Baksys, M., Babuschkin, I., Sutskever, I.: For- mal mathematics statement curriculum learning. In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344

work page arXiv 2023
[29]

Polu, S., Sutskever, I.: Generative language modeling for automated theorem prov- ing (2020),https://arxiv.org/abs/2009.03393

work page internal anchor Pith review Pith/arXiv arXiv 2020
[30]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Ren, Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., et al.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Advances in neural information processing systems36, 68539–68551 (2023)

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettle- moyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems36, 68539–68551 (2023)

2023
[32]

Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450

Simpson, J.: Palindromic periodicities. Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450. pdf

2025
[33]

arXiv preprint arXiv:2508.14644 (2025)

Song, C., Wang, Z., Pu, F., Wang, H., Lin, X., Liu, J., Li, J., Liu, Z.: Leangeo: Formalizing competitional geometry problems in lean. arXiv preprint arXiv:2508.14644 (2025)

work page arXiv 2025
[34]

arXiv preprint arXiv:2404.12534 (2024)

Song, P., Yang, K., Anandkumar, A.: Lean copilot: Large language models as copilots for theorem proving in lean. arXiv preprint arXiv:2404.12534 (2024)

work page arXiv 2024
[35]

Štěpán Starosta: Infinite words and morphic languages formalized in isabelle/hol (2023),https://arxiv.org/abs/2303.11445 20 Santhana Srinivasan R and Maithilee Patawar

work page arXiv 2023
[36]

Taylor, A.K., Zhang, J., Ji, E., Sahai, V., Deng, H., Chen, Y., Yuan, Y., Wu, D., Gu, J.C., Chang, K.W., et al.: Taobench: Do automated theorem prover llms generalize beyond mathlib? arXiv preprint arXiv:2603.12744 (2026)

work page arXiv 2026
[37]

Thakur,A.,Wen,Y.,Chaudhuri,S.:Alanguage-agentapproachtoformaltheorem- proving (2024),https://openreview.net/forum?id=XCMbagV0No

2024
[38]

The Coq Development Team: The Coq proof assistant.https://coq.inria.fr (2024), accessed: 2026-06-13

2024
[39]

In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP)

The mathlib Community: The Lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP). pp. 367–381. ACM (2020).https://doi.org/10.1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020
[40]

Thierry, A.: A proof that a word of lengthnhas less than1.5ndistinct squares (2020),https://arxiv.org/abs/2001.02996

work page arXiv 2020
[41]

Advances in Neural Information Processing Systems 37, 11545–11569 (2024)

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems 37, 11545–11569 (2024)

2024
[42]

Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., et al.: Kimina-prover preview: Towards large formal reasoning models with reinforcement learning (2025),https://arxiv.org/abs/2504.11354

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

In: International Conference on Learning Representations

Wang, H., Xin, H., Zheng, C., Liu, Z., Cao, Q., Huang, Y., Xiong, J., Shi, H., Xie, E., Yin, J., et al.: Lego-prover: Neural theorem proving with growing libraries. In: International Conference on Learning Representations. vol. 2024, pp. 31566–31597 (2024)

2024
[44]

arXiv e-prints pp

Wang, R., Pan, R., Li, Y., Zhang, J., Jia, Y., Diao, S., Pi, R., Hu, J., Zhang, T.: Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints pp. arXiv–2503 (2025)

2025
[45]

In: Advances in Neural Information Processing Systems (NeurIPS)

Wu, Y., Jiang, A.Q., Li, W., Rabe, M., Staats, C., Jamnik, M., Szegedy, C.: Aut- oformalization with large language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 32353–32368 (2022)

2022
[46]

Wu, Z., Huang, S., Zhou, Z., Ying, H., Yuan, Z., Zhang, W., Lin, D., Chen, K.: Internlm2.5-stepprover: Advancing automated theorem proving via critic-guided search (2025),https://arxiv.org/abs/2410.15700

work page arXiv 2025
[47]

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., Liang, X.: Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data (2024),https://arxiv.org/abs/2405.14333

work page arXiv 2024
[48]

Xin, H., Ren, Z.Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., et al.: Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search (2024),https://arxiv.org/ abs/2408.08152

work page arXiv 2024
[49]

Advances in Neural Information Processing Systems36, 21573– 21612 (2023)

Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented lan- guage models. Advances in Neural Information Processing Systems36, 21573– 21612 (2023)

2023
[50]

In: International Conference on Learning Representations (ICLR) (2023)

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)

2023
[51]

Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21

Ying, H., Wu, Z., Geng, Y., Wang, J., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems. Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21

2024
[52]

arXiv preprint arXiv:2505.02735 (2025)

Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., et al.: Formalmath: Benchmarking formal mathematical reasoning of large language models. arXiv preprint arXiv:2505.02735 (2025)

work page arXiv 2025
[53]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110 (2021) Appendix A The CoW Library: Additional Details Design rationale Words are represented asListαrather than a custom inductive type, so that all of Lean’s existing list lemmas and automation are available for free.Fa...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[1] [1]

Anthropic: Model context protocol.https://modelcontextprotocol.io(2024), accessed 2026

2024

[2] [2]

Asher, J.: Leanexplore: A search engine for lean 4 declarations (2025),https: //arxiv.org/abs/2506.11085

work page arXiv 2025

[3] [3]

arXiv preprint arXiv:2302.12433 (2023)

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433 (2023)

work page arXiv 2023

[4] [4]

arXiv preprint arXiv:2506.19923 (2025)

Baba, K., Liu, C., Kurita, S., Sannai, A.: Prover agent: An agent-based framework for formal mathematical proofs. arXiv preprint arXiv:2506.19923 (2025)

work page arXiv 2025

[5] [5]

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Breen, B., Del Tredici, M., McCarran, J., Mijares, J.A., Yin, W.W., Sulimany, K., Taylor, J.M., Koppens, F.H., Englund, D.: Ax-prover: A deep reasoning agen- tic framework for theorem proving in mathematics and quantum physics. arXiv preprint arXiv:2510.12787 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Theoretical Computer Science412(27), 2931–2941 (2011)

Crochemore, M., Ilie, L., Tinta, L.: The “runs” conjecture. Theoretical Computer Science412(27), 2931–2941 (2011)

2011

[7] [7]

Dressler, O.: Lean LSP MCP: Tools for agentic interaction with the Lean theorem prover (3 2025),https://github.com/oOo0oOo/lean-lsp-mcp

2025

[8] [8]

Proceedings of the American Mathematical Society16(1), 109–114 (1965)

Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society16(1), 109–114 (1965)

1965

[9] [9]

First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repairwithlargelanguagemodels.In:Proceedingsofthe31stACMJointEuropean Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 1229–1241 (2023)

2023

[10] [10]

Fraenkel, A.S., Simpson, J.: How many squares can a string contain? J. Comb. Theory A82(1), 112–120 (1998)

1998

[11] [11]

Han, J.M., Rute, J., Wu, Y., Ayers, E.W., Polu, S.: Proof artifact co-training for theorem proving with language models (2022),https://arxiv.org/abs/2102. 06203

2022

[12] [12]

In: 12th International Conference on Interactive Theorem Proving (ITP 2021)

Holub, Š., Starosta, Š.: Formalization of basic combinatorics on words. In: 12th International Conference on Interactive Theorem Proving (ITP 2021). Leibniz In- ternational Proceedings in Informatics (LIPIcs), vol. 193, pp. 22:1–22:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021).https://doi.org/10.4230/ LIPIcs.ITP.2021.22

2021

[13] [13]

In: Develop- ments in Language Theory (DLT 2021)

Holub, Š., Starosta, Š.: Lyndon words formalized in Isabelle/HOL. In: Develop- ments in Language Theory (DLT 2021). Lecture Notes in Computer Science, vol. 12811, pp. 217–228. Springer (2021)

2021

[14] [14]

In: International Conference on Learning Representations

Hu, J., Zhu, T., Welleck, S.: minictx: Neural theorem proving with (long-) contexts. In: International Conference on Learning Representations. vol. 2025, pp. 10106– 10130 (2025)

2025

[15] [15]

Nature pp

Hubert, T., Mehta, R., Sartran, L., Horváth, M.Z., Žužić, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y., Masoom, H., et al.: Olympiad-level formal mathematical reasoning with reinforcement learning. Nature pp. 1–3 (2025)

2025

[16] [16]

arXiv preprint arXiv:2210.12283 (2022)

Jiang, A.Q., Welleck, S., Zhou, J.P., Li, W., Liu, J., Jamnik, M., Lacroix, T., Wu, Y., Lample, G.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283 (2022)

work page arXiv 2022

[17] [17]

Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19

Lample, G., Lacroix, T., Lachaux, M.A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., Martinet, X.: Hypertree proof search for neural theorem proving. Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19

2022

[18] [18]

In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)

Leroy, X., Blazy, S., Kästner, D., Schommer, B., Pister, M., Ferdinand, C.: Compcert-a formally verified optimizing compiler. In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)

2016

[19] [19]

arXiv preprint arXiv:2404.09939 (2024)

Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving. arXiv preprint arXiv:2404.09939 (2024)

work page arXiv 2024

[20] [20]

In: International Conference on Learning Representations

Lin, H., Sun, Z., Welleck, S., Yang, Y.: Lean-star: Learning to interleave thinking and proving. In: International Conference on Learning Representations. vol. 2025, pp. 66041–66062 (2025)

2025

[21] [21]

Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., Jin, C.: Goedel-prover: A frontier model for open-source automated theorem proving (2025),https://arxiv.org/abs/2502.07640

work page arXiv 2025

[22] [22]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., et al.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. arXiv preprint arXiv:2508.03613 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Cambridge Mathematical Library, Cambridge University Press, 2nd edn

Lothaire, M.: Combinatorics on Words. Cambridge Mathematical Library, Cambridge University Press, 2nd edn. (1997).https://doi.org/10.1017/ CBO9780511566097

1997

[24] [24]

Cambridge University Press (2002)

Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press (2002)

2002

[25] [25]

In: Automated Deduction – CADE 28

de Moura, L., Ullrich, S.: The Lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28. Lecture Notes in Computer Science, vol. 12699, pp. 625–635. Springer (2021)

2021

[26] [26]

Springer (2002)

Nipkow, T., Paulson, L.C., Wenzel, M.: Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer (2002)

2002

[27] [27]

Discrete Applied Mathematics341, 349–358 (2023)

Patawar, M., Kapoor, K.: The square density of words having a sequence of fs- double squares. Discrete Applied Mathematics341, 349–358 (2023)

2023

[28] [28]

In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344

Polu, S., Han, J.M., Zheng, K., Baksys, M., Babuschkin, I., Sutskever, I.: For- mal mathematics statement curriculum learning. In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344

work page arXiv 2023

[29] [29]

Polu, S., Sutskever, I.: Generative language modeling for automated theorem prov- ing (2020),https://arxiv.org/abs/2009.03393

work page internal anchor Pith review Pith/arXiv arXiv 2020

[30] [30]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Ren, Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., et al.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Advances in neural information processing systems36, 68539–68551 (2023)

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettle- moyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems36, 68539–68551 (2023)

2023

[32] [32]

Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450

Simpson, J.: Palindromic periodicities. Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450. pdf

2025

[33] [33]

arXiv preprint arXiv:2508.14644 (2025)

Song, C., Wang, Z., Pu, F., Wang, H., Lin, X., Liu, J., Li, J., Liu, Z.: Leangeo: Formalizing competitional geometry problems in lean. arXiv preprint arXiv:2508.14644 (2025)

work page arXiv 2025

[34] [34]

arXiv preprint arXiv:2404.12534 (2024)

Song, P., Yang, K., Anandkumar, A.: Lean copilot: Large language models as copilots for theorem proving in lean. arXiv preprint arXiv:2404.12534 (2024)

work page arXiv 2024

[35] [35]

Štěpán Starosta: Infinite words and morphic languages formalized in isabelle/hol (2023),https://arxiv.org/abs/2303.11445 20 Santhana Srinivasan R and Maithilee Patawar

work page arXiv 2023

[36] [36]

Taylor, A.K., Zhang, J., Ji, E., Sahai, V., Deng, H., Chen, Y., Yuan, Y., Wu, D., Gu, J.C., Chang, K.W., et al.: Taobench: Do automated theorem prover llms generalize beyond mathlib? arXiv preprint arXiv:2603.12744 (2026)

work page arXiv 2026

[37] [37]

Thakur,A.,Wen,Y.,Chaudhuri,S.:Alanguage-agentapproachtoformaltheorem- proving (2024),https://openreview.net/forum?id=XCMbagV0No

2024

[38] [38]

The Coq Development Team: The Coq proof assistant.https://coq.inria.fr (2024), accessed: 2026-06-13

2024

[39] [39]

In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP)

The mathlib Community: The Lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP). pp. 367–381. ACM (2020).https://doi.org/10.1145/3372885.3373824

work page doi:10.1145/3372885.3373824 2020

[40] [40]

Thierry, A.: A proof that a word of lengthnhas less than1.5ndistinct squares (2020),https://arxiv.org/abs/2001.02996

work page arXiv 2020

[41] [41]

Advances in Neural Information Processing Systems 37, 11545–11569 (2024)

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems 37, 11545–11569 (2024)

2024

[42] [42]

Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., et al.: Kimina-prover preview: Towards large formal reasoning models with reinforcement learning (2025),https://arxiv.org/abs/2504.11354

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

In: International Conference on Learning Representations

Wang, H., Xin, H., Zheng, C., Liu, Z., Cao, Q., Huang, Y., Xiong, J., Shi, H., Xie, E., Yin, J., et al.: Lego-prover: Neural theorem proving with growing libraries. In: International Conference on Learning Representations. vol. 2024, pp. 31566–31597 (2024)

2024

[44] [44]

arXiv e-prints pp

Wang, R., Pan, R., Li, Y., Zhang, J., Jia, Y., Diao, S., Pi, R., Hu, J., Zhang, T.: Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints pp. arXiv–2503 (2025)

2025

[45] [45]

In: Advances in Neural Information Processing Systems (NeurIPS)

Wu, Y., Jiang, A.Q., Li, W., Rabe, M., Staats, C., Jamnik, M., Szegedy, C.: Aut- oformalization with large language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 32353–32368 (2022)

2022

[46] [46]

Wu, Z., Huang, S., Zhou, Z., Ying, H., Yuan, Z., Zhang, W., Lin, D., Chen, K.: Internlm2.5-stepprover: Advancing automated theorem proving via critic-guided search (2025),https://arxiv.org/abs/2410.15700

work page arXiv 2025

[47] [47]

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., Liang, X.: Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data (2024),https://arxiv.org/abs/2405.14333

work page arXiv 2024

[48] [48]

Xin, H., Ren, Z.Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., et al.: Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search (2024),https://arxiv.org/ abs/2408.08152

work page arXiv 2024

[49] [49]

Advances in Neural Information Processing Systems36, 21573– 21612 (2023)

Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented lan- guage models. Advances in Neural Information Processing Systems36, 21573– 21612 (2023)

2023

[50] [50]

In: International Conference on Learning Representations (ICLR) (2023)

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)

2023

[51] [51]

Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21

Ying, H., Wu, Z., Geng, Y., Wang, J., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems. Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21

2024

[52] [52]

arXiv preprint arXiv:2505.02735 (2025)

Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., et al.: Formalmath: Benchmarking formal mathematical reasoning of large language models. arXiv preprint arXiv:2505.02735 (2025)

work page arXiv 2025

[53] [53]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110 (2021) Appendix A The CoW Library: Additional Details Design rationale Words are represented asListαrather than a custom inductive type, so that all of Lean’s existing list lemmas and automation are available for free.Fa...

work page internal anchor Pith review Pith/arXiv arXiv 2021