LAMP: Lean-based Agentic framework with MCP and Proof Repair
Pith reviewed 2026-06-30 08:32 UTC · model grok-4.3
The pith
LAMP uses Planner-Builder-Verifier agents and a CoW ontology to generate kernel-verified Lean proofs for 96.7% of 90 test theorems
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LAMP coordinates a Planner, Builder, and Verifier with MCP-based access to a CoW ontology to synthesize kernel-verified Lean 4 proofs, reaching 96.7 percent success on 90 theorems across eight modules and three difficulty levels without fine-tuning the backbone model on CoW-specific data.
What carries the argument
The LAMP multi-agent framework that separates planning, building, and verification while giving agents Model Context Protocol access to an explicit CoW ontology at inference time.
If this is right
- Providing structured ontologies at inference time can replace fine-tuning for theorem-proving tasks in narrow domains.
- Separating Planner and Builder roles improves verified-proof yield by approximately twelve percentage points.
- The same architecture works across easy, medium, and hard theorems within the eight CoW modules.
Where Pith is reading between the lines
- Similar ontology-augmented agents could be tested on other Mathlib gaps once domain formalizations exist.
- The framework's reliance on proof repair steps may reveal recurring error patterns that could be addressed by additional agent roles.
- Success without fine-tuning suggests the method could lower the data cost of adapting provers to new subfields.
Load-bearing premise
The provided CoW ontology contains enough explicit structured knowledge for the agents to produce kernel-verified proofs without any fine-tuning on domain-specific data.
What would settle it
Measure LAMP success rate on a new set of theorems drawn from a different specialized mathematical domain that lacks a corresponding ontology.
Figures
read the original abstract
Large language models are increasingly capable of mathematical reasoning, but the proofs they generate are often unreliable and hard to verify. Interactive theorem provers such as Lean 4 address this by accepting only kernel-checked proofs; however, their reach is bounded by the formalized knowledge available. While Mathlib, a repository of formalized Lean 4 theorems that covers diverse mathematical areas, certain specialized areas remain underrepresented; notably, the domain of Combinatorics on Words (CoW). CoW studies sequences, exploring their properties such as periodicity, borders, conjugacy, and morphisms. As a result, specialized provers, trained on Mathlib-centered data, lack the lemmas to operate in CoW. We present two contributions. First, we introduce a Lean 4 formalization of CoW containing eight modules and \textbf{93} declarations of core definitions and foundational lemmas. Second, we present LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs by providing explicit, structured domain knowledge at inference time through an ontology, rather than by fine-tuning a prover. LAMP coordinates a Planner, Builder, and Verifier with Model Context Protocol based access to a domain-specific CoW ontology. In a suite of 90 CoW theorems that span all eight modules and three difficulty levels, LAMP synthesizes verified proofs for 96.7% of theorems, substantially exceeding both an unscaffolded baseline and existing specialized provers. An ablation shows that removing LAMP's tool-grounded architecture or its Planner/Builder separation each cost roughly 12 percentage points, even with the backbone model held fixed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a Lean 4 formalization of Combinatorics on Words (CoW) with eight modules and 93 declarations of core definitions and foundational lemmas. It presents LAMP, a multi-agent framework (Planner-Builder-Verifier) that uses Model Context Protocol (MCP) to supply the CoW ontology at inference time for synthesizing kernel-verified proofs, reporting a 96.7% success rate on 90 theorems spanning all modules and three difficulty levels. This substantially exceeds an unscaffolded baseline and existing specialized provers, with ablations indicating that removing the tool-grounded architecture or Planner/Builder separation each reduces performance by roughly 12 percentage points even with the backbone model fixed.
Significance. If the central result holds, the work is significant for demonstrating that explicit structured domain knowledge can be provided via an ontology at inference time to achieve high rates of kernel-verified proof synthesis in an underrepresented mathematical domain without fine-tuning. The machine-checked proofs, the ablation experiments isolating architectural components, and the coverage across modules and difficulty levels are strengths that support the value of the agentic approach for extending interactive theorem provers.
major comments (3)
- [Abstract] Abstract: The central claim that the CoW ontology (8 modules, 93 declarations) supplies sufficient explicit structured domain knowledge for the Planner-Builder-Verifier agents to generate kernel-accepted proofs on arbitrary CoW statements is load-bearing for the 96.7% success rate, yet the abstract provides no concrete examples of the declarations, their structure, or how MCP exposes them to the agents.
- [Abstract] Abstract (results paragraph): The 90 theorems are described as spanning all eight modules and three difficulty levels, but no selection criteria, independence from ontology construction, or sampling procedure is stated; without this, it is impossible to rule out that the theorems were chosen precisely because they are directly provable from the supplied 93 lemmas, undermining the generalizability interpretation.
- [Ablation paragraph] Ablation paragraph: The reported ~12 percentage point drops when removing the tool-grounded architecture or Planner/Builder separation are presented with the backbone model held fixed, but the abstract gives no information on how the ablated baselines were implemented (e.g., prompt structure, tool access, or agent coordination), making it difficult to assess whether the performance gap is attributable to the claimed architectural features.
minor comments (2)
- [Abstract] The expansion of the MCP acronym is given only after its first use; spelling it out on first mention would improve readability.
- [Abstract] The abstract refers to 'existing specialized provers' without naming them or citing their reported success rates on CoW or comparable domains; adding these references would strengthen the comparison.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below. All points can be resolved through targeted revisions to the abstract and methods sections to improve clarity and transparency without altering the core claims or results.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the CoW ontology (8 modules, 93 declarations) supplies sufficient explicit structured domain knowledge for the Planner-Builder-Verifier agents to generate kernel-accepted proofs on arbitrary CoW statements is load-bearing for the 96.7% success rate, yet the abstract provides no concrete examples of the declarations, their structure, or how MCP exposes them to the agents.
Authors: We agree that the abstract would benefit from greater concreteness on this point. In the revised manuscript we will add a short illustrative example of a CoW declaration (e.g., a periodicity lemma) together with a one-sentence description of its JSON-like structure and the MCP endpoint that supplies it to the agents at inference time. revision: yes
-
Referee: [Abstract] Abstract (results paragraph): The 90 theorems are described as spanning all eight modules and three difficulty levels, but no selection criteria, independence from ontology construction, or sampling procedure is stated; without this, it is impossible to rule out that the theorems were chosen precisely because they are directly provable from the supplied 93 lemmas, undermining the generalizability interpretation.
Authors: The 90 theorems were drawn from the eight modules with explicit stratification by difficulty and with the requirement that each theorem require at least one non-trivial inference step beyond direct lookup of the 93 declarations. We will insert a concise statement of these selection criteria and the independence check into the revised abstract and expand the description in Section 4. revision: yes
-
Referee: [Ablation paragraph] Ablation paragraph: The reported ~12 percentage point drops when removing the tool-grounded architecture or Planner/Builder separation are presented with the backbone model held fixed, but the abstract gives no information on how the ablated baselines were implemented (e.g., prompt structure, tool access, or agent coordination), making it difficult to assess whether the performance gap is attributable to the claimed architectural features.
Authors: We acknowledge the omission. The revised abstract will briefly characterize the two ablated conditions (single-agent prompt without MCP access, and Planner-Builder merged into one agent with identical tool access) while keeping the backbone model fixed; fuller implementation details already appear in Section 5 and will be cross-referenced. revision: yes
Circularity Check
No significant circularity in derivation chain.
full rationale
The paper introduces a new CoW formalization (8 modules, 93 declarations) and evaluates LAMP's agentic proof synthesis on 90 theorems spanning those modules, reporting 96.7% success against unscaffolded baselines and prior provers. No equations, fitted parameters, or self-referential definitions appear; the central performance claim is an empirical measurement on an external benchmark set in a newly formalized domain rather than a quantity derived from the ontology by construction. The ontology is supplied explicitly at inference time, and success is not redefined in terms of the input lemmas themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math The Lean 4 kernel provides sound verification of submitted proofs
invented entities (1)
-
CoW ontology
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Anthropic: Model context protocol.https://modelcontextprotocol.io(2024), accessed 2026
2024
- [2]
-
[3]
Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E.W., Radev, D., Avigad, J.: Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433 (2023)
-
[4]
arXiv preprint arXiv:2506.19923 (2025)
Baba, K., Liu, C., Kurita, S., Sannai, A.: Prover agent: An agent-based framework for formal mathematical proofs. arXiv preprint arXiv:2506.19923 (2025)
-
[5]
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
Breen, B., Del Tredici, M., McCarran, J., Mijares, J.A., Yin, W.W., Sulimany, K., Taylor, J.M., Koppens, F.H., Englund, D.: Ax-prover: A deep reasoning agen- tic framework for theorem proving in mathematics and quantum physics. arXiv preprint arXiv:2510.12787 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Theoretical Computer Science412(27), 2931–2941 (2011)
Crochemore, M., Ilie, L., Tinta, L.: The “runs” conjecture. Theoretical Computer Science412(27), 2931–2941 (2011)
2011
-
[7]
Dressler, O.: Lean LSP MCP: Tools for agentic interaction with the Lean theorem prover (3 2025),https://github.com/oOo0oOo/lean-lsp-mcp
2025
-
[8]
Proceedings of the American Mathematical Society16(1), 109–114 (1965)
Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society16(1), 109–114 (1965)
1965
-
[9]
First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repairwithlargelanguagemodels.In:Proceedingsofthe31stACMJointEuropean Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 1229–1241 (2023)
2023
-
[10]
Fraenkel, A.S., Simpson, J.: How many squares can a string contain? J. Comb. Theory A82(1), 112–120 (1998)
1998
-
[11]
Han, J.M., Rute, J., Wu, Y., Ayers, E.W., Polu, S.: Proof artifact co-training for theorem proving with language models (2022),https://arxiv.org/abs/2102. 06203
2022
-
[12]
In: 12th International Conference on Interactive Theorem Proving (ITP 2021)
Holub, Š., Starosta, Š.: Formalization of basic combinatorics on words. In: 12th International Conference on Interactive Theorem Proving (ITP 2021). Leibniz In- ternational Proceedings in Informatics (LIPIcs), vol. 193, pp. 22:1–22:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021).https://doi.org/10.4230/ LIPIcs.ITP.2021.22
2021
-
[13]
In: Develop- ments in Language Theory (DLT 2021)
Holub, Š., Starosta, Š.: Lyndon words formalized in Isabelle/HOL. In: Develop- ments in Language Theory (DLT 2021). Lecture Notes in Computer Science, vol. 12811, pp. 217–228. Springer (2021)
2021
-
[14]
In: International Conference on Learning Representations
Hu, J., Zhu, T., Welleck, S.: minictx: Neural theorem proving with (long-) contexts. In: International Conference on Learning Representations. vol. 2025, pp. 10106– 10130 (2025)
2025
-
[15]
Nature pp
Hubert, T., Mehta, R., Sartran, L., Horváth, M.Z., Žužić, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y., Masoom, H., et al.: Olympiad-level formal mathematical reasoning with reinforcement learning. Nature pp. 1–3 (2025)
2025
-
[16]
Jiang, A.Q., Welleck, S., Zhou, J.P., Li, W., Liu, J., Jamnik, M., Lacroix, T., Wu, Y., Lample, G.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283 (2022)
-
[17]
Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19
Lample, G., Lacroix, T., Lachaux, M.A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., Martinet, X.: Hypertree proof search for neural theorem proving. Ad- vances in neural information processing systems35, 26337–26349 (2022) LAMP: Lean-based Agentic framework with MCP and Proof Repair 19
2022
-
[18]
In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)
Leroy, X., Blazy, S., Kästner, D., Schommer, B., Pister, M., Ferdinand, C.: Compcert-a formally verified optimizing compiler. In: ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress (2016)
2016
-
[19]
arXiv preprint arXiv:2404.09939 (2024)
Li, Z., Sun, J., Murphy, L., Su, Q., Li, Z., Zhang, X., Yang, K., Si, X.: A survey on deep learning for theorem proving. arXiv preprint arXiv:2404.09939 (2024)
-
[20]
In: International Conference on Learning Representations
Lin, H., Sun, Z., Welleck, S., Yang, Y.: Lean-star: Learning to interleave thinking and proving. In: International Conference on Learning Representations. vol. 2025, pp. 66041–66062 (2025)
2025
- [21]
-
[22]
Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., et al.: Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. arXiv preprint arXiv:2508.03613 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Cambridge Mathematical Library, Cambridge University Press, 2nd edn
Lothaire, M.: Combinatorics on Words. Cambridge Mathematical Library, Cambridge University Press, 2nd edn. (1997).https://doi.org/10.1017/ CBO9780511566097
1997
-
[24]
Cambridge University Press (2002)
Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press (2002)
2002
-
[25]
In: Automated Deduction – CADE 28
de Moura, L., Ullrich, S.: The Lean 4 theorem prover and programming language. In: Automated Deduction – CADE 28. Lecture Notes in Computer Science, vol. 12699, pp. 625–635. Springer (2021)
2021
-
[26]
Springer (2002)
Nipkow, T., Paulson, L.C., Wenzel, M.: Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer (2002)
2002
-
[27]
Discrete Applied Mathematics341, 349–358 (2023)
Patawar, M., Kapoor, K.: The square density of words having a sequence of fs- double squares. Discrete Applied Mathematics341, 349–358 (2023)
2023
-
[28]
In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344
Polu, S., Han, J.M., Zheng, K., Baksys, M., Babuschkin, I., Sutskever, I.: For- mal mathematics statement curriculum learning. In: International Conference on Learning Representations (ICLR) (2023), arXiv:2202.01344
-
[29]
Polu, S., Sutskever, I.: Generative language modeling for automated theorem prov- ing (2020),https://arxiv.org/abs/2009.03393
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[30]
Ren, Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., et al.: Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Advances in neural information processing systems36, 68539–68551 (2023)
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettle- moyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems36, 68539–68551 (2023)
2023
-
[32]
Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450
Simpson, J.: Palindromic periodicities. Australasian Journal of Combinatorics 92(3), 450–462 (2025),https://ajc.maths.uq.edu.au/pdf/92/ajc_v92_p450. pdf
2025
-
[33]
arXiv preprint arXiv:2508.14644 (2025)
Song, C., Wang, Z., Pu, F., Wang, H., Lin, X., Liu, J., Li, J., Liu, Z.: Leangeo: Formalizing competitional geometry problems in lean. arXiv preprint arXiv:2508.14644 (2025)
-
[34]
arXiv preprint arXiv:2404.12534 (2024)
Song, P., Yang, K., Anandkumar, A.: Lean copilot: Large language models as copilots for theorem proving in lean. arXiv preprint arXiv:2404.12534 (2024)
- [35]
- [36]
-
[37]
Thakur,A.,Wen,Y.,Chaudhuri,S.:Alanguage-agentapproachtoformaltheorem- proving (2024),https://openreview.net/forum?id=XCMbagV0No
2024
-
[38]
The Coq Development Team: The Coq proof assistant.https://coq.inria.fr (2024), accessed: 2026-06-13
2024
-
[39]
The mathlib Community: The Lean mathematical library. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP). pp. 367–381. ACM (2020).https://doi.org/10.1145/3372885.3373824
- [40]
-
[41]
Advances in Neural Information Processing Systems 37, 11545–11569 (2024)
Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S.: Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. Advances in Neural Information Processing Systems 37, 11545–11569 (2024)
2024
-
[42]
Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Santos, M.D., Sung, F., et al.: Kimina-prover preview: Towards large formal reasoning models with reinforcement learning (2025),https://arxiv.org/abs/2504.11354
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
In: International Conference on Learning Representations
Wang, H., Xin, H., Zheng, C., Liu, Z., Cao, Q., Huang, Y., Xiong, J., Shi, H., Xie, E., Yin, J., et al.: Lego-prover: Neural theorem proving with growing libraries. In: International Conference on Learning Representations. vol. 2024, pp. 31566–31597 (2024)
2024
-
[44]
arXiv e-prints pp
Wang, R., Pan, R., Li, Y., Zhang, J., Jia, Y., Diao, S., Pi, R., Hu, J., Zhang, T.: Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints pp. arXiv–2503 (2025)
2025
-
[45]
In: Advances in Neural Information Processing Systems (NeurIPS)
Wu, Y., Jiang, A.Q., Li, W., Rabe, M., Staats, C., Jamnik, M., Szegedy, C.: Aut- oformalization with large language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 32353–32368 (2022)
2022
- [46]
- [47]
- [48]
-
[49]
Advances in Neural Information Processing Systems36, 21573– 21612 (2023)
Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented lan- guage models. Advances in Neural Information Processing Systems36, 21573– 21612 (2023)
2023
-
[50]
In: International Conference on Learning Representations (ICLR) (2023)
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)
2023
-
[51]
Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21
Ying, H., Wu, Z., Geng, Y., Wang, J., Lin, D., Chen, K.: Lean workbook: A large- scale lean problem set formalized from natural language math problems. Advances in Neural Information Processing Systems37, 105848–105863 (2024) LAMP: Lean-based Agentic framework with MCP and Proof Repair 21
2024
-
[52]
FormalMATH: Benchmarking formal mathematical reasoning of large language models, 2025
Yu, Z., Peng, R., Ding, K., Li, Y., Peng, Z., Liu, M., Zhang, Y., Yuan, Z., Xin, H., Huang, W., et al.: Formalmath: Benchmarking formal mathematical reasoning of large language models. arXiv preprint arXiv:2505.02735 (2025)
-
[53]
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
Zheng, K., Han, J.M., Polu, S.: Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110 (2021) Appendix A The CoW Library: Additional Details Design rationale Words are represented asListαrather than a custom inductive type, so that all of Lean’s existing list lemmas and automation are available for free.Fa...
work page internal anchor Pith review Pith/arXiv arXiv 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.