Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
Pith reviewed 2026-05-25 07:44 UTC · model grok-4.3
The pith
Ax-Prover equips LLMs with Lean tools in a multi-agent setup to prove theorems across math and quantum physics where specialized systems fall short.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a multi-agent system combining LLMs for reasoning with Lean tools via MCP delivers a generalizable methodology for automated theorem proving, remaining competitive with state-of-the-art provers on public math benchmarks while largely outperforming them on newly introduced benchmarks in abstract algebra and quantum theory.
What carries the argument
The multi-agent framework that supplies LLMs with Lean verification tools through the Model Context Protocol to enforce formal correctness during proof generation.
If this is right
- The approach remains competitive with specialized provers on existing public math benchmarks.
- It produces higher success rates than those systems on the new abstract algebra and quantum theory benchmarks.
- It supports collaborative formalization of complex results such as a cryptography theorem with human experts.
- The tool-based agentic design supplies a single methodology usable across multiple scientific domains.
Where Pith is reading between the lines
- The same architecture could be tested on benchmarks from additional fields such as number theory or condensed-matter physics without domain-specific retraining.
- Success here raises the question of whether similar tool integration would improve performance in non-Lean formal systems or in proof search without language models.
- If the pattern holds, research groups might shift from maintaining separate provers per subfield toward shared agentic tool layers.
Load-bearing premise
That outperformance on the two new benchmarks demonstrates genuine cross-domain generalization rather than effects from how those benchmarks were constructed or tuned to the system.
What would settle it
An evaluation on a third independently designed benchmark in a different domain, with no changes to Ax-Prover, that shows whether the performance edge disappears.
Figures
read the original abstract
We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Ax-Prover, a multi-agent system that integrates LLMs with Lean theorem-proving tools via the Model Context Protocol (MCP) to generate formal proofs. It evaluates the system as an autonomous prover on two existing public math benchmarks (where it is reported competitive with frontier LLMs and specialized provers) and on two newly introduced Lean benchmarks in abstract algebra and quantum theory (where it is reported to largely outperform specialized models). The central claim is that this tool-based agentic architecture provides a generalizable methodology for formal verification across scientific domains, unlike specialized systems. The paper also includes a collaborative demonstration in which Ax-Prover assists an expert in formalizing a complex cryptography theorem.
Significance. If the reported performance deltas on the new benchmarks are shown to be robust and free of construction bias, the work would offer a concrete demonstration that agentic, tool-augmented LLMs can extend formal theorem proving into domains such as abstract algebra and quantum theory where specialized provers have limited coverage. The explicit use of MCP for Lean tool access and the human-in-the-loop cryptography example are concrete strengths that could be built upon by the formal-methods and AI-for-science communities.
major comments (3)
- [Abstract] Abstract: the central claim that Ax-Prover 'largely outperforms' specialized models on the two new Lean benchmarks in abstract algebra and quantum theory is stated without any numerical results, baseline scores, error bars, problem counts, or description of how the benchmarks were constructed or filtered. This information is load-bearing for the generalization argument and its absence prevents evaluation of whether the reported gains reflect architectural advantages or benchmark-specific properties.
- [Benchmark sections (inferred from abstract)] The description of the new benchmarks (abstract algebra and quantum theory) provides no details on problem selection criteria, difficulty calibration, coverage of Lean tactics, or steps taken to prevent data leakage from training corpora. Without these, the outperformance cannot be taken as evidence that the multi-agent tool-use design generalizes rather than exploiting construction choices that reward iterative agent behavior.
- [Evaluation and conclusion sections (inferred from abstract)] The generalization claim ('unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology') rests entirely on the empirical comparison to the two new benchmarks; no ablation isolating the contribution of the multi-agent MCP architecture versus single-pass LLM prompting is reported, leaving the causal link between architecture and cross-domain performance untested.
minor comments (2)
- [Abstract] The abstract mentions 'two public math benchmarks' without naming them; explicit citation of the datasets (e.g., miniF2F, ProofNet) would improve reproducibility.
- [Use-case section (inferred from abstract)] The cryptography use-case is presented as a practical demonstration but lacks any quantitative measure of human effort saved or proof length/complexity, which would strengthen the assistant-capability claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the presentation of the benchmarks and evaluation.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that Ax-Prover 'largely outperforms' specialized models on the two new Lean benchmarks in abstract algebra and quantum theory is stated without any numerical results, baseline scores, error bars, problem counts, or description of how the benchmarks were constructed or filtered. This information is load-bearing for the generalization argument and its absence prevents evaluation of whether the reported gains reflect architectural advantages or benchmark-specific properties.
Authors: We agree that the abstract would benefit from greater specificity to support the central claim. In the revised manuscript we have updated the abstract to report problem counts for the new benchmarks, key performance numbers with baselines, and a concise description of construction and filtering. Full tables with error bars appear in the evaluation section. revision: yes
-
Referee: [Benchmark sections (inferred from abstract)] The description of the new benchmarks (abstract algebra and quantum theory) provides no details on problem selection criteria, difficulty calibration, coverage of Lean tactics, or steps taken to prevent data leakage from training corpora. Without these, the outperformance cannot be taken as evidence that the multi-agent tool-use design generalizes rather than exploiting construction choices that reward iterative agent behavior.
Authors: We acknowledge the need for these details. The revised manuscript adds a new subsection that specifies problem selection criteria (recent publications and textbooks), difficulty calibration via expert review and tactic complexity, Lean tactic coverage, and leakage-prevention steps including post-training-cutoff sourcing and manual originality checks. revision: yes
-
Referee: [Evaluation and conclusion sections (inferred from abstract)] The generalization claim ('unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology') rests entirely on the empirical comparison to the two new benchmarks; no ablation isolating the contribution of the multi-agent MCP architecture versus single-pass LLM prompting is reported, leaving the causal link between architecture and cross-domain performance untested.
Authors: We agree that a direct ablation would strengthen the causal argument. The original experiments did not include a single-pass versus multi-agent ablation. In the revision we have added an explicit discussion of this limitation in the evaluation section, clarified that the specialized baselines are architecturally distinct from LLM-based agents, and noted that future work will include such an ablation. The current evidence still supports generalization via the observed cross-domain performance gap. revision: partial
Circularity Check
No significant circularity; empirical evaluation only
full rationale
The paper presents an empirical multi-agent system evaluated on public benchmarks plus two author-introduced Lean benchmarks in abstract algebra and quantum theory. No mathematical derivation chain, fitted parameters, or first-principles results are described. The generalization claim rests on reported outperformance comparisons rather than any self-referential construction, self-citation load-bearing premise, or ansatz. This matches the default expectation for non-circular empirical work; the central claim has independent content in the form of external benchmark results.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 5 Pith papers
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
NeuroClaw Technical Report
NeuroClaw introduces a three-tier multi-agent framework and NeuroBench benchmark that improve executability and reproducibility scores for neuroimaging tasks when used with multimodal LLMs.
-
The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data
The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.
-
Automated Conjecture Resolution with Formal Verification
An AI framework combining informal reasoning and formal verification resolves an open commutative algebra problem and produces a Lean 4-checked proof with minimal human input.
-
LLMs with in-context learning for Algorithmic Theoretical Physics
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
Reference graph
Works this paper leans on
- [1]
-
[2]
Aristotle: IMO-level Automated Theorem Proving
Tudor Achim, Alex Best, Kevin Der, Math¨ıs F´ed´erico, Sergei Gukov, Daniel Halpern-Leister, Kirsten Hennings- gard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, et al. Aristotle: Imo-level automated theorem proving.arXiv preprint arXiv:2510.01346, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Claude 3.7 sonnet and claude code.https://www.anthropic.com/news/ claude-3-7-sonnet, 2025
Anthropic. Claude 3.7 sonnet and claude code.https://www.anthropic.com/news/ claude-3-7-sonnet, 2025. Accessed: YYYY-MM-DD
work page 2025
-
[4]
Claude 4.https://www.anthropic.com/news/claude-4, 2025
Anthropic. Claude 4.https://www.anthropic.com/news/claude-4, 2025. Accessed: 2025-09-16
work page 2025
-
[5]
Anthropic. Claude documentation, 2025. Accessed: 2025-09-19
work page 2025
-
[6]
Claude opus 4.1.https://www.anthropic.com/claude/opus, 2025
Anthropic. Claude opus 4.1.https://www.anthropic.com/claude/opus, 2025. Accessed: 2025-10- 14
work page 2025
-
[7]
Introducing claude haiku 4.5, October 2025
Anthropic PBC. Introducing claude haiku 4.5, October 2025. Accessed on November 14, 2025
work page 2025
-
[8]
Leanlm: Large language models for lean theorem proving.arXiv preprint arXiv:2306.09264, 2023
Ethan Ayers et al. Leanlm: Large language models for lean theorem proving.arXiv preprint arXiv:2306.09264, 2023
-
[9]
Formal proving with llms: Lean as a benchmark
Zhenisbek Azerbayev et al. Formal proving with llms: Lean as a benchmark. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 23
work page 2023
-
[10]
Kaito Baba, Chaoran Liu, Shuhei Kurita, and Akiyoshi Sannai. Prover agent: An agent-based framework for formal mathematical proofs.arXiv preprint arXiv:2506.19923, 2025
-
[11]
Satisfiability modulo theories: An appe- tizer.Communications of the ACM, 65(6):69–77, 2022
Haniel Barbosa, Clark Barrett, Pascal Fontaine, and Andrew Reynolds. Satisfiability modulo theories: An appe- tizer.Communications of the ACM, 65(6):69–77, 2022
work page 2022
-
[12]
Computer-aided security proofs for the working cryptographer
Gilles Barthe, Benjamin Gr ´egoire, Sylvain Heraud, and Santiago Zanella B ´eguelin. Computer-aided security proofs for the working cryptographer. InAnnual Cryptology Conference, pages 71–90. Springer, 2011
work page 2011
-
[13]
Tactician: Lean proof automation with knn
Bart Blaauwbroek et al. Tactician: Lean proof automation with knn. InProceedings of the International Confer- ence on Interactive Theorem Proving (ITP), pages 348–366. Springer, 2020
work page 2020
-
[14]
Bruno Blanchet. A computationally sound mechanized prover for security protocols.IEEE Transactions on Dependable and Secure Computing, 5(4):193–207, 2008
work page 2008
-
[15]
Julia Borghoff, Anne Canteaut, Tim G ¨uneysu, Elif Bilge Kavun, Miroslav Kne ˇzevi´c, Lars R. Knudsen, Gregor Leander, Ventzislav Nikov, Christof Paar, Christian Rechberger, Peter Rombouts, Søren S. Thomsen, and Tolga Yalc ¸ın. Prince – a low-latency block cipher for pervasive computing applications. In Xiaoyun Wang and Kazue Sako, editors,Advances in Cry...
work page 2012
-
[16]
Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025
-
[17]
Evaluating Large Language Models Trained on Code
Mark Chen et al. Evaluating large language models trained on code. InarXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Quantum algorithms for lattice problems
Yilei Chen. Quantum algorithms for lattice problems. Technical report, Cryptology ePrint Archive, Report 2024/555, 2024. Updated April 18: algorithm contains an unfixable bug invalidating the main claim (see Section 3.5.9, Page 37)
work page 2024
-
[19]
Yuri Chervonyi, Trieu H Trinh, Miroslav Ol ˇs´ak, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V Le, and Thang Luong. Gold-medalist performance in solving olympiad geometry with alphageometry2.arXiv preprint arXiv:2502.03544, 2025
-
[20]
Llms as conversational partners for mathematicians.arXiv preprint arXiv:2305.XXXX, 2023
Emily Collins et al. Llms as conversational partners for mathematicians.arXiv preprint arXiv:2305.XXXX, 2023
work page 2023
-
[21]
Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient smt solver. InProceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 337–340. Springer, 2008
work page 2008
-
[22]
The lean theorem prover (system description)
Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InAutomated Deduction – CADE-25, volume 9195 ofLecture Notes in Computer Science, pages 378–388. Springer, 2015
work page 2015
-
[23]
Deepseek-prover-v1 dataset.https://huggingface.co/datasets/deepseek-ai/ DeepSeek-Prover-V1, 2024
DeepSeek-AI. Deepseek-prover-v1 dataset.https://huggingface.co/datasets/deepseek-ai/ DeepSeek-Prover-V1, 2024. Accessed: 2025-08-24
work page 2024
-
[24]
deepseek ai. Deepseek-prover-v2-671b, 2025. Accessed: 2025-09-24
work page 2025
-
[25]
Lean-lsp-mcp: Tools for agentic interaction with the lean theorem prover, 3 2025
Oliver Dressler. Lean-lsp-mcp: Tools for agentic interaction with the lean theorem prover, 3 2025. Accessed: 2025-08-24
work page 2025
-
[26]
David S Dummit and Richard M Foote. Abstract algebra. john wile & sons.Inc., Hoboken, NJ, 2004
work page 2004
-
[27]
Praveen Gauravaram, Lars Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schl¨affer, and Søren S. Thomsen. Grøstl – a sha-3 candidate. Submission to nist, NIST, September 2008. Available athttp://www.groestl.info/
work page 2008
-
[28]
Tactictoe: Learning to prove with tactics
Th ´er`ese Gauthier, Cezary Kaliszyk, and Josef Urban. Tactictoe: Learning to prove with tactics. InProceedings of the International Conference on Automated Deduction (CADE), pages 275–294. Springer, 2021. 24
work page 2021
-
[29]
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Learning to prove theorems via interacting with proof assistants
Daniel Huang et al. Learning to prove theorems via interacting with proof assistants. InInternational Conference on Machine Learning (ICML), pages 2654–2663, 2019
work page 2019
-
[31]
Geoffrey Irving, Christian Szegedy, Alexander A. Alemi, et al. Deepmath - deep sequence models for premise selection. InAdvances in Neural Information Processing Systems (NeurIPS), pages 2235–2243, 2016
work page 2016
-
[32]
Another look at” provable security”.Journal of Cryptology, 20(1):3–37, 2007
Neal Koblitz and Alfred J Menezes. Another look at” provable security”.Journal of Cryptology, 20(1):3–37, 2007
work page 2007
-
[33]
First-order theorem proving and vampire
Laura Kov ´acs and Andrei V oronkov. First-order theorem proving and vampire. InProceedings of the Interna- tional Conference on Computer Aided Verification (CAV), pages 1–35. Springer, 2013
work page 2013
-
[34]
Leanagent: Lifelong learning for formal theorem proving.arXiv preprint arXiv:2410.06209, 2024
Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. Leanagent: Lifelong learning for formal theorem proving.arXiv preprint arXiv:2410.06209, 2024
-
[35]
Deep reinforcement learning for theorem proving
Guillaume Lample and Franc ¸ois Charton. Deep reinforcement learning for theorem proving. InInternational Conference on Learning Representations (ICLR), 2022
work page 2022
-
[36]
Mathlib statistics.https://leanprover-community.github.io/ mathlib_stats.html, 2025
Lean Prover Community. Mathlib statistics.https://leanprover-community.github.io/ mathlib_stats.html, 2025. GitHub repository for generating statistics plots for Mathlib; accessed 2025-08-24
work page 2025
-
[37]
Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath.[https://huggingface.co/AI-MO/NuminaMath-1.5](https: //github.com/project-numina/aimo-progress-prize/blob/main/report/num...
work page 2024
-
[38]
Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, et al. Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025
-
[39]
Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Unconditional Security Of Quantum Key Distribution Over Arbitrarily Long Distances
Hoi-Kwong Lo and HF Chau. Unconditional security of quantum key distribution over arbitrarily long distances. arXiv preprint quant-ph/9803006, 1998
work page internal anchor Pith review Pith/arXiv arXiv 1998
-
[41]
Hoi-Kwong Lo and Hoi Fung Chau. Unconditional security of quantum key distribution over arbitrarily long distances.science, 283(5410):2050–2056, 1999
work page 2050
-
[42]
Process-driven autoformalization in lean 4
Zeyuan Lu, Guodong Zhang, Junxiao Chen, Huan Chen, Yilun Chen, Zonglin Li, Yiping Li, Lianmin Wang, Yao Lin, Ce Zhang, and Jie Chen. Process-driven autoformalization in lean 4. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[43]
Introducing gauss, an agent for autoformalization, 2025
Math Inc. Introducing gauss, an agent for autoformalization, 2025. Announcement of autoformalization agent for formal verification in mathematics
work page 2025
-
[44]
Alex Meiburg, Leonardo A Lessa, and Rodolfo R Soldati. A formalization of the generalized quantum stein’s lemma in lean.arXiv preprint arXiv:2510.08672, 2025
- [45]
-
[46]
What is the model context protocol (mcp)?https://modelcontextprotocol
Model Context Protocol. What is the model context protocol (mcp)?https://modelcontextprotocol. io/docs/getting-started/intro, 2024. Accessed: 2025-10-05
work page 2024
-
[47]
Robust de-anonymization of large sparse datasets
Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE, 2008
work page 2008
-
[48]
Advanced encryption standard (aes)
National Institute of Standards and Technology. Advanced encryption standard (aes). Federal Information Processing Standards Publication FIPS 197-upd1, U.S. Department of Commerce, Gaithersburg, MD, 2001. Published November 26, 2001; Updated May 9, 2023
work page 2001
-
[49]
Cambridge university press, 2010
Michael A Nielsen and Isaac L Chuang.Quantum computation and quantum information. Cambridge university press, 2010
work page 2010
-
[50]
Numinamath-lean dataset.https://huggingface.co/datasets/AI-MO/ NuminaMath-LEAN, 2025
Numina-Team. Numinamath-lean dataset.https://huggingface.co/datasets/AI-MO/ NuminaMath-LEAN, 2025. Accessed: 2025-08-24
work page 2025
-
[51]
Openai models documentation, 2025
OpenAI. Openai models documentation, 2025. Accessed: 2025-09-19
work page 2025
-
[52]
Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. Apollo: Automated llm and lean collaboration for advanced formal reasoning.arXiv preprint arXiv:2505.05758, 2025
-
[53]
Formal mathematics statement curriculum learning
Stanislas Polu et al. Formal mathematics statement curriculum learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[54]
Generative language modeling for automated theorem proving
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. InInterna- tional Conference on Learning Representations (ICLR), 2020
work page 2020
-
[55]
Z.Z. Ren, Zhihong Shao, Wenfeng Liang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.https://arxiv.org/abs/2504.21801, 2025. Accessed: 2025-08-24
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
S ´ebastien Rousseau. Bug discovered in quantum algorithm for lattice-based crypto.https:// sebastienrousseau.com/2024-04-22-bug-discovered-in-quantum-algorithm-for-lattice-based-crypto/ index.html, April 22 2024. Accessed: [add access date here]
work page 2024
-
[57]
E prover 2.0: Integrating equational and first-order logic
Stephan Schulz, Simon Cruanes, and Petar Vukmirovi´c. E prover 2.0: Integrating equational and first-order logic. InProceedings of the International Conference on Automated Deduction (CADE), pages 523–541. Springer, 2019
work page 2019
-
[58]
Peter W Shor and John Preskill. Simple proof of security of the bb84 quantum key distribution protocol.Physical review letters, 85(2):441, 2000
work page 2000
-
[59]
Sequences of games: a tool for taming complexity in security proofs.cryptology eprint archive, 2004
Victor Shoup. Sequences of games: a tool for taming complexity in security proofs.cryptology eprint archive, 2004
work page 2004
-
[60]
Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Lean copilot: Large language models as copilots for theorem proving in lean.arXiv preprint arXiv:2404.12534, 2024
-
[61]
Latanya Sweeney. k-anonymity: A model for protecting privacy.International journal of uncertainty, fuzziness and knowledge-based systems, 10(05):557–570, 2002
work page 2002
-
[62]
Putnambench leaderboard.https://trishullab.github.io/PutnamBench/ leaderboard.html, 2025
Trishullab. Putnambench leaderboard.https://trishullab.github.io/PutnamBench/ leaderboard.html, 2025. Accessed: 2025-10-07
work page 2025
-
[63]
George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competi- tion.Advances in Neural Information Processing Systems, 37:11545–11569, 2024
work page 2024
-
[64]
Machine learning preselected proof steps
Josef Urban, Geoff Sutcliffe, Stefan Petrov, and Josef Vysko ˇcil. Machine learning preselected proof steps. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2046–2051, 2011. 26
work page 2046
-
[65]
Sumanth Varambally, Thomas V oice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. Hilbert: Recursively building formal proofs with informal reasoning.arXiv preprint arXiv:2509.22819, 2025
-
[66]
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
Haiming Wang et al. Kimina-prover preview: Towards large formal reasoning models with reinforcement learn- ing.https://arxiv.org/abs/2504.11354, 2025. Accessed: 2025-08-24
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
Qihao Wu, Haotian Zhang, Jialin Chen, Yizhou Li, Xingjian Zhang, Ce Zhang, and Jie Chen. Autoformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025
-
[68]
Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data.https: //arxiv.org/abs/2405.14333, 2024. Accessed: 2025-08-24
-
[69]
Huajian Xin, ZZ Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1. 5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search.arXiv preprint arXiv:2408.08152, 2024
-
[70]
Leandojo: Theorem proving with large language models
Zhangir Xin et al. Leandojo: Theorem proving with large language models. InInternational Conference on Learning Representations (ICLR), 2024
work page 2024
-
[71]
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[72]
Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, and Kai Chen. Lean workbook: A large-scale lean problem set formalized from natural language math problems.https://arxiv.org/abs/2406. 03847, 2024. Accessed: 2025-08-24
work page 2024
-
[73]
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: A cross-system benchmark for formal olympiad- level mathematics.https://arxiv.org/abs/2109.00110, 2021. Accessed: 2025-08-24. 27
work page internal anchor Pith review Pith/arXiv arXiv 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.