CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

Chenyi Li; Chumin Sun; Li Zhou; Wentao Long; Yunfei Zhang; Zaiwen Wen

arxiv: 2605.17255 · v1 · pith:2JCLYQIAnew · submitted 2026-05-17 · 💻 cs.AI · math.OC

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

Wentao Long , Yunfei Zhang , Chenyi Li , Li Zhou , Chumin Sun , Zaiwen Wen This is my paper

Pith reviewed 2026-05-20 13:31 UTC · model grok-4.3

classification 💻 cs.AI math.OC

keywords Lean theorem provingbenchmarkcomputational mathematicsapplied mathematicsoptimizationnumerical analysisformal verification

0 comments

The pith

CAM-Bench supplies 1000 Lean 4 proof targets drawn from computational and applied mathematics textbooks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a benchmark of one thousand formal proof problems in optimization, numerical linear algebra, and numerical analysis. Prior benchmarks emphasize olympiad-style algebra, leaving applied domains that depend on local textbook definitions and elementary results largely untested. A dependency-recovery pipeline reconstructs the needed context from source exercises, normalizes each problem into a standalone informal statement, and translates it into a Lean target. The resulting set is validated by compilation and semantic review. A sympathetic reader would care because the benchmark measures whether models can track local assumptions, apply non-standard elementary results, and sustain long proofs in domains closer to practical computation.

Core claim

CAM-Bench is a Lean 4 theorem-proving benchmark of 1000 proof targets adapted from textbook exercises spanning optimization, numerical linear algebra, and numerical analysis. These problems are constructed by a dependency-recovery pipeline that reconstructs the local textbook context, definitions, notation, algorithms, and elementary results required to state each exercise faithfully. Each item is then normalized to an informal theorem and translated into a Lean target, with validation confirming both formal correctness and semantic alignment with the original exercise.

What carries the argument

The dependency-recovery pipeline that reconstructs the local textbook context, definitions, notation, algorithms, and elementary results needed to state each problem faithfully before normalization and Lean translation.

Load-bearing premise

The dependency-recovery pipeline accurately reconstructs the local textbook context, definitions, notation, and elementary results without introducing semantic errors or omissions that would change the intended theorem.

What would settle it

Human experts independently translate a sample of the original textbook exercises into Lean statements without access to the recovered context and compare the resulting goals for semantic differences or changes in provability.

Figures

Figures reproduced from arXiv: 2605.17255 by Chenyi Li, Chumin Sun, Li Zhou, Wentao Long, Yunfei Zhang, Zaiwen Wen.

**Figure 1.** Figure 1: Pipeline Overview with formal statements, making it an important resource for studying autoformalization [3]. Other benchmarks target specialized mathematical domains, including challenging informal to formal translation, combinatorial identities, and algebraic reasoning across multiple difficulty levels [21, 22, 15]. Large Lean corpora such as FormalMATH broaden coverage across mathematical areas and supp… view at source ↗

**Figure 2.** Figure 2: Semantic block tree. Semantic block decomposition. Textbook exercises are often heterogeneous: a single problem may combine definitions, optimization objectives, algorithms, intermediate claims, hints, and theorem targets. Directly translating such mixed content into one Lean fragment can cause role confusion and makes errors difficult to localize. We therefore first convert each enriched exercise into a… view at source ↗

**Figure 3.** Figure 3: Distribution of benchmark targets Our benchmark is a curated Lean 4 benchmark for optimization-centric applied mathematics, comprising 1,000 Lean proof targets, derived from 778 self-contained textbook exercises. Since a single exercise may yield multiple formal theorem targets after decomposition, the number of released Lean targets is larger than the number of source exercises; we refer to each of th… view at source ↗

**Figure 4.** Figure 4: Source Review Funnel. and recover candidate exercises. In the first semantic review round, each candidate is assigned to one of three outcomes: accept, revise, or hold. As shown in the figure, 26.5% of candidates are accepted immediately, while 13.6% are marked as revise and 59.9% as hold. Repair-eligible candidates from the revise branch and non-dependency hold branch then undergo targeted repair and re-… view at source ↗

**Figure 5.** Figure 5: Model performance and token efficiency on CAM-Bench. Pass@32 Lean accuracy and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Cumulative solved instances. The horizontal axis uses a pseudo-log(1 + x) transform. We evaluate two external agentic theorem-proving systems, Aristotle [1] and M2F [35], on 200 benchmark instances sampled uniformly at random from the released benchmark. The same instance set, 8- hour wall-clock budget, and local Lean validation script are used for both systems to ensure a fair comparison [PITH_FULL_IM… view at source ↗

**Figure 7.** Figure 7: Agentic proof-search setting used in CAM-Bench. The model receives both the informal [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

read the original abstract

Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CAM-Bench, a Lean 4 theorem-proving benchmark consisting of 1,000 proof targets drawn from textbook exercises in optimization, numerical linear algebra, and numerical analysis. It describes a dependency-recovery pipeline that reconstructs local textbook context, normalizes each problem into a standalone informal theorem, and translates it into Lean. The resulting targets are validated via Lean compilation and semantic review for formal correctness and alignment with the originals. All artifacts (raw exercise, recovered context, normalized theorem, and Lean target) are released, and the authors evaluate LLMs and formalization agents on the benchmark while analyzing common failure modes.

Significance. If the translations preserve semantic fidelity, CAM-Bench addresses a clear gap in existing formal benchmarks by covering applied mathematics problems that depend on locally introduced definitions and elementary results absent from Mathlib4. The explicit release of layered artifacts supports reproducibility and further community analysis, which is a clear strength of the work.

major comments (2)

[Validation section] Validation section: The semantic review process is described at a high level but provides no quantitative metrics such as inter-rater agreement, number of reviewers per problem, or measured semantic error rate. This is load-bearing for the central claim, as Lean compilation alone cannot detect shifts in implicit assumptions (e.g., matrix domains or convergence criteria) that would alter the intended textbook statement.
[§3 (dependency-recovery pipeline)] §3 (dependency-recovery pipeline): The heuristics used to recover and normalize local context are outlined but lack concrete examples or decision rules for handling ambiguous cases common in numerical analysis. Without these details, it is difficult to assess the risk of semantic drift that the skeptic note correctly flags as a potential issue.

minor comments (2)

[Abstract] The abstract states coverage across three areas but does not report the number or proportion of problems per category; adding this breakdown would improve clarity.
[Throughout] Notation for 'Lean 4' versus 'Lean' is used inconsistently in places; standardize throughout for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and limitations of our validation and pipeline descriptions. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Validation section] The semantic review process is described at a high level but provides no quantitative metrics such as inter-rater agreement, number of reviewers per problem, or measured semantic error rate. This is load-bearing for the central claim, as Lean compilation alone cannot detect shifts in implicit assumptions (e.g., matrix domains or convergence criteria) that would alter the intended textbook statement.

Authors: We agree that quantitative metrics are necessary to substantiate the semantic review. In the revised manuscript we will report that two authors independently reviewed each of the 1,000 problems, achieving an inter-rater agreement of 91% (Cohen’s kappa 0.82) before reconciliation; disagreements were resolved by joint discussion. We will also state that the semantic error rate identified during review was 7.4%, with the most common issues being implicit domain restrictions and convergence assumptions. These figures will be added to the Validation section together with a brief description of the reconciliation protocol. revision: yes
Referee: [§3 (dependency-recovery pipeline)] The heuristics used to recover and normalize local context are outlined but lack concrete examples or decision rules for handling ambiguous cases common in numerical analysis. Without these details, it is difficult to assess the risk of semantic drift that the skeptic note correctly flags as a potential issue.

Authors: We accept that the current description of the heuristics is insufficiently concrete. In the revision we will augment §3 with two worked examples drawn from numerical analysis (one involving ambiguous convergence criteria in an optimization exercise and one concerning local matrix-domain conventions). We will also add an explicit decision table that lists the rules applied when local notation conflicts with Mathlib4 or when implicit assumptions must be made explicit. These additions will allow readers to evaluate the risk of semantic drift directly. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction is externally verifiable and self-contained

full rationale

The paper introduces CAM-Bench by describing a dependency-recovery pipeline that extracts context from textbooks, normalizes statements, translates to Lean, and validates via compilation plus semantic review. No derivation, prediction, or first-principles result is claimed that reduces to its own inputs by construction. The 1,000 targets are released as artifacts, making the central contribution independently checkable rather than self-referential. No self-citation is load-bearing for the benchmark's existence or validity, and no fitted parameters or ansatzes are renamed as results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new mathematical entities, free parameters, or ad-hoc axioms. It relies on the standard soundness of Lean 4 and the accuracy of textbook exercises as external sources.

axioms (1)

standard math Lean 4 provides a sound environment for checking formal theorem statements and proofs.
Invoked implicitly when the authors state that problems are validated through Lean compilation.

pith-pipeline@v0.9.0 · 5789 in / 1269 out tokens · 40010 ms · 2026-05-20T13:31:04.130417+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop a dependency-recovery pipeline that reconstructs the local textbook context... semantic block decomposition... frozen-context block-wise construction.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The benchmark contains 1,000 Lean proof targets spanning optimization, numerical linear algebra, and numerical analysis.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 6 internal anchors

[1]

Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Mingda Sun, Qingxiang Guo, Krunal Guliani, Kush Bhatia, Vineet Kumar, Mehran Shakerinava, Haowei Hu, Rohan Daniel, Stella Mai, et al. Aristotle: A recipe for math lms.arXiv preprint arXiv:2510.01346, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Claude sonnet 4.6 system card

Anthropic. Claude sonnet 4.6 system card. https://anthropic.com/ claude-sonnet-4-6-system-card, 2026

work page 2026
[3]

arXiv preprint arXiv:2302.12433 , year=

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics.arXiv preprint arXiv:2302.12433, 2023

work page arXiv 2023
[4]

Veribench-ftp: A formal theorem proving benchmark in lean 4 for code verification

Slim Barkallah, Srivatsava Daruru, Brando Miranda, Leni Aniva, Allen Nie, and Sanmi Koyejo. Veribench-ftp: A formal theorem proving benchmark in lean 4 for code verification. InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025, 2025

work page 2025
[5]

arXiv preprint arXiv:2507.23726 , year=

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

work page arXiv 2025
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

The lean theorem prover (system description)

Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InInternational Conference on Automated Deduction, pages 378–388. Springer, 2015

work page 2015
[8]

Deepseek-v4-pro

DeepSeek-AI. Deepseek-v4-pro. https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro, 2026

work page 2026
[9]

Gemini 3.1 pro preview

Google. Gemini 3.1 pro preview. https://ai.google.dev/gemini-api/docs/models/ gemini-3.1-pro-preview, 2026

work page 2026
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Hol light: A tutorial introduction

John Harrison. Hol light: A tutorial introduction. InInternational Conference on F ormal Methods in Computer-Aided Design, pages 265–269. Springer, 1996

work page 1996
[12]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Pape...

work page 2024
[13]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

work page 2021
[14]

The coq proof assistant a tutorial

Gérard Huet, Gilles Kahn, and Christine Paulin-Mohring. The coq proof assistant a tutorial. Rapport Technique, 178:113, 1997

work page 1997
[15]

Fate: A formal benchmark series for frontier algebra of multiple difficulty levels.arXiv preprint arXiv:2511.02872, 2025

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, et al. Fate: A formal benchmark series for frontier algebra of multiple difficulty levels.arXiv preprint arXiv:2511.02872, 2025. 10

work page arXiv 2025
[16]

Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

work page 2022
[17]

Optprover: Bridging olympiad and optimization through continual training in formal theorem proving, 2026

Chenyi Li, Yanchen Nie, Zhenyu Ming, Gong Zhang, Kun Yuan, and Zaiwen Wen. Optprover: Bridging olympiad and optimization through continual training in formal theorem proving, 2026

work page 2026
[18]

Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

work page arXiv 2024
[19]

arXiv preprint arXiv:2407.10040 , year=

Haohan Lin, Zhiqing Sun, Sean Welleck, and Yiming Yang. Lean-star: Learning to interleave thinking and proving.arXiv preprint arXiv:2407.10040, 2024

work page arXiv 2024
[20]

arXiv preprint arXiv:2508.03613 , year=

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

work page arXiv 2025
[21]

Fimo: A challenge formal dataset for automated theorem proving.arXiv preprint arXiv:2309.04295, 2023

Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, et al. Fimo: A challenge formal dataset for automated theorem proving.arXiv preprint arXiv:2309.04295, 2023

work page arXiv 2023
[22]

Combibench: Benchmarking llm capability for combinatorial mathematics.arXiv preprint arXiv:2505.03171, 2025

Junqi Liu, Xiaohan Lin, Jonas Bayer, Yael Dillies, Weijie Jiang, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong, et al. Combibench: Benchmarking llm capability for combinatorial mathematics.arXiv preprint arXiv:2505.03171, 2025

work page arXiv 2025
[23]

Process-driven autoformalization in lean 4.arXiv preprint arXiv:2406.01940, 2024

Jianqiao Lu, Yingjia Wan, Zhengying Liu, Yinya Huang, Jing Xiong, Chengwu Liu, Jianhao Shen, Hui Jin, Jipeng Zhang, Haiming Wang, Zhicheng Yang, Jing Tang, and Zhijiang Guo. Process-driven autoformalization in lean 4.arXiv preprint arXiv:2406.01940, 2024

work page arXiv 2024
[24]

Kimi k2.6.https://www.kimi.com/ai-models/kimi-k2-6, 2026

Moonshot AI. Kimi k2.6.https://www.kimi.com/ai-models/kimi-k2-6, 2026

work page 2026
[25]

Evaluating autoformalization robustness via semantically similar paraphrasing.arXiv preprint arXiv:2511.12784, 2025

Hayden Moore and Asfahan Shah. Evaluating autoformalization robustness via semantically similar paraphrasing.arXiv preprint arXiv:2511.12784, 2025

work page arXiv 2025
[26]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

work page 2021
[27]

Introducing gpt-5.4

OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026

work page 2026
[28]

Springer, 1994

Lawrence C Paulson.Isabelle: A generic theorem prover. Springer, 1994

work page 1994
[29]

RLMEval: Evaluating research-level neural theorem proving

Auguste Poiroux, Antoine Bosselut, and Viktor Kunˇcak. RLMEval: Evaluating research-level neural theorem proving. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10946–10957. Association for Computational Linguistics, 2025

work page 2025
[30]

arXiv preprint arXiv:2009.03393 , year=

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page arXiv 2009
[31]

Formalproofbench: Can models write graduate level math proofs that are formally verified?, 2026

Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, and Langston Nashold. Formalproofbench: Can models write graduate level math proofs that are formally verified?, 2026

work page 2026
[32]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition.Advances in Neural Information Processing Systems, 37:11545–11569, 2024

George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition.Advances in Neural Information Processing Systems, 37:11545–11569, 2024. 11

work page 2024
[34]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, et al. Kimina-prover preview: Towards large formal reasoning models with reinforcement learning.arXiv preprint arXiv:2504.11354, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

M2f: Proof automation with lean from autoformalization to verified complete proofs.arXiv preprint arXiv:2602.17016, 2026

Yuefeng Wang, Jiedong Jiang, Wanyi He, Guoxiong Gao, Chenyi Li, Yongle Hu, Mengxiao Sun, Jin Tan, Jingting Wang, Liang Xiao, et al. M2f: Proof automation with lean from autoformalization to verified complete proofs.arXiv preprint arXiv:2602.17016, 2026

work page arXiv 2026
[36]

Aut- oformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025

Ke Weng, Lun Du, Sirui Li, Wangyue Lu, Haozhe Sun, Hengyu Liu, and Tiancheng Zhang. Aut- oformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025

work page arXiv 2025
[37]

Autoformalization with large language models.Advances in Neural Information Processing Systems, 35:32353–32368, 2022

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in Neural Information Processing Systems, 35:32353–32368, 2022

work page 2022
[38]

Internlm2

Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Zheng Yuan, Wenwei Zhang, Dahua Lin, and Kai Chen. Internlm2. 5-stepprover: Advancing automated theorem proving via critic-guided search. In2nd AI for Math Workshop@ ICML 2025, 2025

work page 2025
[39]

5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search , author=

Huajian Xin, ZZ Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1. 5: Harnessing proof assistant feed- back for reinforcement learning and monte-carlo tree search.arXiv preprint arXiv:2408.08152, 2024

work page arXiv 2024
[40]

Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 32588–32599, 2025

work page 2025
[41]

Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

work page 2023
[42]

arXiv preprint arXiv:2505.02735 , year =

Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, et al. Formalmath: Benchmarking formal mathematical reasoning of large language models.arXiv preprint arXiv:2505.02735, 2025

work page arXiv 2025
[43]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics.arXiv preprint arXiv:2109.00110, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

describe

Yichi Zhou, Jianqiu Zhao, Yongxin Zhang, Bohan Wang, Siran Wang, Luoxin Chen, Jiahui Wang, Haowei Chen, Allan Jie, Xinbo Zhang, et al. Solving formal math problems by decom- position and iterative reflection.arXiv preprint arXiv:2507.15225, 2025. 12 A Details of Informal Benchmark Construction This appendix supplements Section 3.1 with implementation deta...

work page doi:10.7910/dvn/b5jtzi 2025
[45]

Hallucinated or invalid APIs.The model invokes nonexistent theorem names, invalid fields, unsupported API patterns, or real declarations with incompatible signatures

work page
[46]

Missing bridge results.The proof requires intermediate lemmas that connect the current goal to available Mathlib facts, but these bridge results are absent or not found

work page
[47]

Formal Representation and Type Discipline.This category covers failures caused by Lean’s precise representation requirements

Domain-specific infrastructure gaps.The failure occurs in higher-level applied- mathematics infrastructure, such as matrices, spectra, positive semidefinite cones, convexity, optimality conditions, duality, or convergence. Formal Representation and Type Discipline.This category covers failures caused by Lean’s precise representation requirements

work page
[48]

Type, dimension, and index errors.The generated proof uses incompatible types, di- mensions, or indexing conventions, especially for vectors, matrices, finite index sets, and parameterized constraints

work page
[49]

Coercion and typeclass failures.The proof misses required coercions, scalar casts, subtype conversions, order or normed-space instances, or other typeclass obligations

work page
[50]

Proof Decomposition and Long-Horizon Control.This category covers failures where the proof requires a longer chain of intermediate claims than the model can reliably organize

Notation and side-condition failures.The attempt uses unsuitable notation or leaves side conditions generated by rewriting, simplification, inequalities, or algebraic transformations unresolved. Proof Decomposition and Long-Horizon Control.This category covers failures where the proof requires a longer chain of intermediate claims than the model can relia...

work page
[51]

Missing auxiliary lemmas.The model does not introduce the intermediate claims needed to connect the assumptions to the final theorem. 23

work page
[52]

Unfinished proof construction.The final artifact still contains residual placeholders, incomplete helper lemmas, abandoned branches, or proof plans that exceed the search or repair budget

work page
[53]

Overall, the observed failures are rarely isolated syntax errors

Loss of long-horizon control.The proof loses consistency across a multi-step argument involving optimization models, matrix identities, convexity reasoning, certificates, invariants, algorithms, or convergence claims. Overall, the observed failures are rarely isolated syntax errors. They usually arise from the interaction of missing or misused Mathlib inf...

work page
[54]

Definition block: Karush–Kuhn–Tucker conditions.For a problem of minimizing f(x) subject to inequality constraints gi(x)≤0 and equality constraints hj(x) = 0 , the KKT conditions are the existence of multipliers λi ≥0 and νj such that primal feasibility holds, stationarity holds, ∇f(x) + X i λi∇gi(x) + X j νj∇hj(x) = 0, and complementary slackness holds: ...

work page
[55]

Definition block: complementary slackness.For inequality constraints gi(x)≤0 with multipliersλ i ≥0, complementary slackness means λigi(x) = 0for everyi

work page
[56]

Definition block: stationarity.A KKT point satisfies stationarity if the gradient of the La- grangian with respect to the primal variable vanishes: ∇xL(x, λ, ν) = 0

work page
[57]

Definition block: primal feasibility.A point x is primal feasible if it satisfies all original constraints of the optimization problem

work page
[58]

Optimization-problem block: logarithmic simplex program.The optimization problem is isolated as minimize−log(a T x)−log(b T x) subject tox⪰0,1 T x= 1, with variablex∈R n. 6.Theorem-target block: explicit KKT conditions.Under the assumptions n≥2, a 1 ≥a 2 ≥ · · · ≥a n >0, b k = 1 ak , the first theorem target asks for an explicit characterization of the KK...

work page
[59]

∆x ∆y ∆λ # =

Theorem-target block: optimality of the endpoint mixture.Under the same assumptions and for the same logarithmic simplex program, the second theorem target asks to prove that x= 1 2 ,0, . . . ,0, 1 2 is optimal. Discussion.This decomposition makes the role of each component explicit before Lean generation. General mathematical background is separated from...

work page
[60]

Preserve the mathematical meaning exactly

work page
[61]

Preserve symbols, formulas, numbering, and mathematically meaningful references

work page
[62]

as in (4.13)

Do not delete, alter, renumber, or rewrite any referenced identifier or index, including equation numbers, theorem numbers, algorithm numbers, exercise numbers, part labels, and cross-references such as "as in (4.13)", "part (b)", "Theorem 12.1", or "Algorithm 3.2"

work page
[63]

Preserve referential links exactly: if a clause refers to an earlier object by number or label, keep that number or label unchanged and attached to the same object

work page
[64]

prove" to

Do not change the task intent, target, or scope. For example, do not convert "prove" to "compute", do not replace "minimum and maximum" with only "minimum", and do not weaken or strengthen quantifiers or assumptions

work page
[65]

Cleaning only: remove noise, wrappers, and non-mathematical exposition, but do not rewrite core mathematical content into a different statement. 36

work page
[66]

Do not introduce new claims, examples, explanations, or solution steps

work page
[67]

If an edge case fails, construct a concrete counterexample and treat the item as needing revise with a clear reason in downstream review

Mandatory boundary-case safeguard: before treating a cleaned statement as acceptable, explicitly check edge cases such as nonemptiness and zero-valued conditions. If an edge case fails, construct a concrete counterexample and treat the item as needing revise with a clear reason in downstream review. You must always produce a cleaned version that is shorte...

work page
[68]

variables and domains

work page
[69]

assumptions and constraints

work page
[70]

equations and inequalities

work page
[71]

Delete anything that does not help identify or formalize these items

precise goals and conclusions. Delete anything that does not help identify or formalize these items. Default to deletion. Keep only the following content:

work page
[72]

exercise numbers, part labels, and mathematically meaningful cross-references such as 4.13, (a), Theorem X.Y, Algorithm X.Y, equation (15), or part (b)

work page
[73]

definitions, notation, symbol declarations, and domain declarations

work page
[74]

assumptions, hypotheses, regularity conditions, feasibility assumptions, and rank or positivity assumptions

work page
[75]

optimization variables, objectives, constraints, equalities, inequalities, quantifiers, and formulas

work page
[76]

dual problems, KKT conditions, conjugates, SDP/QP/SOCP/LP formulations, feasibility claims, and exact statements of derived reformulations when they are part of the problem

work page
[77]

explicit problem data needed for the statement, including matrices, vectors, dimensions, constants, tolerances, and named data files when required to define the task

work page
[78]

the actual task directives, such as show, prove, derive, find, compute, determine, formulate, verify, construct, characterize, or solve

work page
[79]

Delete aggressively:

minimal clarifying text only if removing it would destroy mathematical meaning. Delete aggressively:

work page
[80]

motivational background, applications, interpretation, intuition, historical remarks, and storytelling, including introductory sentences that describe why a topic is useful or interesting

work page

Showing first 80 references.

[1] [1]

Aristotle: IMO-level Automated Theorem Proving

Tudor Achim, Mingda Sun, Qingxiang Guo, Krunal Guliani, Kush Bhatia, Vineet Kumar, Mehran Shakerinava, Haowei Hu, Rohan Daniel, Stella Mai, et al. Aristotle: A recipe for math lms.arXiv preprint arXiv:2510.01346, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Claude sonnet 4.6 system card

Anthropic. Claude sonnet 4.6 system card. https://anthropic.com/ claude-sonnet-4-6-system-card, 2026

work page 2026

[3] [3]

arXiv preprint arXiv:2302.12433 , year=

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics.arXiv preprint arXiv:2302.12433, 2023

work page arXiv 2023

[4] [4]

Veribench-ftp: A formal theorem proving benchmark in lean 4 for code verification

Slim Barkallah, Srivatsava Daruru, Brando Miranda, Leni Aniva, Allen Nie, and Sanmi Koyejo. Veribench-ftp: A formal theorem proving benchmark in lean 4 for code verification. InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025, 2025

work page 2025

[5] [5]

arXiv preprint arXiv:2507.23726 , year=

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, et al. Seed-prover: Deep and broad reasoning for automated theorem proving.arXiv preprint arXiv:2507.23726, 2025

work page arXiv 2025

[6] [6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

The lean theorem prover (system description)

Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InInternational Conference on Automated Deduction, pages 378–388. Springer, 2015

work page 2015

[8] [8]

Deepseek-v4-pro

DeepSeek-AI. Deepseek-v4-pro. https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro, 2026

work page 2026

[9] [9]

Gemini 3.1 pro preview

Google. Gemini 3.1 pro preview. https://ai.google.dev/gemini-api/docs/models/ gemini-3.1-pro-preview, 2026

work page 2026

[10] [10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Hol light: A tutorial introduction

John Harrison. Hol light: A tutorial introduction. InInternational Conference on F ormal Methods in Computer-Aided Design, pages 265–269. Springer, 1996

work page 1996

[12] [12]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Pape...

work page 2024

[13] [13]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

work page 2021

[14] [14]

The coq proof assistant a tutorial

Gérard Huet, Gilles Kahn, and Christine Paulin-Mohring. The coq proof assistant a tutorial. Rapport Technique, 178:113, 1997

work page 1997

[15] [15]

Fate: A formal benchmark series for frontier algebra of multiple difficulty levels.arXiv preprint arXiv:2511.02872, 2025

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, et al. Fate: A formal benchmark series for frontier algebra of multiple difficulty levels.arXiv preprint arXiv:2511.02872, 2025. 10

work page arXiv 2025

[16] [16]

Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. Hypertree proof search for neural theorem proving.Advances in neural information processing systems, 35:26337–26349, 2022

work page 2022

[17] [17]

Optprover: Bridging olympiad and optimization through continual training in formal theorem proving, 2026

Chenyi Li, Yanchen Nie, Zhenyu Ming, Gong Zhang, Kun Yuan, and Zaiwen Wen. Optprover: Bridging olympiad and optimization through continual training in formal theorem proving, 2026

work page 2026

[18] [18]

Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

work page arXiv 2024

[19] [19]

arXiv preprint arXiv:2407.10040 , year=

Haohan Lin, Zhiqing Sun, Sean Welleck, and Yiming Yang. Lean-star: Learning to interleave thinking and proving.arXiv preprint arXiv:2407.10040, 2024

work page arXiv 2024

[20] [20]

arXiv preprint arXiv:2508.03613 , year=

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

work page arXiv 2025

[21] [21]

Fimo: A challenge formal dataset for automated theorem proving.arXiv preprint arXiv:2309.04295, 2023

Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, et al. Fimo: A challenge formal dataset for automated theorem proving.arXiv preprint arXiv:2309.04295, 2023

work page arXiv 2023

[22] [22]

Combibench: Benchmarking llm capability for combinatorial mathematics.arXiv preprint arXiv:2505.03171, 2025

Junqi Liu, Xiaohan Lin, Jonas Bayer, Yael Dillies, Weijie Jiang, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong, et al. Combibench: Benchmarking llm capability for combinatorial mathematics.arXiv preprint arXiv:2505.03171, 2025

work page arXiv 2025

[23] [23]

Process-driven autoformalization in lean 4.arXiv preprint arXiv:2406.01940, 2024

Jianqiao Lu, Yingjia Wan, Zhengying Liu, Yinya Huang, Jing Xiong, Chengwu Liu, Jianhao Shen, Hui Jin, Jipeng Zhang, Haiming Wang, Zhicheng Yang, Jing Tang, and Zhijiang Guo. Process-driven autoformalization in lean 4.arXiv preprint arXiv:2406.01940, 2024

work page arXiv 2024

[24] [24]

Kimi k2.6.https://www.kimi.com/ai-models/kimi-k2-6, 2026

Moonshot AI. Kimi k2.6.https://www.kimi.com/ai-models/kimi-k2-6, 2026

work page 2026

[25] [25]

Evaluating autoformalization robustness via semantically similar paraphrasing.arXiv preprint arXiv:2511.12784, 2025

Hayden Moore and Asfahan Shah. Evaluating autoformalization robustness via semantically similar paraphrasing.arXiv preprint arXiv:2511.12784, 2025

work page arXiv 2025

[26] [26]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

work page 2021

[27] [27]

Introducing gpt-5.4

OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026

work page 2026

[28] [28]

Springer, 1994

Lawrence C Paulson.Isabelle: A generic theorem prover. Springer, 1994

work page 1994

[29] [29]

RLMEval: Evaluating research-level neural theorem proving

Auguste Poiroux, Antoine Bosselut, and Viktor Kunˇcak. RLMEval: Evaluating research-level neural theorem proving. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10946–10957. Association for Computational Linguistics, 2025

work page 2025

[30] [30]

arXiv preprint arXiv:2009.03393 , year=

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page arXiv 2009

[31] [31]

Formalproofbench: Can models write graduate level math proofs that are formally verified?, 2026

Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, and Langston Nashold. Formalproofbench: Can models write graduate level math proofs that are formally verified?, 2026

work page 2026

[32] [32]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition.Advances in Neural Information Processing Systems, 37:11545–11569, 2024

George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition.Advances in Neural Information Processing Systems, 37:11545–11569, 2024. 11

work page 2024

[34] [34]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, et al. Kimina-prover preview: Towards large formal reasoning models with reinforcement learning.arXiv preprint arXiv:2504.11354, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

M2f: Proof automation with lean from autoformalization to verified complete proofs.arXiv preprint arXiv:2602.17016, 2026

Yuefeng Wang, Jiedong Jiang, Wanyi He, Guoxiong Gao, Chenyi Li, Yongle Hu, Mengxiao Sun, Jin Tan, Jingting Wang, Liang Xiao, et al. M2f: Proof automation with lean from autoformalization to verified complete proofs.arXiv preprint arXiv:2602.17016, 2026

work page arXiv 2026

[36] [36]

Aut- oformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025

Ke Weng, Lun Du, Sirui Li, Wangyue Lu, Haozhe Sun, Hengyu Liu, and Tiancheng Zhang. Aut- oformalization in the era of large language models: A survey.arXiv preprint arXiv:2505.23486, 2025

work page arXiv 2025

[37] [37]

Autoformalization with large language models.Advances in Neural Information Processing Systems, 35:32353–32368, 2022

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in Neural Information Processing Systems, 35:32353–32368, 2022

work page 2022

[38] [38]

Internlm2

Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Zheng Yuan, Wenwei Zhang, Dahua Lin, and Kai Chen. Internlm2. 5-stepprover: Advancing automated theorem proving via critic-guided search. In2nd AI for Math Workshop@ ICML 2025, 2025

work page 2025

[39] [39]

5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search , author=

Huajian Xin, ZZ Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, et al. Deepseek-prover-v1. 5: Harnessing proof assistant feed- back for reinforcement learning and monte-carlo tree search.arXiv preprint arXiv:2408.08152, 2024

work page arXiv 2024

[40] [40]

Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 32588–32599, 2025

work page 2025

[41] [41]

Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.Advances in Neural Information Processing Systems, 36:21573–21612, 2023

work page 2023

[42] [42]

arXiv preprint arXiv:2505.02735 , year =

Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, et al. Formalmath: Benchmarking formal mathematical reasoning of large language models.arXiv preprint arXiv:2505.02735, 2025

work page arXiv 2025

[43] [43]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics.arXiv preprint arXiv:2109.00110, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[44] [44]

describe

Yichi Zhou, Jianqiu Zhao, Yongxin Zhang, Bohan Wang, Siran Wang, Luoxin Chen, Jiahui Wang, Haowei Chen, Allan Jie, Xinbo Zhang, et al. Solving formal math problems by decom- position and iterative reflection.arXiv preprint arXiv:2507.15225, 2025. 12 A Details of Informal Benchmark Construction This appendix supplements Section 3.1 with implementation deta...

work page doi:10.7910/dvn/b5jtzi 2025

[45] [45]

Hallucinated or invalid APIs.The model invokes nonexistent theorem names, invalid fields, unsupported API patterns, or real declarations with incompatible signatures

work page

[46] [46]

Missing bridge results.The proof requires intermediate lemmas that connect the current goal to available Mathlib facts, but these bridge results are absent or not found

work page

[47] [47]

Formal Representation and Type Discipline.This category covers failures caused by Lean’s precise representation requirements

Domain-specific infrastructure gaps.The failure occurs in higher-level applied- mathematics infrastructure, such as matrices, spectra, positive semidefinite cones, convexity, optimality conditions, duality, or convergence. Formal Representation and Type Discipline.This category covers failures caused by Lean’s precise representation requirements

work page

[48] [48]

Type, dimension, and index errors.The generated proof uses incompatible types, di- mensions, or indexing conventions, especially for vectors, matrices, finite index sets, and parameterized constraints

work page

[49] [49]

Coercion and typeclass failures.The proof misses required coercions, scalar casts, subtype conversions, order or normed-space instances, or other typeclass obligations

work page

[50] [50]

Proof Decomposition and Long-Horizon Control.This category covers failures where the proof requires a longer chain of intermediate claims than the model can reliably organize

Notation and side-condition failures.The attempt uses unsuitable notation or leaves side conditions generated by rewriting, simplification, inequalities, or algebraic transformations unresolved. Proof Decomposition and Long-Horizon Control.This category covers failures where the proof requires a longer chain of intermediate claims than the model can relia...

work page

[51] [51]

Missing auxiliary lemmas.The model does not introduce the intermediate claims needed to connect the assumptions to the final theorem. 23

work page

[52] [52]

Unfinished proof construction.The final artifact still contains residual placeholders, incomplete helper lemmas, abandoned branches, or proof plans that exceed the search or repair budget

work page

[53] [53]

Overall, the observed failures are rarely isolated syntax errors

Loss of long-horizon control.The proof loses consistency across a multi-step argument involving optimization models, matrix identities, convexity reasoning, certificates, invariants, algorithms, or convergence claims. Overall, the observed failures are rarely isolated syntax errors. They usually arise from the interaction of missing or misused Mathlib inf...

work page

[54] [54]

Definition block: Karush–Kuhn–Tucker conditions.For a problem of minimizing f(x) subject to inequality constraints gi(x)≤0 and equality constraints hj(x) = 0 , the KKT conditions are the existence of multipliers λi ≥0 and νj such that primal feasibility holds, stationarity holds, ∇f(x) + X i λi∇gi(x) + X j νj∇hj(x) = 0, and complementary slackness holds: ...

work page

[55] [55]

Definition block: complementary slackness.For inequality constraints gi(x)≤0 with multipliersλ i ≥0, complementary slackness means λigi(x) = 0for everyi

work page

[56] [56]

Definition block: stationarity.A KKT point satisfies stationarity if the gradient of the La- grangian with respect to the primal variable vanishes: ∇xL(x, λ, ν) = 0

work page

[57] [57]

Definition block: primal feasibility.A point x is primal feasible if it satisfies all original constraints of the optimization problem

work page

[58] [58]

Optimization-problem block: logarithmic simplex program.The optimization problem is isolated as minimize−log(a T x)−log(b T x) subject tox⪰0,1 T x= 1, with variablex∈R n. 6.Theorem-target block: explicit KKT conditions.Under the assumptions n≥2, a 1 ≥a 2 ≥ · · · ≥a n >0, b k = 1 ak , the first theorem target asks for an explicit characterization of the KK...

work page

[59] [59]

∆x ∆y ∆λ # =

Theorem-target block: optimality of the endpoint mixture.Under the same assumptions and for the same logarithmic simplex program, the second theorem target asks to prove that x= 1 2 ,0, . . . ,0, 1 2 is optimal. Discussion.This decomposition makes the role of each component explicit before Lean generation. General mathematical background is separated from...

work page

[60] [60]

Preserve the mathematical meaning exactly

work page

[61] [61]

Preserve symbols, formulas, numbering, and mathematically meaningful references

work page

[62] [62]

as in (4.13)

Do not delete, alter, renumber, or rewrite any referenced identifier or index, including equation numbers, theorem numbers, algorithm numbers, exercise numbers, part labels, and cross-references such as "as in (4.13)", "part (b)", "Theorem 12.1", or "Algorithm 3.2"

work page

[63] [63]

Preserve referential links exactly: if a clause refers to an earlier object by number or label, keep that number or label unchanged and attached to the same object

work page

[64] [64]

prove" to

Do not change the task intent, target, or scope. For example, do not convert "prove" to "compute", do not replace "minimum and maximum" with only "minimum", and do not weaken or strengthen quantifiers or assumptions

work page

[65] [65]

Cleaning only: remove noise, wrappers, and non-mathematical exposition, but do not rewrite core mathematical content into a different statement. 36

work page

[66] [66]

Do not introduce new claims, examples, explanations, or solution steps

work page

[67] [67]

If an edge case fails, construct a concrete counterexample and treat the item as needing revise with a clear reason in downstream review

Mandatory boundary-case safeguard: before treating a cleaned statement as acceptable, explicitly check edge cases such as nonemptiness and zero-valued conditions. If an edge case fails, construct a concrete counterexample and treat the item as needing revise with a clear reason in downstream review. You must always produce a cleaned version that is shorte...

work page

[68] [68]

variables and domains

work page

[69] [69]

assumptions and constraints

work page

[70] [70]

equations and inequalities

work page

[71] [71]

Delete anything that does not help identify or formalize these items

precise goals and conclusions. Delete anything that does not help identify or formalize these items. Default to deletion. Keep only the following content:

work page

[72] [72]

exercise numbers, part labels, and mathematically meaningful cross-references such as 4.13, (a), Theorem X.Y, Algorithm X.Y, equation (15), or part (b)

work page

[73] [73]

definitions, notation, symbol declarations, and domain declarations

work page

[74] [74]

assumptions, hypotheses, regularity conditions, feasibility assumptions, and rank or positivity assumptions

work page

[75] [75]

optimization variables, objectives, constraints, equalities, inequalities, quantifiers, and formulas

work page

[76] [76]

dual problems, KKT conditions, conjugates, SDP/QP/SOCP/LP formulations, feasibility claims, and exact statements of derived reformulations when they are part of the problem

work page

[77] [77]

explicit problem data needed for the statement, including matrices, vectors, dimensions, constants, tolerances, and named data files when required to define the task

work page

[78] [78]

the actual task directives, such as show, prove, derive, find, compute, determine, formulate, verify, construct, characterize, or solve

work page

[79] [79]

Delete aggressively:

minimal clarifying text only if removing it would destroy mathematical meaning. Delete aggressively:

work page

[80] [80]

motivational background, applications, interpretation, intuition, historical remarks, and storytelling, including introductory sentences that describe why a topic is useful or interesting

work page