Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3
The pith
DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers stay under 10%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%.
Load-bearing premise
That rewriting a Hard Mode statement into an Easy Mode version after answer discovery preserves the original problem's logical content and difficulty without introducing simplifications or errors that artificially aid the prover.
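The rewrite step at issue can be pictured with a schematic Lean 4 pair, echoing the paper's fill-in-the-blank convention of an `abbrev` whose value is `sorry`. All names and the worked integral below are illustrative, not taken from the paper's benchmarks:

```lean
import Mathlib

-- Hard Mode: the answer slot is an `abbrev` left as `sorry`;
-- the system must discover the value before any proof attempt.
noncomputable abbrev hardAnswer : ℝ := sorry

theorem hardMode : ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = hardAnswer := sorry

-- Easy Mode rewrite after answer discovery: only the `abbrev` changes;
-- the theorem's proof obligation is passed to the downstream prover.
noncomputable abbrev easyAnswer : ℝ := Real.pi / 4

theorem easyMode : ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = easyAnswer := sorry
```

The premise is that `easyMode` is the same problem as `hardMode` once `easyAnswer` is correct, and that the substitution introduces no simplification.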
Original abstract
Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces 'Hard Mode' ATP benchmarks in Lean 4, where systems must discover answers independently before proving (contrasted with 'Easy Mode' where answers are embedded in statements). It releases expert-reannotated Hard Mode variants MiniF2F-Hard and FIMO-Hard, and presents the open-source Discover And Prove (DAP) agentic framework: LLMs perform natural-language answer discovery with self-reflection, rewrite Hard Mode statements into Easy Mode versions embedding the answer, and invoke existing provers. DAP reports new SOTA results on CombiBench (10 solved vs. prior 7 at Pass@16) and PutnamBench (first system to formally prove 36 theorems in Hard Mode), while measuring a large gap where LLMs exceed 80% answer accuracy but formal provers succeed on under 10% of the same problems.
Significance. If the Hard-to-Easy rewrites are shown to preserve logical equivalence without simplification or error, the work is significant: it supplies more realistic benchmarks that better match human competition settings, demonstrates a practical hybrid LLM-plus-prover pipeline that advances Hard Mode performance, and provides quantitative evidence of the informal-to-formal gap that standard Easy Mode benchmarks obscure. The open-source release of DAP and the two new Hard Mode datasets are concrete, reusable contributions that can ground future research.
major comments (2)
- [DAP Framework and Evaluation sections] The central empirical claims (10 solved on CombiBench, 36 on PutnamBench in Hard Mode) rest entirely on the correctness of the LLM-driven Hard-to-Easy rewrites described in the DAP pipeline. The manuscript provides no validation protocol—neither Lean-level implication or equivalence checks, nor external expert audit, nor error analysis—of these rewrites. Without such verification, the subsequent formal proofs do not establish the original Hard Mode theorems, rendering the reported solve counts and the LLM-vs-prover gap measurement inconclusive.
- [Dataset Release and Benchmark Construction] The abstract and introduction state that MiniF2F-Hard and FIMO-Hard are 'expert-reannotated' Hard Mode variants, yet supply no details on the annotation process, number of problems excluded or modified, inter-annotator agreement, or how Hard Mode status was validated. This information is required to assess whether the new benchmarks faithfully represent the stricter setting and to interpret the performance deltas.
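A Lean-level check of the kind requested in the first major comment could, in its simplest form, take the following shape: prove that the rewritten Easy Mode statement, with the discovered answer substituted, discharges the Hard Mode obligation of exhibiting some answer. The statement and names here are hypothetical, not from the paper:

```lean
import Mathlib

-- Hypothetical soundness check for one rewrite: the Easy Mode hypothesis
-- yields a witness for the Hard Mode existential. A full protocol would
-- also need to rule out simplifications introduced by the rewrite.
theorem rewriteSound
    (easy : ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = Real.pi / 4) :
    ∃ a : ℝ, ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = a :=
  ⟨Real.pi / 4, easy⟩
```

Such a check is only one direction; it shows the Easy Mode proof establishes the original goal, not that the two statements are equally hard.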
minor comments (2)
- [Results and Discussion] The abstract claims 'state-of-the-art LLMs exceed 80% answer accuracy' on the same problems; the main text should report the exact LLM, prompt, and sampling settings used for this measurement so the gap can be reproduced.
- [Experimental Results] Figure or table presenting the per-benchmark solved counts should include baseline details (e.g., which prior SOTA system achieved 7 on CombiBench at Pass@16) and confidence intervals or variance across runs.
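For context on the Pass@16 baseline mentioned above, the standard unbiased Pass@k estimator from the code-evaluation literature (Chen et al., 2021) can be computed as follows; `n` attempts with `c` correct are assumed available per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c of them correct)
    succeeds. Returns 1.0 when every draw must contain a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Reporting per-problem `n` and `c` alongside this estimator would let readers reconstruct variance across runs.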
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of validation and transparency in our work. We address the major comments point by point below, committing to revisions that will enhance the rigor of the manuscript.
Point-by-point responses
-
Referee: [DAP Framework and Evaluation sections] The central empirical claims (10 solved on CombiBench, 36 on PutnamBench in Hard Mode) rest entirely on the correctness of the LLM-driven Hard-to-Easy rewrites described in the DAP pipeline. The manuscript provides no validation protocol—neither Lean-level implication or equivalence checks, nor external expert audit, nor error analysis—of these rewrites. Without such verification, the subsequent formal proofs do not establish the original Hard Mode theorems, rendering the reported solve counts and the LLM-vs-prover gap measurement inconclusive.
Authors: We recognize the validity of this concern. The correctness of the Hard-to-Easy rewrites is indeed foundational to our empirical results. While the manuscript describes the DAP pipeline, it lacks explicit validation details. In the revised version, we will add a dedicated 'Validation of Rewrites' subsection under the DAP Framework. This will include: (1) a description of our post-hoc expert review process, where a Lean expert will manually inspect a sample of successful rewrites for semantic equivalence and absence of simplifications; (2) an error analysis breaking down failure modes of the rewrite step (e.g., incorrect answer embedding, logical alterations); and (3) quantitative results on rewrite success rate. Regarding Lean-level checks, we note that automated equivalence verification is non-trivial for these complex statements, but we will discuss this limitation and how the expert audit mitigates it. We believe these additions will substantiate the solve counts and the observed gap. revision: yes
-
Referee: [Dataset Release and Benchmark Construction] The abstract and introduction state that MiniF2F-Hard and FIMO-Hard are 'expert-reannotated' Hard Mode variants, yet supply no details on the annotation process, number of problems excluded or modified, inter-annotator agreement, or how Hard Mode status was validated. This information is required to assess whether the new benchmarks faithfully represent the stricter setting and to interpret the performance deltas.
Authors: We agree that the annotation details are insufficient in the current manuscript. The term 'expert-reannotated' was used without elaboration. In the revision, we will substantially expand the 'Benchmark Construction' section (or create a new appendix) with the following information: the total number of problems from the original benchmarks that were considered, how many were excluded and the reasons (e.g., problems where Hard Mode conversion was ambiguous or altered the mathematical intent), the specific guidelines given to the annotators, the number of annotators involved, inter-annotator agreement metrics (Cohen's kappa or percentage agreement on Hard Mode classification and modifications), and the procedure for validating Hard Mode status (e.g., checking that the target answer cannot be deduced directly from the statement without additional reasoning). This will provide the necessary transparency to evaluate the benchmarks' fidelity to the Hard Mode setting. revision: yes
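The inter-annotator agreement metric the authors promise (Cohen's kappa) is straightforward to compute over two annotators' Hard Mode labels; a minimal sketch, assuming paired label lists of equal length:

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    assert len(a) == len(b) and a
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    if p_exp == 1:  # degenerate case: both annotators use one label
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)
```

Kappa is 1.0 for perfect agreement, 0.0 for chance-level agreement, and negative when annotators disagree more than chance would predict.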
Circularity Check
No circularity: empirical SOTA claims rest on independent benchmark counts
Full rationale
The paper defines Hard Mode vs Easy Mode terminology and releases new expert-annotated benchmarks (MiniF2F-Hard, FIMO-Hard), then reports raw solve counts for DAP on CombiBench (10/??) and PutnamBench (36 theorems). No equations, fitted parameters, or self-referential definitions appear; the central results are direct empirical tallies of formally verified theorems. The LLM-driven rewrite step is a methodological component whose logical equivalence is an external validity question, not a reduction of the reported counts to the method's own inputs by construction. No self-citation chains or ansatzes underpin the performance claims.