Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3
The pith
DAP achieves SOTA on Hard Mode ATP by having LLMs discover answers then prove them formally, solving 10 CombiBench and 36 PutnamBench problems while exposing that LLMs exceed 80% answer accuracy where formal provers stay under 10%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%.
Load-bearing premise
That rewriting a Hard Mode statement into an Easy Mode version after answer discovery preserves the original problem's logical content and difficulty without introducing simplifications or errors that artificially aid the prover.
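The rewrite step at issue can be pictured with a schematic Lean 4 pair, echoing the paper's fill-in-the-blank convention of an `abbrev` whose value is `sorry`. All names and the worked integral below are illustrative, not taken from the paper's benchmarks:

```lean
import Mathlib

-- Hard Mode: the answer slot is an `abbrev` left as `sorry`;
-- the system must discover the value before any proof attempt.
noncomputable abbrev hardAnswer : ℝ := sorry

theorem hardMode : ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = hardAnswer := sorry

-- Easy Mode rewrite after answer discovery: only the `abbrev` changes;
-- the theorem's proof obligation is passed to the downstream prover.
noncomputable abbrev easyAnswer : ℝ := Real.pi / 4

theorem easyMode : ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = easyAnswer := sorry
```

The premise is that `easyMode` is the same problem as `hardMode` once `easyAnswer` is correct, and that the substitution introduces no simplification.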
Original abstract
Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces 'Hard Mode' ATP benchmarks in Lean 4, where systems must discover answers independently before proving (contrasted with 'Easy Mode' where answers are embedded in statements). It releases expert-reannotated Hard Mode variants MiniF2F-Hard and FIMO-Hard, and presents the open-source Discover And Prove (DAP) agentic framework: LLMs perform natural-language answer discovery with self-reflection, rewrite Hard Mode statements into Easy Mode versions embedding the answer, and invoke existing provers. DAP reports new SOTA results on CombiBench (10 solved vs. prior 7 at Pass@16) and PutnamBench (first system to formally prove 36 theorems in Hard Mode), while measuring a large gap where LLMs exceed 80% answer accuracy but formal provers succeed on under 10% of the same problems.
Significance. If the Hard-to-Easy rewrites are shown to preserve logical equivalence without simplification or error, the work is significant: it supplies more realistic benchmarks that better match human competition settings, demonstrates a practical hybrid LLM-plus-prover pipeline that advances Hard Mode performance, and provides quantitative evidence of the informal-to-formal gap that standard Easy Mode benchmarks obscure. The open-source release of DAP and the two new Hard Mode datasets are concrete, reusable contributions that can ground future research.
major comments (2)
- [DAP Framework and Evaluation sections] The central empirical claims (10 solved on CombiBench, 36 on PutnamBench in Hard Mode) rest entirely on the correctness of the LLM-driven Hard-to-Easy rewrites described in the DAP pipeline. The manuscript provides no validation protocol—neither Lean-level implication or equivalence checks, nor external expert audit, nor error analysis—of these rewrites. Without such verification, the subsequent formal proofs do not establish the original Hard Mode theorems, rendering the reported solve counts and the LLM-vs-prover gap measurement inconclusive.
- [Dataset Release and Benchmark Construction] The abstract and introduction state that MiniF2F-Hard and FIMO-Hard are 'expert-reannotated' Hard Mode variants, yet supply no details on the annotation process, number of problems excluded or modified, inter-annotator agreement, or how Hard Mode status was validated. This information is required to assess whether the new benchmarks faithfully represent the stricter setting and to interpret the performance deltas.
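A Lean-level check of the kind requested in the first major comment could, in its simplest form, take the following shape: prove that the rewritten Easy Mode statement, with the discovered answer substituted, discharges the Hard Mode obligation of exhibiting some answer. The statement and names here are hypothetical, not from the paper:

```lean
import Mathlib

-- Hypothetical soundness check for one rewrite: the Easy Mode hypothesis
-- yields a witness for the Hard Mode existential. A full protocol would
-- also need to rule out simplifications introduced by the rewrite.
theorem rewriteSound
    (easy : ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = Real.pi / 4) :
    ∃ a : ℝ, ∫ x in (0:ℝ)..1, 1 / (1 + x ^ 2) = a :=
  ⟨Real.pi / 4, easy⟩
```

Such a check is only one direction; it shows the Easy Mode proof establishes the original goal, not that the two statements are equally hard.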
minor comments (2)
- [Results and Discussion] The abstract claims 'state-of-the-art LLMs exceed 80% answer accuracy' on the same problems; the main text should report the exact LLM, prompt, and sampling settings used for this measurement so the gap can be reproduced.
- [Experimental Results] Figure or table presenting the per-benchmark solved counts should include baseline details (e.g., which prior SOTA system achieved 7 on CombiBench at Pass@16) and confidence intervals or variance across runs.
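For context on the Pass@16 baseline mentioned above, the standard unbiased Pass@k estimator from the code-evaluation literature (Chen et al., 2021) can be computed as follows; `n` attempts with `c` correct are assumed available per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c of them correct)
    succeeds. Returns 1.0 when every draw must contain a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Reporting per-problem `n` and `c` alongside this estimator would let readers reconstruct variance across runs.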
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of validation and transparency in our work. We address the major comments point by point below, committing to revisions that will enhance the rigor of the manuscript.
Point-by-point responses
-
Referee: [DAP Framework and Evaluation sections] The central empirical claims (10 solved on CombiBench, 36 on PutnamBench in Hard Mode) rest entirely on the correctness of the LLM-driven Hard-to-Easy rewrites described in the DAP pipeline. The manuscript provides no validation protocol—neither Lean-level implication or equivalence checks, nor external expert audit, nor error analysis—of these rewrites. Without such verification, the subsequent formal proofs do not establish the original Hard Mode theorems, rendering the reported solve counts and the LLM-vs-prover gap measurement inconclusive.
Authors: We recognize the validity of this concern. The correctness of the Hard-to-Easy rewrites is indeed foundational to our empirical results. While the manuscript describes the DAP pipeline, it lacks explicit validation details. In the revised version, we will add a dedicated 'Validation of Rewrites' subsection under the DAP Framework. This will include: (1) a description of our post-hoc expert review process, where a Lean expert will manually inspect a sample of successful rewrites for semantic equivalence and absence of simplifications; (2) an error analysis breaking down failure modes of the rewrite step (e.g., incorrect answer embedding, logical alterations); and (3) quantitative results on rewrite success rate. Regarding Lean-level checks, we note that automated equivalence verification is non-trivial for these complex statements, but we will discuss this limitation and how the expert audit mitigates it. We believe these additions will substantiate the solve counts and the observed gap. revision: yes
-
Referee: [Dataset Release and Benchmark Construction] The abstract and introduction state that MiniF2F-Hard and FIMO-Hard are 'expert-reannotated' Hard Mode variants, yet supply no details on the annotation process, number of problems excluded or modified, inter-annotator agreement, or how Hard Mode status was validated. This information is required to assess whether the new benchmarks faithfully represent the stricter setting and to interpret the performance deltas.
Authors: We agree that the annotation details are insufficient in the current manuscript. The term 'expert-reannotated' was used without elaboration. In the revision, we will substantially expand the 'Benchmark Construction' section (or create a new appendix) with the following information: the total number of problems from the original benchmarks that were considered, how many were excluded and the reasons (e.g., problems where Hard Mode conversion was ambiguous or altered the mathematical intent), the specific guidelines given to the annotators, the number of annotators involved, inter-annotator agreement metrics (Cohen's kappa or percentage agreement on Hard Mode classification and modifications), and the procedure for validating Hard Mode status (e.g., checking that the target answer cannot be deduced directly from the statement without additional reasoning). This will provide the necessary transparency to evaluate the benchmarks' fidelity to the Hard Mode setting. revision: yes
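The inter-annotator agreement metric the authors promise (Cohen's kappa) is straightforward to compute over two annotators' Hard Mode labels; a minimal sketch, assuming paired label lists of equal length:

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    assert len(a) == len(b) and a
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    if p_exp == 1:  # degenerate case: both annotators use one label
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)
```

Kappa is 1.0 for perfect agreement, 0.0 for chance-level agreement, and negative when annotators disagree more than chance would predict.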
Circularity Check
No circularity: empirical SOTA claims rest on independent benchmark counts
Full rationale
The paper defines Hard Mode vs Easy Mode terminology and releases new expert-annotated benchmarks (MiniF2F-Hard, FIMO-Hard), then reports raw solve counts for DAP on CombiBench (10/??) and PutnamBench (36 theorems). No equations, fitted parameters, or self-referential definitions appear; the central results are direct empirical tallies of formally verified theorems. The LLM-driven rewrite step is a methodological component whose logical equivalence is an external validity question, not a reduction of the reported counts to the method's own inputs by construction. No self-citation chains or ansatzes underpin the performance claims.