pith. sign in

arxiv: 2606.12594 · v1 · pith:EIXTJMMNnew · submitted 2026-06-10 · 💻 cs.AI

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pith reviewed 2026-06-27 09:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords Lean theorem provingformal verificationaugmented formalisationself-distillationcurriculum learningMiniF2FPutnamBenchdiffusion model
0
0 comments X

The pith

Augmented Lean formalisation lets 4B and 32B models outperform far larger theorem provers on MiniF2F and PutnamBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Pythagoras-Prover, an open-source family of Lean theorem provers designed for modest compute. It combines a stratified verified corpus for curriculum supervised fine-tuning, dynamic trace filtering to fit 8k-token limits, and Augmented Lean Formalisation that creates statement variants through self-distillation without verifying every mutation. The resulting 4B model exceeds a 671B baseline at pass@32 on MiniF2F-Test while the 32B model sets a new open-source record and solves 93 PutnamBench problems. A diffusion-based 4B variant is also tested as an alternative to autoregressive generation. Readers would care because formal proving has required heavy resources; the work shows a route to strong results with smaller models and expanded but partially unverified data.

Core claim

Pythagoras-Prover-4B reaches 86.1 percent pass@32 on MiniF2F-Test, surpassing DeepSeek-Prover-V2-671B's 82.4 percent with roughly 167 times fewer parameters, while Pythagoras-Prover-32B attains 93.0 percent on the same test and solves 93 of 672 PutnamBench problems. These outcomes follow from training on a Lean-verified corpus divided into easy-medium-hard tiers for progressive learning, dynamic proof-reasoning filtering, and Augmented Lean Formalisation that perturbs known statements and populates variants via self-distillation to enlarge the training signal.

What carries the argument

Augmented Lean Formalisation (ALF), a process that perturbs known formal statements while preserving their formal character and fills the variants with self-distilled proofs to supply extra training signal without full verification of each instance.

If this is right

  • Curriculum ordering from short simple proofs to longer harder ones allows models to acquire proof skills progressively within fixed context budgets.
  • Dynamic filtering that keeps only informative traces while respecting token limits makes supervised fine-tuning practical on scarce verified data.
  • A diffusion-based prover offers an iterative refinement path at inference time that differs from standard autoregressive sampling.
  • The 32B model remains strongest on the contamination-sensitive MiniF2F-ALF benchmark while the 4B model matches prior state-of-the-art accuracy there.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation approach could be applied to other interactive theorem provers beyond Lean to reduce dependence on surface forms of statements.
  • If the unverified mutations prove reliable across domains, the method could lower the barrier to training formal systems for domains with even scarcer verified data.
  • Releasing MiniF2F-ALF as a harder benchmark suggests that future work should routinely test on mutated variants to guard against overfitting to surface forms.

Load-bearing premise

Mutated formal statements generated by Augmented Lean Formalisation via self-distillation supply valid and useful training signal even though they are not formally verified for every instance.

What would settle it

Run the trained models on a fresh collection of formally verified mutated statements or on MiniF2F-ALF with every variant manually checked; a sharp drop relative to the reported numbers would indicate the unverified mutations were not supplying reliable signal.

Figures

Figures reproduced from arXiv: 2606.12594 by Eleonora Giunchiglia, Haonan Li, Joshua Ong Jun Leang, Mihaela C\u{a}t\u{a}lina Stoian, Qiyuan Xu, Shay B. Cohen, Wenda Li, Zheng Zhao.

Figure 1
Figure 1. Figure 1: Performance comparison across proving tasks. The left and middle panels show results under a limited [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The seed synthetic data generation pipeline for [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Pythagoras-Prover pipeline. apply rubric-guided distillation to the hard tier, since its simplification step would convert hard-source failures into local subproblems and leak hard-problem structure into the supervised corpus. 2.2 Training Algorithm We train Pythagoras-Prover in three stages: supervised fine-tuning on the seed corpus under a difficulty-ordered curriculum, reinforcement learning… view at source ↗
Figure 4
Figure 4. Figure 4: Performance of current state-of-the-art provers on MiniF2F-ALF. As MiniF2F-ALF is introduced in this work, all results are evaluated by us under a unified setup. confirming that ALF mutation induces a non-trivial distribution shift. Pythagoras-Prover-32B retains the highest absolute pass rate of any evaluated model (85.0%); Pythagoras-Prover-4B (83.2%) is almost on par with the 8× larger Goedel-Prover-V2-3… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison on the MiniF2F bench￾mark. ∗ denotes the setting where training to￾kens are restricted to 4096 and evaluation is performed solely at 8192 tokens. points. We attribute this gap to three constraints that together govern diffusion training for formal mathematics: (i) the random-masking objective, which provides a sparser supervisory signal than next-token prediction; (ii) the limited stable context… view at source ↗
Figure 6
Figure 6. Figure 6: Data filtering and error analysis. Left: accepted and rejected instances across data sources. Right: distribution [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Scaling behaviour on MiniF2F test split. We [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-pair outcomes across MiniF2F-Test and its MiniF2F-ALF mutation. [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Domain composition of the at-least￾one-model-wrong set on original MiniF2F. MiniF2F-ALF reopens the benchmark and diversifies the failures. MiniF2F-ALF holds each problem family fixed and asks whether mod￾els remain correct under controlled mutations of the same statement. Under this evaluation the all-model failure rate rises from 9.02% on the original to 13.93% (68/488), and the at-least-one-wrong set ex… view at source ↗
Figure 10
Figure 10. Figure 10: Gradient-norm dynamics of Pythagoras-Prover-Diffusion-4B under the two context-length settings. The [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Categories are taken directly from MiniF2F. The competition-derived slices comprise [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗
read the original abstract

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Pythagoras-Prover, a family of parameter-efficient Lean theorem provers (autoregressive 4B/32B models and a 4B diffusion model) trained via curriculum SFT on a stratified Lean-verified corpus with dynamic proof filtering. It proposes Augmented Lean Formalisation (ALF) to generate mutated formal statements via self-distillation without per-instance verification, claiming this expands training data while preserving formal character. Key results: the 4B model achieves 86.1% pass@32 on MiniF2F-Test (surpassing DeepSeek-Prover-V2-671B's 82.4%), the 32B reaches 93.0% (open-source SOTA) and solves 93/672 PutnamBench problems; a new MiniF2F-ALF benchmark is released where all models drop in accuracy.

Significance. If the performance claims and ALF validity hold, the work demonstrates that smaller models can outperform much larger ones on formal proving benchmarks through data augmentation and curriculum training, advancing practical compute-efficient provers. The release of MiniF2F-ALF as a contamination-sensitive test set and the open-source models are concrete strengths that enable further verification of the results.

major comments (2)
  1. [Abstract / ALF section] Abstract and ALF description (likely §3): the central training pipeline relies on the unverified claim that ALF self-distilled mutations 'preserve their formal character' and logical validity without Lean checking every instance. No empirical validation (e.g., manual audit rate, semantic equivalence checks, or error analysis on a sample of mutations) is reported, which directly undermines the claim that the observed gains reflect improved proof skill rather than noisy or invalid training signal.
  2. [Experimental results section] Experimental results (likely §4, Table on MiniF2F-Test): headline numbers such as 86.1% (4B) and 93.0% (32B) at pass@32 are presented without run-to-run variance, statistical tests, exact train/eval splits, or confirmation that ALF mutations were excluded from evaluation data. This makes it impossible to assess robustness of the comparison to DeepSeek-Prover-V2-671B.
minor comments (2)
  1. [Model description] The diffusion-based 4B prover is mentioned as a proof-of-concept but receives little quantitative comparison to the autoregressive models; clarify its role and results relative to the main claims.
  2. [Training methodology] Notation for pass@k and context-length filtering could be made more precise with explicit equations or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract / ALF section] Abstract and ALF description (likely §3): the central training pipeline relies on the unverified claim that ALF self-distilled mutations 'preserve their formal character' and logical validity without Lean checking every instance. No empirical validation (e.g., manual audit rate, semantic equivalence checks, or error analysis on a sample of mutations) is reported, which directly undermines the claim that the observed gains reflect improved proof skill rather than noisy or invalid training signal.

    Authors: We acknowledge that the manuscript does not report a quantitative error analysis or manual audit of ALF mutations. ALF generates variants via self-distillation from Lean-verified statements using syntactic perturbations intended to maintain logical structure. To address this, the revised manuscript will add an appendix reporting a manual audit of 200 sampled mutations, including the rate at which they type-check and preserve semantic equivalence under Lean. This will provide empirical support for the training signal quality. revision: yes

  2. Referee: [Experimental results section] Experimental results (likely §4, Table on MiniF2F-Test): headline numbers such as 86.1% (4B) and 93.0% (32B) at pass@32 are presented without run-to-run variance, statistical tests, exact train/eval splits, or confirmation that ALF mutations were excluded from evaluation data. This makes it impossible to assess robustness of the comparison to DeepSeek-Prover-V2-671B.

    Authors: The reported figures are from single runs owing to training compute limits. The revised version will explicitly state the train/eval splits, confirm that ALF mutations were generated only from the training corpus (and thus absent from MiniF2F-Test), and note the single-run limitation. We will attempt to add variance from additional runs if resources allow; otherwise the limitation will be stated clearly. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results on held-out benchmarks

full rationale

The paper reports pass rates measured on standard held-out benchmarks (MiniF2F-Test, PutnamBench) after training on a separately constructed corpus that includes ALF-augmented statements. No equations, fitted parameters, or self-citations reduce the reported accuracies to quantities defined by the evaluation data itself. The ALF perturbation claim is an unverified modeling assumption rather than a definitional loop, and the central performance numbers remain independent measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that Lean correctly verifies proofs and that ALF mutations remain logically sound enough to supply useful training signal without formal verification of each instance.

axioms (2)
  • domain assumption The Lean theorem prover correctly verifies the validity of proofs in the corpus and benchmarks.
    Invoked throughout the description of verified data and evaluation.
  • ad hoc to paper Self-distilled mutations generated by ALF preserve the formal character and logical validity of the original statements.
    Required to justify training on unverified mutated instances.

pith-pipeline@v0.9.1-grok · 5930 in / 1315 out tokens · 35418 ms · 2026-06-27T09:54:07.995256+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    URLhttps://aclanthology.org/2025.emnlp-main.1024/

    doi: 10.18653/v1/2025.emnlp-main.1024. URLhttps://aclanthology.org/2025.emnlp-main.1024/. Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison- Burch. Faithful Chain-of-Thought Reasoning. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu 15 Pythagoras-Prover Purwarianti, and Adila ...

  2. [2]

    Thierry Coquand and Christine Paulin-Mohring

    URLhttps://doi.org/10.1016/0890-5401(88)90005-3. Yuri Chervonyi, Trieu H. Trinh, Miroslav Olsák, Xiaomeng Yang, Hoang H. Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V . Le, and Thang Luong. Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2.J. Mach. Learn. Res., 26:241:1–241:39, 2025. URL https://jmlr.org/...

  3. [3]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/ file/1582eaf9e0cf349e1e5a6ee453100aa1-Paper-Datasets_and_Benchmarks_Track.pdf

    doi: 10.52202/079017-0368. URL https://proceedings.neurips.cc/paper_files/paper/2024/ file/1582eaf9e0cf349e1e5a6ee453100aa1-Paper-Datasets_and_Benchmarks_Track.pdf. Joshua Ong Jun Leang, Giwon Hong, Wenda Li, and Shay B Cohen. Theorem Prover as a Judge for Synthetic Data Generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pil...

  4. [4]

    Qwen3 Technical Report

    Notion Blog. Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. URL https://github.com/ huggingface/open-r1. Qwen Team, An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 Technical Report, 2025. URLhttps://arxiv.org/abs/2505.09388. Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongy...

  5. [5]

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al

    URLhttps://openreview.net/forum?id=yaqPf0KAlN. Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions, 2024. URL https://github.com/project-numina/aimo-progr...

  6. [6]

    Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, et al

    URLhttps://openreview.net/forum?id=nZeVKeeFYf9. Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, et al. Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning, 2025b. URLhttps://arxiv.org/abs/2504.11354. Zhihong Shao, Peiyi Wang, Qihao Z...

  7. [7]

    URLhttps://arxiv.org/abs/2402.03300. 17 Pythagoras-Prover Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, juncai liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen...

  8. [8]

    URLhttps://openreview.net/forum?id=AjXkRZIvjB. OpenAI. OpenAI GPT-5 System Card, 2026. URLhttps://arxiv.org/abs/2601.03267. Anthropic. System Card: Claude Opus 4.5.https://anthropic.com/claude-opus-4-5-system-card, 2025. Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goe...

  9. [9]

    Prove that

    URLhttps://doi.org/10.18653/v1/2025.emnlp-demos.44. Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data, 2024. URL https://arxiv.org/abs/2405.14333. Xingguang Ji, Yahui Liu, Qi Wang, Jingyuan Zhang, Yang Yue, Rui ...

  10. [10]

    (e.g., If the error involves Point F, you MUST state the exact coordinates of Point F in the question)

    **COMPLETELY SELF-CONTAINED & SOLVABLE:** The generated question MUST include the actual mathematical values, coordinates, formulas, or premises involved in the failed step. (e.g., If the error involves Point F, you MUST state the exact coordinates of Point F in the question). It must be solvable without looking at the Lean code

  11. [11]

    Frame it purely as a standard math competition or textbook problem

    **PURE MATHEMATICS (NO CODE):** The question and solution MUST NOT contain Lean 4 code, tactics (`simp`,`rw`), Lean keywords (`let`, `have`), or backticks (` `). Frame it purely as a standard math competition or textbook problem

  12. [12]

    WHY" QUESTIONS:** Do not ask

    **NO DIAGNOSTIC OR "WHY" QUESTIONS:** Do not ask "Why did this fail?" or "What is the correct logical approach?". Ask direct mathematical questions like "Calculate X", "Find Y", or "Prove that Z is true given [Premises]."

  13. [13]

    Why does the`intro`tactic fail when applied to Point F?

    **EXPLICIT STEP-BY-STEP SOLUTION:** You must provide a clear, logically sound, step-by-step mathematical solution to the synthetic question you just generated. ### EXAMPLES (Bad vs. Good) - **BAD (Diagnostic/Code-heavy):** "Why does the`intro`tactic fail when applied to Point F?" - **GOOD (Synthetic Math Problem):** "Point A is at (0, 0) and Point B is at...