pith. sign in

arxiv: 2508.03613 · v1 · pith:AXM3Z5VQnew · submitted 2025-08-05 · 💻 cs.LG · cs.AI

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Pith reviewed 2026-05-21 06:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords formal theorem provingLean verifierself-correctionscaffolded data synthesislanguage modelsPutnamBenchMiniF2Fmodel averaging
0
0 comments X

The pith

Scaffolded data synthesis and verifier-guided self-correction let smaller models outperform much larger ones in formal theorem proving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Goedel-Prover-V2 introduces three innovations to scale formal theorem proving with language models. Scaffolded data synthesis creates training tasks that grow harder over time to build skills on complex proofs. Verifier-guided self-correction lets the model fix mistakes using direct feedback from the Lean proof checker. Model averaging combines checkpoints to avoid losing variety in generated proofs. These changes allow an 8B model to beat a 671B model and a 32B model to lead open-source results on MiniF2F and PutnamBench.

Core claim

Goedel-Prover-V2 incorporates scaffolded data synthesis to generate tasks of increasing difficulty, verifier-guided self-correction that allows iterative revision of proofs using Lean compiler feedback, and model averaging to maintain output diversity. This results in the 32B model achieving 88.1% pass@32 on MiniF2F in standard mode and 90.4% with self-correction, while solving 86 problems on PutnamBench at pass@184, establishing new records for open-source models.

What carries the argument

The combination of scaffolded data synthesis for progressively harder theorems, verifier-guided self-correction using Lean compiler feedback, and model averaging to preserve output diversity within the expert iteration pipeline.

If this is right

  • The 8B model reaches 84.6% pass@32 on MiniF2F and outperforms the prior 671B model under the same evaluation settings.
  • Self-correction raises the 32B model's MiniF2F score from 88.1% to 90.4%.
  • The 32B model solves 86 PutnamBench problems at pass@184, more than the prior open-source record of 47 at pass@1024.
  • The models rank among top performers under constrained test-time compute budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on progressively harder synthetic tasks may reduce the need for enormous model sizes in other formal reasoning domains.
  • Self-correction loops driven by compiler feedback could transfer to program synthesis or hardware verification tasks.
  • Maintaining generation diversity through checkpoint averaging may prove useful in any iterative reinforcement-learning setup that risks mode collapse.

Load-bearing premise

The performance gains come from the three specific innovations rather than from differences in total training data scale or hyperparameter choices that were not isolated in the experiments.

What would settle it

Train an otherwise identical model on the same total data volume and compute without the scaffolded difficulty progression or the self-correction loop and measure whether its MiniF2F pass@32 score falls below 88.1%.

read the original abstract

We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Goedel-Prover-V2, a family of open-source language models for formal theorem proving in Lean. It augments the standard expert-iteration plus reinforcement-learning pipeline with three innovations: (1) scaffolded data synthesis that generates synthetic tasks of progressively increasing difficulty, (2) verifier-guided self-correction that iteratively revises proofs using Lean compiler feedback, and (3) model averaging to counteract loss of output diversity in later training stages. The 32B model is reported to reach 88.1% pass@32 on MiniF2F in standard mode and 90.4% with self-correction, while solving 86 problems on PutnamBench at pass@184, outperforming prior open-source SOTA including DeepSeek-Prover-V2-671B.

Significance. If the performance deltas can be attributed to the three listed techniques rather than to unisolated differences in data volume or training procedure, the work would constitute a meaningful advance for open-source automated theorem proving by showing that modest-sized models can surpass much larger ones under constrained test-time compute. The public release of models, code, and data is a clear strength that supports reproducibility and follow-on research.

major comments (1)
  1. [Abstract and §4 (Evaluation)] Abstract and §4 (Evaluation): The central claim that scaffolded synthesis, verifier-guided self-correction, and model averaging are the primary drivers of the reported gains (88.1% → 90.4% on MiniF2F pass@32; 86 vs. prior 47 problems on PutnamBench) is not supported by controlled ablations that hold base-model initialization, total training tokens, and hyper-parameters fixed while toggling each innovation individually. Without such isolation, the deltas could arise from larger synthetic data volume, different RL reward shaping, or benchmark-specific prompting.
minor comments (2)
  1. [Abstract] The abstract states results 'at the time of its release (July-August 2025)'; this date reference should be removed or clarified for archival publication.
  2. [§4] Pass@k settings and exact test-time compute budgets are described for the main results but could be tabulated more explicitly alongside all baselines for easier comparison.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify aspects of our work. We address the major comment below and have revised the manuscript to better contextualize our claims and acknowledge limitations.

read point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] Abstract and §4 (Evaluation): The central claim that scaffolded synthesis, verifier-guided self-correction, and model averaging are the primary drivers of the reported gains (88.1% → 90.4% on MiniF2F pass@32; 86 vs. prior 47 problems on PutnamBench) is not supported by controlled ablations that hold base-model initialization, total training tokens, and hyper-parameters fixed while toggling each innovation individually. Without such isolation, the deltas could arise from larger synthetic data volume, different RL reward shaping, or benchmark-specific prompting.

    Authors: We agree that the manuscript does not include fully controlled ablations that isolate each of the three innovations while holding base-model initialization, total training tokens, and all hyperparameters strictly fixed. Our reported results compare the complete system against prior open-source models and include partial component analyses (e.g., the effect of self-correction on top of a trained base model) in Section 4 and the appendix. However, performing the exact set of retraining experiments suggested would require prohibitive additional compute. In the revised manuscript we add a dedicated limitations paragraph in Section 4 that explicitly discusses potential confounding factors such as data volume and training procedure differences, and we moderate the language in the abstract and evaluation section to state that the combination of techniques yields the observed gains rather than claiming strict isolation of each contribution. All code, data, and training scripts are released to enable the community to conduct such controlled studies. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results on independent public benchmarks

full rationale

The paper reports performance on fixed external benchmarks (MiniF2F, PutnamBench) using a standard expert-iteration plus RL pipeline augmented by three procedural innovations. No equations, fitted parameters, or self-referential definitions reduce the reported pass rates to quantities internal to the training process. The central claims are externally falsifiable and do not rely on load-bearing self-citations or ansatzes that loop back to the paper's own inputs. This is the typical non-circular outcome for an empirical scaling paper whose success metrics lie outside the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance claims rest on the assumption that the described training pipeline (expert iteration plus the three innovations) produces the measured improvements on the cited benchmarks; no new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • domain assumption Standard expert iteration and reinforcement learning pipeline for theorem proving is effective when augmented with the listed techniques.
    Invoked in the abstract as the base upon which the three innovations are added.

pith-pipeline@v0.9.0 · 5960 in / 1306 out tokens · 36622 ms · 2026-05-21T06:48:23.562541+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathAtlas: A Benchmark for Autoformalization in the Wild

    cs.AI 2026-05 accept novelty 8.0

    MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

  2. Advancing Mathematics Research with AI-Driven Formal Proof Search

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.

  3. Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

    math.ST 2026-05 unverdicted novelty 7.0

    s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.

  4. CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

    cs.AI 2026-05 accept novelty 7.0

    CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.

  5. Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

  6. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 7.0

    An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

  7. On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

    cs.AI 2026-05 unverdicted novelty 7.0

    MCPP is a Monte Carlo simulation-based online planner that improves the probability of agentic workflows completing successfully under explicit budget and deadline constraints compared to baselines on CodeFlow and Pro...

  8. Automatic Textbook Formalization

    cs.AI 2026-04 accept novelty 7.0

    Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.

  9. OProver: A Unified Framework for Agentic Formal Theorem Proving

    cs.CL 2026-05 unverdicted novelty 6.0

    OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...

  10. Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

    cs.AI 2026-05 unverdicted novelty 6.0

    Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

  11. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  12. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 6.0

    An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.

  13. On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

    cs.AI 2026-05 unverdicted novelty 6.0

    MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.

  14. Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

    cs.LG 2026-05 unverdicted novelty 6.0

    The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.

  15. Ablation and the Meno: Tools for Empirical Metamathematics

    cs.LO 2026-04 unverdicted novelty 6.0

    Meno and tactic ablation on Tao's Analysis I generate proof populations that embed on low one- or two-dimensional submanifolds far from human constructions in Goedel Prover space.

  16. Scaling Self-Play with Self-Guidance

    cs.LG 2026-04 unverdicted novelty 6.0

    SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

  17. The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

    cs.AI 2026-04 unverdicted novelty 6.0

    The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.

  18. Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

    cs.LG 2026-04 unverdicted novelty 6.0

    Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

  19. A Minimal Agent for Automated Theorem Proving

    cs.AI 2026-02 unverdicted novelty 6.0

    A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

  20. Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

    cs.LG 2025-09 unverdicted novelty 6.0

    LLMs in a conjecturing-proving loop that conditions on their own prior verified Lean proofs discover more hard-to-prove theorems than baselines that generate statements and proofs together.

  21. Pseudo-Formalization for Automatic Proof Verification

    cs.LO 2026-05 unverdicted novelty 5.0

    Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research mat...

  22. Code as Agent Harness

    cs.CL 2026-05 accept novelty 5.0

    A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

  23. OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

    cs.LG 2026-04 unverdicted novelty 5.0

    OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.

  24. On Reasoning-Centric LLM-based Automated Theorem Proving

    cs.SE 2026-04 unverdicted novelty 5.0

    ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 22 Pith papers

  1. [1]

    distribution across multiple files, 3

    incomplete problem statements, 2. distribution across multiple files, 3. multiple theorems per problem, and 4. incompatibility with the commonly used Mathlib. The verification process ensures that each problem contains exactly one formal theorem with its corresponding informal statement, and confirms that all formal statements can pass the compilation wit...

  2. [2]

    11 12 Determine f (4, 1981)

    - 3 := by MathOlympiadBench 1 /-! 2 # International Mathematical Olympiad 1981, Problem 6 3 4 Suppose that f : N × N → N satisfies 5 6 1) f (0, y) = y + 1 7 2) f (x + 1, 0) = f (x, 1), 8 3) f (x + 1, y + 1) = f (x, f (x + 1, y)) 9 10 for all x y ∈ N. 11 12 Determine f (4, 1981). 13 -/ 14 15 def no_eval (x : N) : N := x 16 abbrev solution_value : N := 17 n...

  3. [3]

    In contrast, MathOlympiadBench ensures that both the informal and formal statements consistently correspond to the same version of the problem

    In MiniF2F, the informal statement uses the latter (the nested square root version), but the formal statement is based on the former (the simpler difference of square roots), resulting in a mismatch between the informal and formal statements. In contrast, MathOlympiadBench ensures that both the informal and formal statements consistently correspond to the...

  4. [4]

    Key Elements: The problem’s essential components are correctly represented in LEAN code

  5. [5]

    Mathematical Accuracy: The translation preserves the accuracy of the mathematical content

  6. [6]

    6Please refer to: https://artofproblemsolving.com/wiki/index.php/1962_IMO_ Problems/Problem_2

    Structural Fidelity: The translation aligns closely with the original problem, maintaining its structure and purpose. 6Please refer to: https://artofproblemsolving.com/wiki/index.php/1962_IMO_ Problems/Problem_2. 17 Technical Report

  7. [7]

    Comprehensiveness: All assumptions, conditions, and goals present in the natural language statement are included in the LEAN translation. Your answer should be in the following format: Thought: [Your Answer] Judgement: [Your Answer, one of {Appropriate, Inappropriate}] --- Following are the example problems label for the reasonability of their translation...

  8. [8]

    Prove that

    + (c - 1) ˆ 2 / (c ˆ 2 + 2) = 1 / 2 := by sorry ‘‘‘ 19 Technical Report Thought: The Lean translation of the problem is appropriate because it accurately captures the assumptions and the goal in the natural language statement. Judgement: Appropriate ## Original natural language statement of the problem: {informal_statement} ## Translated formal statement:...

  9. [9]

    (Informal statements generation.) For the informal statements generation, one design choice is whether to generate multiple questions during the same inference, or to generate 22 Technical Report multiple times while only generating 1 (or very few) questions during one inference. From our experiment, we observe that for the Qwen3-32B model, generating mul...

  10. [10]

    Then, we then query Qwen3-8B for 3 times for each formalization, and decide if the formalization is aligned with the informal statement using majority voting (among three queries)

    (Formalization and quality checking.) For each generated informal statement, we call the trained formalizer to formalize the problem twice. Then, we then query Qwen3-8B for 3 times for each formalization, and decide if the formalization is aligned with the informal statement using majority voting (among three queries). We decide to formalize each informal...

  11. [11]

    fast thinking

    (Negation and difficulty filtering.) For each formalization, we query Qwen3-32B for 4 times, where each inference judges the correctness and the simplicity simultaneously. The final correctness is judged by strict majority voting among the 4 judges, while the final simplicity is determined if all 4 judges think the problem is easy. We use such strict crit...

  12. [12]

    E RL T RAINING DETAILS We further explain our RL training in detail

    (Final deduplication.) At the end of the pipeline, we filter out duplicated statements under exact match. E RL T RAINING DETAILS We further explain our RL training in detail. E.1 RL I MPLEMENTATION QuestionAnswerQuestionTool UsingAnswerFeedback QuestionFeedbackSelf-Correction InputRollout(1) Vanilla RL(2) Tool-assistedRL (3) Ours QuestionAnswer Single-Tas...