Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Pith reviewed 2026-05-21 06:48 UTC · model grok-4.3
The pith
Scaffolded data synthesis and verifier-guided self-correction let smaller models outperform much larger ones in formal theorem proving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Goedel-Prover-V2 incorporates scaffolded data synthesis to generate tasks of increasing difficulty, verifier-guided self-correction that allows iterative revision of proofs using Lean compiler feedback, and model averaging to maintain output diversity. This results in the 32B model achieving 88.1% pass@32 on MiniF2F in standard mode and 90.4% with self-correction, while solving 86 problems on PutnamBench at pass@184, establishing new records for open-source models.
What carries the argument
The combination of scaffolded data synthesis for progressively harder theorems, verifier-guided self-correction using Lean compiler feedback, and model averaging to preserve output diversity within the expert iteration pipeline.
If this is right
- The 8B model reaches 84.6% pass@32 on MiniF2F and outperforms the prior 671B model under the same evaluation settings.
- Self-correction raises the 32B model's MiniF2F score from 88.1% to 90.4%.
- The 32B model solves 86 PutnamBench problems at pass@184, more than the prior open-source record of 47 at pass@1024.
- The models rank among top performers under constrained test-time compute budgets.
Where Pith is reading between the lines
- Training on progressively harder synthetic tasks may reduce the need for enormous model sizes in other formal reasoning domains.
- Self-correction loops driven by compiler feedback could transfer to program synthesis or hardware verification tasks.
- Maintaining generation diversity through checkpoint averaging may prove useful in any iterative reinforcement-learning setup that risks mode collapse.
Load-bearing premise
The performance gains come from the three specific innovations rather than from differences in total training data scale or hyperparameter choices that were not isolated in the experiments.
What would settle it
Train an otherwise identical model on the same total data volume and compute without the scaffolded difficulty progression or the self-correction loop and measure whether its MiniF2F pass@32 score falls below 88.1%.
read the original abstract
We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Goedel-Prover-V2, a family of open-source language models for formal theorem proving in Lean. It augments the standard expert-iteration plus reinforcement-learning pipeline with three innovations: (1) scaffolded data synthesis that generates synthetic tasks of progressively increasing difficulty, (2) verifier-guided self-correction that iteratively revises proofs using Lean compiler feedback, and (3) model averaging to counteract loss of output diversity in later training stages. The 32B model is reported to reach 88.1% pass@32 on MiniF2F in standard mode and 90.4% with self-correction, while solving 86 problems on PutnamBench at pass@184, outperforming prior open-source SOTA including DeepSeek-Prover-V2-671B.
Significance. If the performance deltas can be attributed to the three listed techniques rather than to unisolated differences in data volume or training procedure, the work would constitute a meaningful advance for open-source automated theorem proving by showing that modest-sized models can surpass much larger ones under constrained test-time compute. The public release of models, code, and data is a clear strength that supports reproducibility and follow-on research.
major comments (1)
- [Abstract and §4 (Evaluation)] Abstract and §4 (Evaluation): The central claim that scaffolded synthesis, verifier-guided self-correction, and model averaging are the primary drivers of the reported gains (88.1% → 90.4% on MiniF2F pass@32; 86 vs. prior 47 problems on PutnamBench) is not supported by controlled ablations that hold base-model initialization, total training tokens, and hyper-parameters fixed while toggling each innovation individually. Without such isolation, the deltas could arise from larger synthetic data volume, different RL reward shaping, or benchmark-specific prompting.
minor comments (2)
- [Abstract] The abstract states results 'at the time of its release (July-August 2025)'; this date reference should be removed or clarified for archival publication.
- [§4] Pass@k settings and exact test-time compute budgets are described for the main results but could be tabulated more explicitly alongside all baselines for easier comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify aspects of our work. We address the major comment below and have revised the manuscript to better contextualize our claims and acknowledge limitations.
read point-by-point responses
-
Referee: [Abstract and §4 (Evaluation)] Abstract and §4 (Evaluation): The central claim that scaffolded synthesis, verifier-guided self-correction, and model averaging are the primary drivers of the reported gains (88.1% → 90.4% on MiniF2F pass@32; 86 vs. prior 47 problems on PutnamBench) is not supported by controlled ablations that hold base-model initialization, total training tokens, and hyper-parameters fixed while toggling each innovation individually. Without such isolation, the deltas could arise from larger synthetic data volume, different RL reward shaping, or benchmark-specific prompting.
Authors: We agree that the manuscript does not include fully controlled ablations that isolate each of the three innovations while holding base-model initialization, total training tokens, and all hyperparameters strictly fixed. Our reported results compare the complete system against prior open-source models and include partial component analyses (e.g., the effect of self-correction on top of a trained base model) in Section 4 and the appendix. However, performing the exact set of retraining experiments suggested would require prohibitive additional compute. In the revised manuscript we add a dedicated limitations paragraph in Section 4 that explicitly discusses potential confounding factors such as data volume and training procedure differences, and we moderate the language in the abstract and evaluation section to state that the combination of techniques yields the observed gains rather than claiming strict isolation of each contribution. All code, data, and training scripts are released to enable the community to conduct such controlled studies. revision: yes
Circularity Check
No circularity; empirical results on independent public benchmarks
full rationale
The paper reports performance on fixed external benchmarks (MiniF2F, PutnamBench) using a standard expert-iteration plus RL pipeline augmented by three procedural innovations. No equations, fitted parameters, or self-referential definitions reduce the reported pass rates to quantities internal to the training process. The central claims are externally falsifiable and do not rely on load-bearing self-citations or ansatzes that loop back to the paper's own inputs. This is the typical non-circular outcome for an empirical scaling paper whose success metrics lie outside the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard expert iteration and reinforcement learning pipeline for theorem proving is effective when augmented with the listed techniques.
Forward citations
Cited by 24 Pith papers
-
MathAtlas: A Benchmark for Autoformalization in the Wild
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
-
Advancing Mathematics Research with AI-Driven Formal Proof Search
LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.
-
Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models
s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.
-
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP is a Monte Carlo simulation-based online planner that improves the probability of agentic workflows completing successfully under explicit budget and deadline constraints compared to baselines on CodeFlow and Pro...
-
Automatic Textbook Formalization
Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.
-
Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.
-
Ablation and the Meno: Tools for Empirical Metamathematics
Meno and tactic ablation on Tao's Analysis I generate proof populations that embed on low one- or two-dimensional submanifolds far from human constructions in Goedel Prover space.
-
Scaling Self-Play with Self-Guidance
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
-
The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data
The topological dual of a dataset is introduced as a transformation that encodes logical structures into topological ones to expose invariants in neural latent spaces for AlphaGeometry-style reasoning.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
A Minimal Agent for Automated Theorem Proving
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
-
Discovering New Theorems via LLMs with In-Context Proof Learning in Lean
LLMs in a conjecturing-proving loop that conditions on their own prior verified Lean proofs discover more hard-to-prove theorems than baselines that generate statements and proofs together.
-
Pseudo-Formalization for Automatic Proof Verification
Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research mat...
-
Code as Agent Harness
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
-
OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving
OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.
-
On Reasoning-Centric LLM-based Automated Theorem Proving
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
Reference graph
Works this paper leans on
-
[1]
distribution across multiple files, 3
incomplete problem statements, 2. distribution across multiple files, 3. multiple theorems per problem, and 4. incompatibility with the commonly used Mathlib. The verification process ensures that each problem contains exactly one formal theorem with its corresponding informal statement, and confirms that all formal statements can pass the compilation wit...
work page 1981
-
[2]
- 3 := by MathOlympiadBench 1 /-! 2 # International Mathematical Olympiad 1981, Problem 6 3 4 Suppose that f : N × N → N satisfies 5 6 1) f (0, y) = y + 1 7 2) f (x + 1, 0) = f (x, 1), 8 3) f (x + 1, y + 1) = f (x, f (x + 1, y)) 9 10 for all x y ∈ N. 11 12 Determine f (4, 1981). 13 -/ 14 15 def no_eval (x : N) : N := x 16 abbrev solution_value : N := 17 n...
work page 1981
-
[3]
In MiniF2F, the informal statement uses the latter (the nested square root version), but the formal statement is based on the former (the simpler difference of square roots), resulting in a mismatch between the informal and formal statements. In contrast, MathOlympiadBench ensures that both the informal and formal statements consistently correspond to the...
-
[4]
Key Elements: The problem’s essential components are correctly represented in LEAN code
-
[5]
Mathematical Accuracy: The translation preserves the accuracy of the mathematical content
-
[6]
6Please refer to: https://artofproblemsolving.com/wiki/index.php/1962_IMO_ Problems/Problem_2
Structural Fidelity: The translation aligns closely with the original problem, maintaining its structure and purpose. 6Please refer to: https://artofproblemsolving.com/wiki/index.php/1962_IMO_ Problems/Problem_2. 17 Technical Report
-
[7]
Comprehensiveness: All assumptions, conditions, and goals present in the natural language statement are included in the LEAN translation. Your answer should be in the following format: Thought: [Your Answer] Judgement: [Your Answer, one of {Appropriate, Inappropriate}] --- Following are the example problems label for the reasonability of their translation...
-
[8]
+ (c - 1) ˆ 2 / (c ˆ 2 + 2) = 1 / 2 := by sorry ‘‘‘ 19 Technical Report Thought: The Lean translation of the problem is appropriate because it accurately captures the assumptions and the goal in the natural language statement. Judgement: Appropriate ## Original natural language statement of the problem: {informal_statement} ## Translated formal statement:...
-
[9]
(Informal statements generation.) For the informal statements generation, one design choice is whether to generate multiple questions during the same inference, or to generate 22 Technical Report multiple times while only generating 1 (or very few) questions during one inference. From our experiment, we observe that for the Qwen3-32B model, generating mul...
-
[10]
(Formalization and quality checking.) For each generated informal statement, we call the trained formalizer to formalize the problem twice. Then, we then query Qwen3-8B for 3 times for each formalization, and decide if the formalization is aligned with the informal statement using majority voting (among three queries). We decide to formalize each informal...
-
[11]
(Negation and difficulty filtering.) For each formalization, we query Qwen3-32B for 4 times, where each inference judges the correctness and the simplicity simultaneously. The final correctness is judged by strict majority voting among the 4 judges, while the final simplicity is determined if all 4 judges think the problem is easy. We use such strict crit...
-
[12]
E RL T RAINING DETAILS We further explain our RL training in detail
(Final deduplication.) At the end of the pipeline, we filter out duplicated statements under exact match. E RL T RAINING DETAILS We further explain our RL training in detail. E.1 RL I MPLEMENTATION QuestionAnswerQuestionTool UsingAnswerFeedback QuestionFeedbackSelf-Correction InputRollout(1) Vanilla RL(2) Tool-assistedRL (3) Ours QuestionAnswer Single-Tas...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.