Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao · 2025 · cs.LG · arXiv 2508.03613

Canonical reference. 100% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background: 100% of classified citations

abstract

We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.

citation-role summary

background: 5

citation-polarity summary

background: 5

representative citing papers

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

cs.AI · 2026-06-08 · unverdicted · novelty 8.0

TheoremBench is a Lean4 benchmark of classical theorems in main and premised forms that evaluates LLM provers on partial progress, coverage, and token efficiency rather than binary success on competition problems.

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · novelty 8.0

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

LAMP: Lean-based Agentic framework with MCP and Proof Repair

cs.LO · 2026-06-27 · conditional · novelty 7.0

LAMP achieves 96.7% success generating verified Lean proofs for 90 Combinatorics on Words theorems by coordinating Planner, Builder, and Verifier agents with a CoW ontology accessed through Model Context Protocol.

Hypothesis-Disciplined Multi-Agent Automated Formalization of Asymptotic Statistical Theory

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

A hypothesis-disciplined multi-agent pipeline in Lean 4 produces axiom-clean, source-faithful formalizations of parametric and semi-parametric asymptotic distribution and efficiency theorems.

Advancing Mathematics Research with AI-Driven Formal Proof Search

cs.AI · 2026-05-21 · conditional · novelty 7.0 · 2 refs

An LLM-based agent with Lean verification autonomously solved multiple open Erdős problems and OEIS conjectures in the first large-scale test.

Pseudo-Formalization for Automatic Proof Verification

cs.LO · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

Pseudo-Formalization decomposes proofs into self-contained natural language modules for independent LLM-based Block Verification, outperforming LLM-as-judge baselines on olympiad and research math benchmarks while releasing ArxivMathGradingBench.

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

math.ST · 2026-05-18 · unverdicted · novelty 7.0

s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

cs.AI · 2026-05-17 · accept · novelty 7.0

CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

ProofRank benchmark shows substantial differences in LLM proof quality not captured by correctness, with trade-offs between quality metrics and accuracy.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Automatic Textbook Formalization

cs.AI · 2026-04-03 · accept · novelty 7.0

Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

A 400-entry benchmark and protocol shows tool-augmented agents reach 89.5% compilation but only 60.5% consensus faithfulness, with a 29-point gap; elaboration feedback improves validity most but increases unfaithful compiles.

Automating Formal Verification with Agent-Guided Tree Search

cs.LO · 2026-05-26 · unverdicted · novelty 6.0

Agent-directed tree search improves LLM performance on Lean formal verification tasks, with context-based orchestration solving more intermediate specs at lower token cost than baseline agents.

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.

Ablation and the Meno: Tools for Empirical Metamathematics

cs.LO · 2026-04-24 · unverdicted · novelty 6.0

Meno and tactic ablation on Tao's Analysis I generate proof populations that embed on low one- or two-dimensional submanifolds far from human constructions in Goedel Prover space.

Scaling Self-Play with Self-Guidance

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

A Minimal Agent for Automated Theorem Proving

cs.AI · 2026-02-27 · unverdicted · novelty 6.0

A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

R³L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

cs.LG · 2026-01-07 · unverdicted · novelty 6.0

R³L combines reflect-then-retry exploration, pivotal credit assignment, and positive amplification in RL for LLMs, reporting 5-52% relative gains on agentic and reasoning tasks with stable training.

citing papers explorer

Showing 12 of 12 citing papers after filters.

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics
cs.AI · 2026-06-08 · unverdicted · none · ref 18 · internal anchor
TheoremBench is a Lean4 benchmark of classical theorems in main and premised forms that evaluates LLM provers on partial progress, coverage, and token efficiency rather than binary success on competition problems.

Hypothesis-Disciplined Multi-Agent Automated Formalization of Asymptotic Statistical Theory
cs.AI · 2026-06-03 · unverdicted · none · ref 19 · internal anchor
A hypothesis-disciplined multi-agent pipeline in Lean 4 produces axiom-clean, source-faithful formalizations of parametric and semi-parametric asymptotic distribution and efficiency theorems.

AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI · 2026-05-07 · unverdicted · none · ref 27 · 2 links · internal anchor
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization
cs.AI · 2026-06-30 · unverdicted · none · ref 41 · internal anchor
A 400-entry benchmark and protocol shows tool-augmented agents reach 89.5% compilation but only 60.5% consensus faithfulness, with a 29-point gap; elaboration feedback improves validity most but increases unfaithful compiles.

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
cs.AI · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
cs.AI · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
cs.AI · 2026-05-07 · unverdicted · none · ref 5 · 2 links · internal anchor
MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.

A Minimal Agent for Automated Theorem Proving
cs.AI · 2026-02-27 · unverdicted · none · ref 22 · internal anchor
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
cs.AI · 2025-10-14 · unverdicted · none · ref 39 · internal anchor
Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an expert with a cryptography proof.

Nothing from Something: Can a Language Model Discover 0?
cs.AI · 2026-06-15 · unverdicted · none · ref 24 · internal anchor
Language models require explicit examples to learn zero in arithmetic but language pretraining halves the examples needed.

Agentic Proving for Program Verification
cs.AI · 2026-05-22 · unverdicted · none · ref 19 · internal anchor
Agentic Claude reaches 98.8% valid specs, 87.5% implementation certification, and 98.1% end-to-end success on CLEVER, revealing a mismatch between benchmark difficulty and current prover performance.

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery
cs.AI · 2026-06-07 · unverdicted · none · ref 207 · internal anchor
An integrated survey organizing AI mathematical reasoning into informal, formal, discovery, and technique axes while cataloging benchmarks and assessing failure modes.
