Agentic LLM framework autoformalizes 32 Putnam problems and main theorems plus proofs from five STOC papers into Lean 4, with two proofs using only kernel axioms.
hub
Seed-prover: Deep and broad reasoning for automated theorem proving
21 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.
A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.
CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.
Lean Atlas visualizes Lean 4 dependency graphs and applies Lean Compass to reduce the nodes needing human semantic review by 27-99% across six evaluated projects.
CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.
Training Qwen3-8B on symbolic execution traces from Soteria improves violation detection in C programs by over 17 points, transfers across five property types, and shows superadditive gains with chain-of-thought.
Network analysis of Mathlib reveals 50.9% coupling between human taxonomies and logical dependencies, median 1.6% import usage by developers, and centrality driven by infrastructure rather than mathematical content.
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an expert with a cryptography proof.
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
An agentic theorem prover in Lean uses a control plane to route actions based on cost and success estimates, achieving 28.9% lower average cost than a fixed-step baseline on a PutnamBench subset while preserving performance.
A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.
RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
Agentic Claude reaches 98.8% valid specs, 87.5% implementation certification, and 98.1% end-to-end success on CLEVER, revealing a mismatch between benchmark difficulty and current prover performance.
citing papers explorer
-
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Agentic LLM framework autoformalizes 32 Putnam problems and main theorems plus proofs from five STOC papers into Lean 4, with two proofs using only kernel axioms.