TheoremBench is a Lean4 benchmark of classical theorems in main and premised forms that evaluates LLM provers on partial progress, coverage, and token efficiency rather than binary success on competition problems.
MagnusHammer: A Transformer-Based Approach to Premise Selection
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4roles
background 1polarities
unclear 1representative citing papers
Event-B Agent is an LLM agent that synthesizes, refines, and repairs Event-B formal models from natural language requirements via iterative verification feedback loops.
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
citing papers explorer
-
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.