TheoremBench is a Lean 4 benchmark of classical theorems in main and premised forms that evaluates LLM provers on partial progress, coverage, and token efficiency rather than binary success on competition problems.
arXiv preprint arXiv:2407.03203
5 Pith papers cite this work.
Years: 2026 (5)
Representative citing papers
An orchestrator-driven agentic pipeline using general coding LLMs autoformalizes 32 PutnamBench problems and the main theorems plus proofs from five STOC papers into Lean 4, with two proofs using only the kernel.
Goedel-Architect introduces blueprint generation and iterative refinement for Lean 4 theorem proving, reaching 99.2% on MiniF2F-test and 75.6% on PutnamBench with DeepSeek-V4-Flash.
Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.
LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall performance remains far below human-authored specs.