Benchmarking Testing in Automated Theorem Proving
Pith reviewed 2026-05-08 06:03 UTC · model grok-4.3
The pith
A new test using dependent theorems shows AI models generate statements that compile yet support only 39 percent of successor proofs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a generated theorem counts as semantically correct only if all its dependent successor theorems compile successfully without error. They automatically extract a benchmark of 2,206 problems paired with 41 successors on average from five public Lean 4 repositories. Experiments demonstrate that models perform far worse on this successor-based test than on simple compilation, with Claude-Sonnet-4.5 reaching only 38.9 percent testing accuracy on the full set when given both natural language proof and successor theorems as context.
What carries the argument
The T framework, which defines the semantic correctness of a generated theorem by requiring that every successor theorem depending on it compiles and proves successfully.
If this is right
- Theorem generation systems must be judged on whether their outputs enable proofs of dependent statements rather than on standalone compilation alone.
- Existing evaluations based on lexical overlap or isolated compilation overestimate the practical usefulness of current models for building formal libraries.
- The successor-based benchmark can be extended automatically to additional Lean repositories without further human annotation.
- Future model improvements should target the ability to produce theorems that integrate cleanly into existing proof chains.
Where Pith is reading between the lines
- Training objectives that directly optimize for successor compatibility could close the observed gap between compilation and semantic correctness.
- The same successor-testing idea could be adapted to other proof assistants to create scalable, low-cost evaluations of AI-generated formal content.
- If models improve under this metric, it would support more autonomous extension of large formal libraries with less human intervention.
Load-bearing premise
Successful compilation of all dependent successor theorems is a sufficient indicator that the generated theorem is semantically correct.
What would settle it
A human expert review of a random sample of generated theorems that pass the successor compilation test, measuring how often they contain undetected semantic errors.
Figures
read the original abstract
Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the T evaluation framework for assessing semantic correctness of theorems generated by LLMs for formal theorem proving. A generated theorem is deemed correct only if all its dependent successor theorems (extracted automatically from real Lean 4 repositories) compile successfully. The benchmark comprises 2,206 problems from 5 repositories, each paired with ~41 successors on average. Experiments report that while models achieve high compilation success rates, the best model (Claude-Sonnet-4.5) reaches only 38.9% Testing Accuracy on the full set even with natural-language proof and successor context, highlighting a gap in current theorem-generation capabilities.
Significance. If the core metric holds, the work offers a scalable, human-effort-free alternative to lexical or manual evaluation by adapting integration-testing ideas to theorem proving. The automatic extraction from real repositories and the resulting benchmark are clear strengths that enable reproducible assessment of semantic (vs. merely syntactic) correctness. This could usefully shift focus in the field toward models that generate theorems whose consequences integrate cleanly.
major comments (1)
- [§3] §3 (T framework definition): The central claim that successful compilation of all successor theorems is a sufficient proxy for semantic correctness of the generated theorem is not supported by evidence in the manuscript. Compilation verifies only that the statement is well-typed and that dependents can reference it without type errors; it does not establish that the theorem is mathematically true, matches the supplied natural-language intent, or that an erroneous statement would reliably cause successor failures. No human validation study, correlation with ground-truth correctness, or analysis of false-positive/negative cases is reported, which directly undermines the interpretation of the 38.9% Testing Accuracy gap as a genuine capability deficit rather than a metric artifact.
minor comments (2)
- [Abstract] Abstract and §4: The exact formula for 'Testing Accuracy' (e.g., whether it is a strict all-successors-pass rate or allows partial credit) should be stated explicitly with an equation or pseudocode before reporting the 38.9% figure.
- [§4] §4.2 (model comparison): Clarify the prompting setup for the 'given both natural language proof and successor theorems as context' condition and report variance across multiple runs or seeds to support the significance of the reported performance gap.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us better articulate the scope and limitations of the T framework. We address the major comment point by point below and have made revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (T framework definition): The central claim that successful compilation of all successor theorems is a sufficient proxy for semantic correctness of the generated theorem is not supported by evidence in the manuscript. Compilation verifies only that the statement is well-typed and that dependents can reference it without type errors; it does not establish that the theorem is mathematically true, matches the supplied natural-language intent, or that an erroneous statement would reliably cause successor failures. No human validation study, correlation with ground-truth correctness, or analysis of false-positive/negative cases is reported, which directly undermines the interpretation of the 38.9% Testing Accuracy gap as a genuine capability deficit rather than a metric artifact.
Authors: We agree that the T metric functions as a proxy for semantic correctness rather than a direct verification of mathematical truth or exact natural-language intent. As the referee notes, successful compilation of dependents confirms type compatibility but does not guarantee the generated statement is true or identical to the intended meaning. We positioned T as an integration-testing analogue: in formal libraries, a statement that allows all dependent theorems to compile without type errors is a practical indicator that its semantics are sufficiently preserved for the existing development. In the revised manuscript, we have expanded §3 with an explicit limitations subsection that discusses this proxy nature, potential failure cases (e.g., equivalent reformulations or unreferenced theorems), and the distinction from full semantic verification. We have also added a small-scale manual analysis of false-positive and false-negative rates on a random sample of 100 problems, which we report in the revision. A comprehensive human validation study correlating T scores with expert judgments is a valuable next step but lies beyond the current scope given the benchmark size. revision: partial
Circularity Check
No circularity in definition or evaluation of the T metric.
full rationale
The paper explicitly defines its semantic metric T as the requirement that all automatically extracted successor theorems compile successfully after the generated theorem is added, constructs the benchmark by extracting 41 successors on average from five Lean 4 repositories without human annotation, and reports empirical Testing Accuracy (e.g., 38.9% for Claude-Sonnet-4.5) under this definition. No equation or claim reduces a reported result to a fitted parameter, renames a prior result, or relies on a self-citation chain whose validity is presupposed by the present work. The comparison between raw compilation success and T accuracy is a direct measurement against the independently defined proxy, leaving the framework self-contained as a proposed benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Compilation success of successor theorems serves as a valid proxy for semantic correctness
invented entities (1)
-
T evaluation framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Evaluating large language models trained on code.Preprint, arXiv:2107.03374. H. B. Curry. 1934. Functionality in combinatory logic. Proceedings of the National Academy of Sciences of the United States of America, 20(11):584–590. Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, and Bin Dong. 2025. Her- ald: A natural language annotat...
work page internal anchor Pith review arXiv 1934
-
[2]
InInternational Conference on Learning Representations
Minif2f: a cross-system benchmark for for- mal olympiad-level mathematics. InInternational Conference on Learning Representations. A Model Details Family Models Claude (Anthropic, 2025a, 2024, 2025b) Claude-3.7-Sonnet Claude-4-Sonnet Claude-Sonnet-4.5 GPT (OpenAI, 2024; Singh et al., 2025) GPT-4o-mini GPT-5 GPT-5-Mini GPT-5-Nano GPT-OSS (Agarwal et al., 2...
2024
-
[3]
• Consider the context fromCode BeforeandCode Afterif relevant
ATHOUGHTsection where you explain your reasoning: • Analyze theTarget Codeto understand what it proves or defines. • Consider the context fromCode BeforeandCode Afterif relevant. • Plan the natural language description
-
[4]
The function determines the sum of two integers
A single plaintext code block with your description. Format your responses like this: <format_example> Natural Language THOUGHT : The target code defines a function \ texttt { foo } that adds two numbers . I will describe it as " The function determines the sum of two integers ." Natural Language The function determines the sum of two integers . </format_...
-
[5]
• Plan the implementation
ATHOUGHTsection where you explain your reasoning: • Analyze the Code Before and Code After to deduce the purpose and signature of {{ target_code_name }}. • Plan the implementation
-
[6]
Format your responses like this: <format_example> Natural Language THOUGHT : The context shows a function \ texttt { foo } is used to add two numbers
A single lean code block with your implementation. Format your responses like this: <format_example> Natural Language THOUGHT : The context shows a function \ texttt { foo } is used to add two numbers . I will implement \ texttt { foo } as a definition taking two Nats and returning their sum . Lean 4 def foo ( n m :Nat) :Nat:= n + m </format_example> CRIT...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.