Benchmarking Testing in Automated Theorem Proving

Hojae Han; Jongyoon Kim; Seung-won Hwang

arxiv: 2604.23698 · v1 · submitted 2026-04-26 · 💻 cs.CL · cs.FL

Benchmarking Testing in Automated Theorem Proving

Jongyoon Kim , Hojae Han , Seung-won Hwang This is my paper

Pith reviewed 2026-05-08 06:03 UTC · model grok-4.3

classification 💻 cs.CL cs.FL

keywords automated theorem provinglarge language modelsLean 4semantic evaluationbenchmarkformal verificationtheorem generationproof assistants

0 comments

The pith

A new test using dependent theorems shows AI models generate statements that compile yet support only 39 percent of successor proofs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an evaluation framework that treats a generated formal theorem as correct only when every theorem depending on it still compiles and proves successfully. This method works like integration testing by checking whether the new statement fits into the existing library without breaking downstream proofs. The authors built an automatic benchmark from five real Lean 4 repositories with 2,206 problems, each linked to an average of 41 successors. Experiments reveal that top models reach high standalone compilation rates but drop sharply under this semantic test, with the strongest model achieving just 38.9 percent accuracy even when supplied with natural language hints and the successor theorems themselves. A reader cares because the result indicates that current AI tools cannot yet be trusted to extend or maintain large formal mathematics libraries reliably.

Core claim

The authors establish that a generated theorem counts as semantically correct only if all its dependent successor theorems compile successfully without error. They automatically extract a benchmark of 2,206 problems paired with 41 successors on average from five public Lean 4 repositories. Experiments demonstrate that models perform far worse on this successor-based test than on simple compilation, with Claude-Sonnet-4.5 reaching only 38.9 percent testing accuracy on the full set when given both natural language proof and successor theorems as context.

What carries the argument

The T framework, which defines the semantic correctness of a generated theorem by requiring that every successor theorem depending on it compiles and proves successfully.

If this is right

Theorem generation systems must be judged on whether their outputs enable proofs of dependent statements rather than on standalone compilation alone.
Existing evaluations based on lexical overlap or isolated compilation overestimate the practical usefulness of current models for building formal libraries.
The successor-based benchmark can be extended automatically to additional Lean repositories without further human annotation.
Future model improvements should target the ability to produce theorems that integrate cleanly into existing proof chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that directly optimize for successor compatibility could close the observed gap between compilation and semantic correctness.
The same successor-testing idea could be adapted to other proof assistants to create scalable, low-cost evaluations of AI-generated formal content.
If models improve under this metric, it would support more autonomous extension of large formal libraries with less human intervention.

Load-bearing premise

Successful compilation of all dependent successor theorems is a sufficient indicator that the generated theorem is semantically correct.

What would settle it

A human expert review of a random sample of generated theorems that pass the successor compilation test, measuring how often they contain undetected semantic errors.

Figures

Figures reproduced from arXiv: 2604.23698 by Hojae Han, Jongyoon Kim, Seung-won Hwang.

**Figure 1.** Figure 1: Two candidate theorems for proving commu view at source ↗

**Figure 2.** Figure 2: An example problem of T 2 . The target tGT (algebraMap_apply) has predecessors Tpred, (e.g. definition of InfiniteAdeleRing), and successors Tsucc that invoke it (e.g. mixedEmbedding_eq_algebraMap_comp). ing to the generated candidate theorem tfl. We denote the theorems tGT depends on as predecessors Tpred “ tt p1q pred, . . . , tpmq predu, and the theorems that transitively depend on tGT as successors… view at source ↗

**Figure 5.** Figure 5: (a) Distribution of successor theorem (ST) view at source ↗

**Figure 4.** Figure 4: Proportion of testing-verified (green) and view at source ↗

**Figure 6.** Figure 6: Compilation accuracy and testing accuracy on view at source ↗

**Figure 7.** Figure 7: Prompt template for informalizing Lean 4 code. view at source ↗

**Figure 8.** Figure 8: Prompt template for autoformalizing to Lean 4 code. view at source ↗

read the original abstract

Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

T gives a practical automatic benchmark for generated theorems by checking successor compilation, but the claim that this measures semantic correctness rests on an unvalidated proxy.

read the letter

The paper adapts the idea of integration testing from code generation to Lean theorem statements. They build a benchmark by pulling 2,206 problems from five real repositories and automatically extracting an average of 41 successor theorems per problem. A generated theorem passes only if every successor still compiles. The experiments show a clear drop: models that compile their own output reasonably well fall to 38.9 percent under this stricter test for the top model when given both the natural-language description and the successors.

Referee Report

1 major / 2 minor

Summary. The paper proposes the T evaluation framework for assessing semantic correctness of theorems generated by LLMs for formal theorem proving. A generated theorem is deemed correct only if all its dependent successor theorems (extracted automatically from real Lean 4 repositories) compile successfully. The benchmark comprises 2,206 problems from 5 repositories, each paired with ~41 successors on average. Experiments report that while models achieve high compilation success rates, the best model (Claude-Sonnet-4.5) reaches only 38.9% Testing Accuracy on the full set even with natural-language proof and successor context, highlighting a gap in current theorem-generation capabilities.

Significance. If the core metric holds, the work offers a scalable, human-effort-free alternative to lexical or manual evaluation by adapting integration-testing ideas to theorem proving. The automatic extraction from real repositories and the resulting benchmark are clear strengths that enable reproducible assessment of semantic (vs. merely syntactic) correctness. This could usefully shift focus in the field toward models that generate theorems whose consequences integrate cleanly.

major comments (1)

[§3] §3 (T framework definition): The central claim that successful compilation of all successor theorems is a sufficient proxy for semantic correctness of the generated theorem is not supported by evidence in the manuscript. Compilation verifies only that the statement is well-typed and that dependents can reference it without type errors; it does not establish that the theorem is mathematically true, matches the supplied natural-language intent, or that an erroneous statement would reliably cause successor failures. No human validation study, correlation with ground-truth correctness, or analysis of false-positive/negative cases is reported, which directly undermines the interpretation of the 38.9% Testing Accuracy gap as a genuine capability deficit rather than a metric artifact.

minor comments (2)

[Abstract] Abstract and §4: The exact formula for 'Testing Accuracy' (e.g., whether it is a strict all-successors-pass rate or allows partial credit) should be stated explicitly with an equation or pseudocode before reporting the 38.9% figure.
[§4] §4.2 (model comparison): Clarify the prompting setup for the 'given both natural language proof and successor theorems as context' condition and report variance across multiple runs or seeds to support the significance of the reported performance gap.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us better articulate the scope and limitations of the T framework. We address the major comment point by point below and have made revisions to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (T framework definition): The central claim that successful compilation of all successor theorems is a sufficient proxy for semantic correctness of the generated theorem is not supported by evidence in the manuscript. Compilation verifies only that the statement is well-typed and that dependents can reference it without type errors; it does not establish that the theorem is mathematically true, matches the supplied natural-language intent, or that an erroneous statement would reliably cause successor failures. No human validation study, correlation with ground-truth correctness, or analysis of false-positive/negative cases is reported, which directly undermines the interpretation of the 38.9% Testing Accuracy gap as a genuine capability deficit rather than a metric artifact.

Authors: We agree that the T metric functions as a proxy for semantic correctness rather than a direct verification of mathematical truth or exact natural-language intent. As the referee notes, successful compilation of dependents confirms type compatibility but does not guarantee the generated statement is true or identical to the intended meaning. We positioned T as an integration-testing analogue: in formal libraries, a statement that allows all dependent theorems to compile without type errors is a practical indicator that its semantics are sufficiently preserved for the existing development. In the revised manuscript, we have expanded §3 with an explicit limitations subsection that discusses this proxy nature, potential failure cases (e.g., equivalent reformulations or unreferenced theorems), and the distinction from full semantic verification. We have also added a small-scale manual analysis of false-positive and false-negative rates on a random sample of 100 problems, which we report in the revision. A comprehensive human validation study correlating T scores with expert judgments is a valuable next step but lies beyond the current scope given the benchmark size. revision: partial

Circularity Check

0 steps flagged

No circularity in definition or evaluation of the T metric.

full rationale

The paper explicitly defines its semantic metric T as the requirement that all automatically extracted successor theorems compile successfully after the generated theorem is added, constructs the benchmark by extracting 41 successors on average from five Lean 4 repositories without human annotation, and reports empirical Testing Accuracy (e.g., 38.9% for Claude-Sonnet-4.5) under this definition. No equation or claim reduces a reported result to a fitted parameter, renames a prior result, or relies on a self-citation chain whose validity is presupposed by the present work. The comparison between raw compilation success and T accuracy is a direct measurement against the independently defined proxy, leaving the framework self-contained as a proposed benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim depends on the domain assumption about Lean's compilation reflecting semantic dependencies and the accuracy of the automatic benchmark extraction process.

axioms (1)

domain assumption Compilation success of successor theorems serves as a valid proxy for semantic correctness
This underpins the entire T framework and is stated in the abstract as the definition of correctness.

invented entities (1)

T evaluation framework no independent evidence
purpose: To assess semantic correctness of generated theorems via dependent compilation
The framework is introduced in this paper as a new evaluation approach.

pith-pipeline@v0.9.0 · 5472 in / 1408 out tokens · 60391 ms · 2026-05-08T06:03:05.644919+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Evaluating large language models trained on code.Preprint, arXiv:2107.03374. H. B. Curry. 1934. Functionality in combinatory logic. Proceedings of the National Academy of Sciences of the United States of America, 20(11):584–590. Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, and Bin Dong. 2025. Her- ald: A natural language annotat...

work page internal anchor Pith review arXiv 1934
[2]

InInternational Conference on Learning Representations

Minif2f: a cross-system benchmark for for- mal olympiad-level mathematics. InInternational Conference on Learning Representations. A Model Details Family Models Claude (Anthropic, 2025a, 2024, 2025b) Claude-3.7-Sonnet Claude-4-Sonnet Claude-Sonnet-4.5 GPT (OpenAI, 2024; Singh et al., 2025) GPT-4o-mini GPT-5 GPT-5-Mini GPT-5-Nano GPT-OSS (Agarwal et al., 2...

2024
[3]

• Consider the context fromCode BeforeandCode Afterif relevant

ATHOUGHTsection where you explain your reasoning: • Analyze theTarget Codeto understand what it proves or defines. • Consider the context fromCode BeforeandCode Afterif relevant. • Plan the natural language description
[4]

The function determines the sum of two integers

A single plaintext code block with your description. Format your responses like this: <format_example> Natural Language THOUGHT : The target code defines a function \ texttt { foo } that adds two numbers . I will describe it as " The function determines the sum of two integers ." Natural Language The function determines the sum of two integers . </format_...
[5]

• Plan the implementation

ATHOUGHTsection where you explain your reasoning: • Analyze the Code Before and Code After to deduce the purpose and signature of {{ target_code_name }}. • Plan the implementation
[6]

Format your responses like this: <format_example> Natural Language THOUGHT : The context shows a function \ texttt { foo } is used to add two numbers

A single lean code block with your implementation. Format your responses like this: <format_example> Natural Language THOUGHT : The context shows a function \ texttt { foo } is used to add two numbers . I will implement \ texttt { foo } as a definition taking two Nats and returning their sum . Lean 4 def foo ( n m :Nat) :Nat:= n + m </format_example> CRIT...

[1] [1]

Evaluating large language models trained on code.Preprint, arXiv:2107.03374. H. B. Curry. 1934. Functionality in combinatory logic. Proceedings of the National Academy of Sciences of the United States of America, 20(11):584–590. Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, and Bin Dong. 2025. Her- ald: A natural language annotat...

work page internal anchor Pith review arXiv 1934

[2] [2]

InInternational Conference on Learning Representations

Minif2f: a cross-system benchmark for for- mal olympiad-level mathematics. InInternational Conference on Learning Representations. A Model Details Family Models Claude (Anthropic, 2025a, 2024, 2025b) Claude-3.7-Sonnet Claude-4-Sonnet Claude-Sonnet-4.5 GPT (OpenAI, 2024; Singh et al., 2025) GPT-4o-mini GPT-5 GPT-5-Mini GPT-5-Nano GPT-OSS (Agarwal et al., 2...

2024

[3] [3]

• Consider the context fromCode BeforeandCode Afterif relevant

ATHOUGHTsection where you explain your reasoning: • Analyze theTarget Codeto understand what it proves or defines. • Consider the context fromCode BeforeandCode Afterif relevant. • Plan the natural language description

[4] [4]

The function determines the sum of two integers

A single plaintext code block with your description. Format your responses like this: <format_example> Natural Language THOUGHT : The target code defines a function \ texttt { foo } that adds two numbers . I will describe it as " The function determines the sum of two integers ." Natural Language The function determines the sum of two integers . </format_...

[5] [5]

• Plan the implementation

ATHOUGHTsection where you explain your reasoning: • Analyze the Code Before and Code After to deduce the purpose and signature of {{ target_code_name }}. • Plan the implementation

[6] [6]

Format your responses like this: <format_example> Natural Language THOUGHT : The context shows a function \ texttt { foo } is used to add two numbers

A single lean code block with your implementation. Format your responses like this: <format_example> Natural Language THOUGHT : The context shows a function \ texttt { foo } is used to add two numbers . I will implement \ texttt { foo } as a definition taking two Nats and returning their sum . Lean 4 def foo ( n m :Nat) :Nat:= n + m </format_example> CRIT...