Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Dimitar I. Dimitrov; Ivo Petrov; Jasper Dekoninck; Martin Vechev

arxiv: 2605.10379 · v2 · pith:QWIGJSQNnew · submitted 2026-05-11 · 💻 cs.CL

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Ivo Petrov , Jasper Dekoninck , Dimitar I. Dimitrov , Martin Vechev This is my paper

Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM mathematical reasoningproof quality evaluationLLM benchmarksconcisenesscognitive simplicityproof diversityadaptivity in proofs

0 comments

The pith

LLM-generated proofs differ substantially in quality beyond mere correctness, with measurable trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can solve hard math problems with correct proofs, but these proofs still differ in clarity, brevity, and how easily they transfer to new cases. The paper introduces ProofRank, a benchmark built from competition problems that scores proofs using five automatic proxies for quality. Experiments across models reveal large gaps in these scores that correctness benchmarks miss entirely. The data also show clear conflicts, where higher quality on the proxies often reduces the chance of getting the proof right at all. This implies that future tests of LLM reasoning need to track how useful the actual proofs turn out to be.

Core claim

The central claim is that correctness alone fails to capture important differences in LLM proof quality, as shown by substantial model-to-model variation and systematic trade-offs on the ProofRank benchmark's five proxies for conciseness, computational ease, cognitive simplicity, diversity, and adaptivity.

What carries the argument

ProofRank benchmark and its five scalable proxies that separately score proof quality independent of correctness.

If this is right

Correctness-only benchmarks miss substantial differences in proof quality across models.
Significant trade-offs exist between the quality proxies and correctness rates.
Future evaluations of mathematical reasoning in LLMs should measure how useful the generated proofs are.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These proxies could be turned into training signals to push models toward more practical proofs.
The same approach might apply to evaluating LLM outputs in formal verification or code generation.
Human preference studies could test whether the automatic scores align with what mathematicians actually value.

Load-bearing premise

The five chosen proxies accurately reflect aspects of proof quality that matter to human readers and can be measured objectively at scale.

What would settle it

A large-scale study in which expert mathematicians consistently prefer proofs that the proxies score as low-quality over those scored high-quality would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.10379 by Dimitar I. Dimitrov, Ivo Petrov, Jasper Dekoninck, Martin Vechev.

**Figure 2.** Figure 2: Main results for several models. Experimental results As shown in [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Tradeoff between accuracy and conciseness under different prompting strategies. Results We run these ablations on the openweight models GPT-OSS-120B, DEEPSEEKV3.2, and KIMI-K2.5-THINK, as well as the proprietary model GEMINI-3-FLASH. In Figure 3, we show how conciseness and accuracy change under the different prompting strategies. As expected, solutions become substantially more concise under the alter… view at source ↗

**Figure 4.** Figure 4: Elo changes when only considering valid rephrasings for the conciseness metric. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Elo changes when only considering valid rephrasings for the conciseness metric. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Distribution of the number of solutions per problem in our dataset. For each problem, we collect human-written solution attempts from public Art of Problem Solving (AoPS) posts or the official competition sources. We discard problems for which no public solution could be found. This filtering removed approximately 100 problems from the original IMO-ANSWERBENCH dataset of 400 problems. In addition, some I… view at source ↗

read the original abstract

Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces ProofRank and five proxies to move LLM proof eval past correctness, but the proxies for cognitive simplicity and adaptivity rest on unvalidated automatic rules.

read the letter

The main contribution is a new benchmark, ProofRank, built from competition problems, plus five scalable proxies meant to capture proof quality beyond right-or-wrong: conciseness by step count, computational ease by calculation load, cognitive simplicity by technique accessibility, diversity by output variation on the same problem, and adaptivity by adherence to a requested technique. They report that models differ on these dimensions even when correct and that some trade off against correctness. That framing is useful because it matches what people actually want from a proof in teaching or research assistance contexts. The curation from real contests gives the benchmark some grounding that synthetic datasets often lack. The central claim holds up at the level of motivation: binary correctness clearly misses important variation. The soft spots sit in the proxies. The abstract defines them but supplies no concrete scoring rules, thresholds, or any check against human expert ratings, especially for cognitive simplicity and adaptivity. If those turn out to be surface heuristics (token counts, simple keyword lists, or LLM self-scores), the reported differences and trade-offs could be artifacts rather than evidence of usefulness. No statistical tests or error analysis appear in the summary either, so it is hard to judge how stable the findings are. This is for researchers building or benchmarking LLM mathematical reasoning systems who want to move past pass/fail tables. A reader looking for a concrete starting point on multi-dimensional evaluation will find the benchmark idea worth examining, even if the current metrics need tightening. It deserves peer review because the direction is sound and the benchmark itself could be adopted and improved by others, provided the authors add validation studies and full implementation details.

Referee Report

3 major / 2 minor

Summary. The paper introduces ProofRank, a benchmark of challenging competition problems, to evaluate LLM-generated proofs using five automatic proxies for quality beyond binary correctness: conciseness (step count), computational ease (calculation intensity), cognitive simplicity (technique accessibility), diversity (output variation across samples), and adaptivity (adherence to a specified technique). It claims that models exhibit substantial differences in these dimensions not captured by correctness-only metrics and that significant trade-offs exist between proof-quality proxies and correctness rates.

Significance. If the proxies can be shown to be reproducible and to correlate with human judgments of proof usefulness, the work would meaningfully advance LLM evaluation in mathematical reasoning by shifting focus from correctness alone to practical utility, clarity, and transferability. The empirical benchmark approach itself is a constructive contribution.

major comments (3)

[Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.
[Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.
[Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.

minor comments (2)

[Methods] Clarify whether the proxies are fully automatic or involve any LLM-as-judge components, and provide pseudocode or formulas for each metric.
[Results tables] Tables comparing models should report effect sizes or p-values alongside raw differences to support the “substantial differences” claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. The comments highlight important aspects for improving the reproducibility and validity of our benchmark. We respond to each major comment point by point below.

read point-by-point responses

Referee: [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.

Authors: We agree that the proxy definitions require more concrete details to support reproducibility. The manuscript describes the proxies conceptually, but we will revise the relevant sections to include specific heuristics (e.g., for conciseness: automated step counting via proof tree parsing with a threshold of fewer than 10 steps; for cognitive simplicity: a scoring system based on technique frequency in competition solutions), thresholds, the models used for any LLM-assisted scoring, and a link to the open-source code implementing these proxies. This will allow readers to assess and reproduce the measurements. revision: yes
Referee: [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.

Authors: We recognize the value of validating the proxies against human judgments. Our focus was on developing scalable automatic proxies that can be applied broadly without human intervention. We did not include a full correlation study in this work, which is a limitation. In the revision, we will add statistical tests (e.g., t-tests for model differences) and confidence intervals to the results. We will also include a discussion of proxy validation and commit to conducting a human evaluation study in follow-up work. We believe the current proxies are grounded in established mathematical values but acknowledge the need for empirical validation. revision: partial
Referee: [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.

Authors: We agree that additional details on the benchmark construction are essential. The revised manuscript will expand the Benchmark section to describe: the criteria for selecting problems from competitions (focusing on those with known multiple proof approaches and difficulty level), the exact prompting templates used for generating proofs, the sampling strategy (e.g., number of samples per problem, temperature settings for diversity measurement), and the operationalization of adaptivity (including the technique specification in prompts and the method for checking adherence, such as through rule-based matching or secondary LLM classification). These additions will facilitate reproducibility and bias assessment. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark introduction with no derivation chain or self-referential reductions

full rationale

The paper presents ProofRank as an empirical benchmark that defines five independent proxies for proof quality (conciseness, computational ease, cognitive simplicity, diversity, adaptivity) and applies them to LLM outputs on competition problems. No mathematical derivation, first-principles prediction, or fitted parameter is claimed; the work consists of metric definitions followed by observational comparisons across models. No equations reduce to inputs by construction, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. The central claims rest on the empirical measurements themselves rather than any closed loop.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the listed proxies can be measured scalably and meaningfully reflect proof quality valued by readers.

free parameters (1)

Proxy measurement rules
Exact rules for quantifying conciseness, cognitive simplicity, etc., are not specified in the abstract and likely involve implementation choices.

axioms (1)

domain assumption Mathematical proofs have measurable qualities beyond correctness such as conciseness and cognitive simplicity that are concrete and broadly valued.
Explicitly stated in the abstract as the foundation for creating the benchmark.

pith-pipeline@v0.9.0 · 5513 in / 1284 out tokens · 31828 ms · 2026-05-12T04:43:51.181080+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify such components and introduce PROOFRANK, a benchmark curated from challenging mathematical competitions. PROOFRANK evaluates several scalable proxies of proof quality: (i) conciseness... (v) adaptivity...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We use pairwise comparisons... Bradley-Terry ranking... correctness as a prerequisite

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.