pith. machine review for the scientific record

arxiv: 2605.10379 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM mathematical reasoning · proof quality evaluation · LLM benchmarks · conciseness · cognitive simplicity · proof diversity · adaptivity in proofs

The pith

LLM-generated proofs differ substantially in quality beyond mere correctness, with measurable trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can solve hard math problems with correct proofs, but these proofs still differ in clarity, brevity, and how easily they transfer to new cases. The paper introduces ProofRank, a benchmark built from competition problems that scores proofs using five automatic proxies for quality. Experiments across models reveal large gaps in these scores that correctness benchmarks miss entirely. The data also show clear conflicts, where higher quality on the proxies often reduces the chance of getting the proof right at all. This implies that future tests of LLM reasoning need to track how useful the actual proofs turn out to be.

Core claim

The central claim is that correctness alone fails to capture important differences in LLM proof quality, as shown by substantial model-to-model variation and systematic trade-offs on the ProofRank benchmark's five proxies for conciseness, computational ease, cognitive simplicity, diversity, and adaptivity.

What carries the argument

ProofRank benchmark and its five scalable proxies that separately score proof quality independent of correctness.
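
The paper does not spell out how the five proxy scores are computed or combined; as a rough sketch of the shape such a per-proof scoring record could take (every field name, the 0-to-1 normalisation, and the aggregation below are assumptions for illustration, not the authors' scheme):

```python
from dataclasses import dataclass

@dataclass
class ProofScores:
    """Hypothetical per-proof record: one scalar per ProofRank proxy, plus correctness.
    The paper defines the proxies only conceptually; these fields and the 0-1
    normalisation are illustrative assumptions, not the authors' scheme."""
    correct: bool                 # binary correctness, as in existing benchmarks
    conciseness: float            # 1.0 = no unnecessary steps
    computational_ease: float     # 1.0 = no tedious calculation
    cognitive_simplicity: float   # 1.0 = only widely accessible techniques
    diversity: float              # variation across samples for the same problem
    adaptivity: float             # adherence to a requested proof technique

def quality_profile(proofs: list[ProofScores]) -> dict[str, float]:
    """Average each proxy over a model's correct proofs, keeping accuracy separate
    so the trade-off the paper reports (proxy quality vs. solve rate) stays visible."""
    correct = [p for p in proofs if p.correct]
    profile = {"accuracy": len(correct) / len(proofs) if proofs else 0.0}
    if not correct:
        return profile
    for axis in ("conciseness", "computational_ease", "cognitive_simplicity",
                 "diversity", "adaptivity"):
        profile[axis] = sum(getattr(p, axis) for p in correct) / len(correct)
    return profile
```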

If this is right

  • Correctness-only benchmarks miss substantial differences in proof quality across models.
  • Significant trade-offs exist between the quality proxies and correctness rates.
  • Future evaluations of mathematical reasoning in LLMs should measure how useful the generated proofs are.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These proxies could be turned into training signals to push models toward more practical proofs.
  • The same approach might apply to evaluating LLM outputs in formal verification or code generation.
  • Human preference studies could test whether the automatic scores align with what mathematicians actually value.

Load-bearing premise

The five chosen proxies accurately reflect aspects of proof quality that matter to human readers and can be measured objectively at scale.

What would settle it

A large-scale study in which expert mathematicians consistently prefer proofs that the proxies score as low-quality over those scored high-quality would falsify the central claim.
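
One concrete form such a settling study could take is a rank-agreement test: score the same proofs with a proxy and with pooled expert preferences, then check whether the two orderings correlate. A minimal sketch under that assumption (the made-up data and the choice of Kendall's tau are illustrative, not part of the paper):

```python
from scipy.stats import kendalltau

def proxy_vs_expert_agreement(proxy_scores: list[float],
                              expert_ratings: list[float]) -> tuple[float, float]:
    """Kendall rank correlation between a proxy and pooled expert preferences
    for the same set of proofs. A near-zero or negative correlation with a small
    p-value would be the kind of evidence that falsifies the proxy."""
    tau, p_value = kendalltau(proxy_scores, expert_ratings)
    return tau, p_value

# Illustrative call with invented numbers: five proofs scored by the conciseness
# proxy (0-1) and rated 1-10 by experts.
tau, p = proxy_vs_expert_agreement([0.9, 0.4, 0.7, 0.2, 0.8], [8, 5, 7, 3, 9])
```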

Figures

Figures reproduced from arXiv: 2605.10379 by Dimitar I. Dimitrov, Ivo Petrov, Jasper Dekoninck, Martin Vechev.

Figure 1. Example of proof quality. Even for the same problem, correct LLM-generated proofs can…
Figure 2. Main results for several models.
Figure 3. Tradeoff between accuracy and conciseness under different prompting strategies. The ablations are run on the open-weight models GPT-OSS-120B, DEEPSEEK-V3.2, and KIMI-K2.5-THINK, as well as the proprietary model GEMINI-3-FLASH; as expected, solutions become substantially more concise under the alternative prompting strategies…
Figure 4. Elo changes when only considering valid rephrasings for the conciseness metric.
Figure 5. Elo changes when only considering valid rephrasings for the conciseness metric.
Figure 6. Distribution of the number of solutions per problem in the dataset. For each problem, human-written solution attempts are collected from public Art of Problem Solving (AoPS) posts or the official competition sources, and problems for which no public solution could be found are discarded; this filtering removed approximately 100 problems from the original IMO-ANSWERBENCH dataset of 400 problems…
Original abstract

Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ProofRank, a benchmark of challenging competition problems, to evaluate LLM-generated proofs using five automatic proxies for quality beyond binary correctness: conciseness (step count), computational ease (calculation intensity), cognitive simplicity (technique accessibility), diversity (output variation across samples), and adaptivity (adherence to a specified technique). It claims that models exhibit substantial differences in these dimensions not captured by correctness-only metrics and that significant trade-offs exist between proof-quality proxies and correctness rates.

Significance. If the proxies can be shown to be reproducible and to correlate with human judgments of proof usefulness, the work would meaningfully advance LLM evaluation in mathematical reasoning by shifting focus from correctness alone to practical utility, clarity, and transferability. The empirical benchmark approach itself is a constructive contribution.

major comments (3)
  1. [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.
  2. [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.
  3. [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.
minor comments (2)
  1. [Methods] Clarify whether the proxies are fully automatic or involve any LLM-as-judge components, and provide pseudocode or formulas for each metric (an illustrative sketch of one such metric follows this list).
  2. [Results tables] Tables comparing models should report effect sizes or p-values alongside raw differences to support the “substantial differences” claim.
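
For illustration of what such pseudocode might look like, here is one hypothetical formalisation of the conciseness proxy as a length ratio against a reference solution; the step segmentation and the normalisation are assumptions for this sketch, not the paper's definition.

```python
def conciseness_score(model_steps: int, reference_steps: int) -> float:
    """Hypothetical conciseness proxy: ratio of the reference proof's length to the
    model proof's length, capped at 1.0. A proof no longer than the shortest known
    human solution scores 1.0; padded proofs score proportionally lower.
    How a "step" is segmented is itself the implementation choice the referee
    is asking the authors to pin down."""
    if model_steps <= 0:
        return 0.0
    return min(1.0, reference_steps / model_steps)
```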

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. The comments highlight important aspects for improving the reproducibility and validity of our benchmark. We respond to each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract / Proxy definitions] Abstract and proxy definitions: the five proxies are defined at a high level (e.g., cognitive simplicity via “technique accessibility,” adaptivity via “technique adherence”), but no concrete heuristics, thresholds, models, or code for automatic computation are supplied. This is load-bearing for the central claims, as the reported differences and trade-offs cannot be assessed or reproduced without the measurement rules.

    Authors: We agree that the proxy definitions require more concrete details to support reproducibility. The manuscript describes the proxies conceptually, but we will revise the relevant sections to include specific heuristics (e.g., for conciseness: automated step counting via proof tree parsing with a threshold of fewer than 10 steps; for cognitive simplicity: a scoring system based on technique frequency in competition solutions), thresholds, the models used for any LLM-assisted scoring, and a link to the open-source code implementing these proxies. This will allow readers to assess and reproduce the measurements. revision: yes

  2. Referee: [Results / Evaluation] Results and evaluation sections: no correlation study, inter-rater agreement, or validation against human expert ratings is reported for the proxies (particularly cognitive simplicity and adaptivity), nor any statistical tests, confidence intervals, or error analysis on the model comparisons. Without this, it remains unclear whether observed differences reflect genuine proof-quality variation or proxy artifacts.

    Authors: We recognize the value of validating the proxies against human judgments. Our focus was on developing scalable automatic proxies that can be applied broadly without human intervention. We did not include a full correlation study in this work, which is a limitation. In the revision, we will add statistical tests (e.g., t-tests for model differences) and confidence intervals to the results. We will also include a discussion of proxy validation and commit to conducting a human evaluation study in follow-up work. We believe the current proxies are grounded in established mathematical values but acknowledge the need for empirical validation. revision: partial

  3. Referee: [Benchmark construction] Benchmark construction: details on problem selection from competitions, proof-generation prompting, sampling strategy for diversity, and how “specified proof technique” is operationalized for adaptivity are missing, preventing assessment of dataset bias or reproducibility.

    Authors: We agree that additional details on the benchmark construction are essential. The revised manuscript will expand the Benchmark section to describe: the criteria for selecting problems from competitions (focusing on those with known multiple proof approaches and difficulty level), the exact prompting templates used for generating proofs, the sampling strategy (e.g., number of samples per problem, temperature settings for diversity measurement), and the operationalization of adaptivity (including the technique specification in prompts and the method for checking adherence, such as through rule-based matching or secondary LLM classification). These additions will facilitate reproducibility and bias assessment. revision: yes
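
As a sketch of how the diversity proxy described above could be computed once technique labels are available for repeated samples of one problem (the labelling step and the normalisation below are placeholder assumptions, not the authors' protocol):

```python
def diversity_score(technique_labels: list[str]) -> float:
    """Hypothetical diversity proxy: normalised count of distinct proof 'backbones'
    among k samples for one problem. 0.0 means every sample used the same approach;
    1.0 means every sample differed. The labels would come from whatever classifier
    (rule-based or a secondary LLM judge) the authors adopt for technique identification."""
    k = len(technique_labels)
    if k <= 1:
        return 0.0
    distinct = len(set(technique_labels))
    return (distinct - 1) / (k - 1)

# Example: four samples, two distinct backbones -> 1/3.
score = diversity_score(["induction", "induction", "generating function", "induction"])
```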

Circularity Check

0 steps flagged

Empirical benchmark introduction with no derivation chain or self-referential reductions

Full rationale

The paper presents ProofRank as an empirical benchmark that defines five independent proxies for proof quality (conciseness, computational ease, cognitive simplicity, diversity, adaptivity) and applies them to LLM outputs on competition problems. No mathematical derivation, first-principles prediction, or fitted parameter is claimed; the work consists of metric definitions followed by observational comparisons across models. No equations reduce to inputs by construction, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. The central claims rest on the empirical measurements themselves rather than any closed loop.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the listed proxies can be measured scalably and meaningfully reflect proof quality valued by readers.

free parameters (1)
  • Proxy measurement rules
    Exact rules for quantifying conciseness, cognitive simplicity, etc., are not specified in the abstract and likely involve implementation choices.
axioms (1)
  • domain assumption Mathematical proofs have measurable qualities beyond correctness such as conciseness and cognitive simplicity that are concrete and broadly valued.
    Explicitly stated in the abstract as the foundation for creating the benchmark.

pith-pipeline@v0.9.0 · 5513 in / 1284 out tokens · 31828 ms · 2026-05-12T04:43:51.181080+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · 7 internal anchors

  1. [1]

    Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with ai, 2025

    Johannes Schmitt. Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with ai, 2025. URL https://arxiv.org/abs/2512.14575

  2. [2]

    Lattice-valued bottleneck duality, 2024

    Robert Ghrist, Julian Gould, and Miguel Lopez. Lattice-valued bottleneck duality, 2024. URL https://arxiv.org/abs/2410.00315

  3. [3]

    From euler to ai: Unifying formulas for mathematical constants, 2025

    Tomer Raz, Michael Shalyt, Elyasheev Leibtag, Rotem Kalisch, Shachar Weinbaum, Yaron Hadad, and Ido Kaminer. From euler to ai: Unifying formulas for mathematical constants, 2025. URLhttps://arxiv.org/abs/2502.17533

  4. [4]

    Ivo Petrov, Jasper Dekoninck, Lyuben Baltadzhiev, Maria Drencheva, Kristian Minchev, Mislav Balunovic, Nikola Jovanovic, and Martin T. Vechev. Proof or bluff? evaluating llms on 2025 USA math olympiad.CoRR, abs/2503.21934, 2025. doi: 10 .48550/ARXIV.2503.21934. URL https://doi.org/10.48550/arXiv.2503.21934

  5. [5]

    Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin T. Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, and Margulan Ismoldayev. The open proof corpus: A large-scale study of llm-generated mathematical proofs...

  6. [6]

    Thomas, and Charles Vial

    Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck, Jeremy Feusi, Tim Gehrunger, Raphael Appenzeller, Jim Bryan, Niklas Canova, Timo de Wolff, Filippo Gaia, Michel van Garrel, Baran Hashemi, David Holmes, Aitor Iribar Lopez, Victor Jaeck, Martina Jørgensen, Steven Kelk, Stefan Kuhlmann, Adam Kurpisz, Chiara Meroni, Ingmar Metzler, Martin Möller, Samuel Mu...

  7. [7]

    Hamed Mahdavi, Alireza Hashemi, Majid Daliri, Pegah Mohammadipour, Alireza Farhadi, Samira Malek, Yekta Yazdanifard, Amir Khasahmadi, and Vasant G. Honavar. Brains vs. bytes: Evaluating LLM proficiency in olympiad mathematics.CoRR, abs/2504.01995, 2025. doi: 10.48550/ARXIV.2504.01995. URLhttps://doi.org/10.48550/arXiv.2504.01995

  8. [8]

    Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V . Le, and Junehyuk Jung. Towards robust mathematical reasoning.CoRR, abs/2511.01846, 2025. ...

  9. [9]

    Mathematical methods and human thought in the age of ai,

    Tanya Klowden and Terence Tao. Mathematical methods and human thought in the age of ai,

  10. [10]

    URLhttps://arxiv.org/abs/2603.26524

  11. [11]

    Mathematical capabilities of chatgpt

    Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference o...

  12. [12]

    Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yumeng Wang, and Yi R. Fung. Mathematical proof as a litmus test: Revealing failure modes of advanced large reasoning models.CoRR, abs/2506.17114, 2025. doi: 10 .48550/ARXIV.2506.17114. URL https: //doi.org/10.48550/arXiv.2506.17114

  13. [13]

    Putnam-like dataset summary: Llms as mathematical competition contestants.CoRR, abs/2509.24827, 2025

    Bartosz Bieganowski, Daniel Strzelecki, Robert Skiba, and Mateusz Topolewski. Putnam-like dataset summary: Llms as mathematical competition contestants.CoRR, abs/2509.24827, 2025. doi: 10.48550/ARXIV.2509.24827. URLhttps://doi.org/10.48550/arXiv.2509.24827

  14. [14]

    Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Alekse- jevs, Todor Antic, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppány István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika...

  15. [15]

    Lemmabench: A live, research-level benchmark to evaluate LLM capabilities in mathematics.CoRR, abs/2602.24173, 2026

    Antoine Peyronnet, Fabian Gloeckle, and Amaury Hayat. Lemmabench: A live, research-level benchmark to evaluate LLM capabilities in mathematics.CoRR, abs/2602.24173, 2026. doi: 10.48550/ARXIV.2602.24173. URLhttps://doi.org/10.48550/arXiv.2602.24173

  16. [16]

    Ivo Petrov, Jasper Dekoninck, and Martin T. Vechev. Brokenmath: A benchmark for sycophancy in theorem proving with llms.CoRR, abs/2510.04721, 2025. doi: 10.48550/ARXIV.2510.04721. URLhttps://doi.org/10.48550/arXiv.2510.04721

  17. [17]

    Reliable fine-grained evaluation of natural language math proofs.CoRR, abs/2510.13888, 2025

    Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, and Sewon Min. Reliable fine-grained evaluation of natural language math proofs.CoRR, abs/2510.13888, 2025. doi: 10 .48550/ARXIV.2510.13888. URL https://doi.org/10.48550/arXiv.2510.13888

  18. [18]

    Solving inequality proofs with large language models.CoRR, abs/2506.07927, 2025

    Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, and Pan Lu. Solving inequality proofs with large language models.CoRR, abs/2506.07927, 2025. doi: 10 .48550/ ARXIV.2506.07927. URLhttps://doi.org/10.48550/arXiv.2506.07927

  19. [19]

    Hard2verify: A step-level verification benchmark for open-ended frontier math.CoRR, abs/2510.13744, 2025

    Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, and Shafiq Joty. Hard2verify: A step-level verification benchmark for open-ended frontier math.CoRR, abs/2510.13744, 2025. doi: 10 .48550/ARXIV.2510.13744. URL https://doi.org/10.48550/ arXiv.2510.13744

  20. [20]

    Hamed Mahdavi, Pouria Mahdavinia, Samira Malek, Pegah Mohammadipour, Alireza Hashemi, Majid Daliri, Alireza Farhadi, Amir Khasahmadi, Niloofar Mireshghallah, and Vasant G. Honavar. Refgrader: Automated grading of mathematical competition proofs using agentic workflows.CoRR, abs/2510.09021, 2025. doi: 10 .48550/ARXIV.2510.09021. URL https: //doi.org/10.485...

  21. [21]

    Evaluating language models for mathematics through interactions.CoRR, abs/2306.01694, 2023

    Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, and Mateja Jamnik. Evaluating language models for mathematics through interactions.CoRR, abs/2306.01694, 2023. doi: 10 .48550/ARXIV.2306.01694. URL https://do...

  22. [22]

    Reliablemath: Benchmark of reliable mathemati- cal reasoning on large language models.CoRR, abs/2507.03133, 2025

    Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, and Kam-Fai Wong. Reliablemath: Benchmark of reliable mathemati- cal reasoning on large language models.CoRR, abs/2507.03133, 2025. doi: 10 .48550/ ARXIV.2507.03133. URLhttps://doi.org/10.48550/arXiv.2507.03133. 11

  23. [23]

    Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J. Bell. Abstentionbench: Reasoning llms fail on unanswerable questions.CoRR, abs/2506.09038, 2025. doi: 10 .48550/ ARXIV.2506.09038. URLhttps://doi.org/10.48550/arXiv.2506.09038

  24. [24]

    Interactive Benchmarks

    Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, and Mengdi Wang. Interactive benchmarks.CoRR, abs/2603.04737, 2026. doi: 10 .48550/ARXIV.2603.04737. URL https: //doi.org/10.48550/arXiv.2603.04737

  25. [25]

    GAUSS: benchmarking structured mathematical skills for large language models.CoRR, abs/2509.18122, 2025

    Yue Zhang, Jiaxin Zhang, Qiuyu Ren, Tahsin Saffat, Xiaoxuan Liu, Zitong Yang, Banghua Zhu, and Yi Ma. GAUSS: benchmarking structured mathematical skills for large language models.CoRR, abs/2509.18122, 2025. doi: 10 .48550/ARXIV.2509.18122. URL https: //doi.org/10.48550/arXiv.2509.18122

  26. [26]

    Marlene Steinbach, Shreya Bhandari, Jennifer Meyer, and Zach A. Pardos. When llms hal- lucinate: Examining the effects of erroneous feedback in math tutoring systems. In Caitlin Mills, Giora Alexandron, Davide Taibi, Giosuè Lo Bosco, and Luc Paquette, editors,Proceed- ings of the 18th International Conference on Educational Data Mining, EDM 2025, Palermo,...

  27. [27]

    A scoping survey of chatgpt in mathematics education.Digital Experiences in Mathematics Education, 11:9–41, 2025

    Birgit Pepin, Nils Buchholtz, and Ulises Salinas-Hernández. A scoping survey of chatgpt in mathematics education.Digital Experiences in Mathematics Education, 11:9–41, 2025. doi: 10.1007/s40751-025-00172-1. URLhttps://doi.org/10.1007/s40751-025-00172-1

  28. [28]

    Chatgpt in school mathemat- ics education: A systematic review of opportunities, challenges, and pedagogical implications

    Muhammad Turmuzi, Syahrul Azmi, and Ni Made Intan Kertiyani. Chatgpt in school mathemat- ics education: A systematic review of opportunities, challenges, and pedagogical implications. Teaching and Teacher Education, 170:105286, 2026. doi: 10.1016/j.tate.2025.105286. URL https://doi.org/10.1016/j.tate.2025.105286

  29. [29]

    Generating in-context, personalized feedback for intelligent tutors with large language models, 2025

    Jennifer M. Reddig, Arav Arora, and Christopher J. MacLellan. Generating in-context, personalized feedback for intelligent tutors with large language models.Int. J. Artif. In- tell. Educ., 35(6):3459–3500, 2025. doi: 10 .1007/S40593-025-00505-6. URL https: //doi.org/10.1007/s40593-025-00505-6

  30. [30]

    The lean 4 theorem prover and programming language, 2021

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InCADE, volume 12699 ofLecture Notes in Computer Science, pages 625–635. Springer, 2021. doi: 10 .1007/978-3-030-79876-5 _37. URL https://doi.org/10.1007/978- 3-030-79876-5_37

  31. [31]

    Hilbert: Recursively building formal proofs with informal reasoning.CoRR, abs/2509.22819, 2025

    Sumanth Varambally, Thomas V oice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. Hilbert: Recursively building formal proofs with informal reasoning.CoRR, abs/2509.22819, 2025. doi: 10.48550/ARXIV.2509.22819. URLhttps://doi.org/10.48550/arXiv.2509.22819

  32. [32]

    Numina-lean-agent: An open and general agentic reasoning system for formal mathematics.arXiv preprint arXiv:2601.14027,

    Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li. Numina-lean- agent: An open and general agentic reasoning system for formal mathematics, 2026. URL https://arxiv.org/abs/2601.14027

  33. [33]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction, 2025. URL ht...

  34. [34]

    Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jen- nings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem- provers on the putnam mathematical competition. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, ed- itors,Advances in ...

  35. [35]

    arXiv:2302.12433 [cs.CL] https://arxiv.org/abs/2302.12433

    Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate- level mathematics.CoRR, abs/2302.12433, 2023. doi: 10 .48550/ARXIV.2302.12433. URL https://doi.org/10.48550/arXiv.2302.12433

  36. [36]

    minif2f: a cross-system benchmark for formal olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv

  37. [37]

    Formalproofbench: Can models write grad- uate level math proofs that are formally verified?CoRR, abs/2603.26996, 2026

    Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, and Langston Nashold. Formalproofbench: Can models write grad- uate level math proofs that are formally verified?CoRR, abs/2603.26996, 2026. doi: 10.48550/ARXIV.2603.26996. URLhttps://doi.org/10.48550/arXiv.2603.26996

  38. [38]

    FATE: A formal benchmark series for frontier algebra of multiple difficulty levels.CoRR, abs/2511.02872, 2025

    Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, and Bin Dong. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels.CoRR, abs/2511.02872, 2025. doi: 10.48550/ARXIV.2511.02872. URLhttps://doi.org/10.48550/arXiv.2511.02872

  39. [39]

    Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026

    Nikil Ravi, Langston Nashold, and Rayan Krishnan. Proofbench: Can models write math proofs that are formally verified? Vals AI, January 2026. URL https://www.vals.ai/benchmarks/ proof_bench. Accessed: 2026-04-29

  40. [40]

    Sorrydb: Can AI provers complete real-world lean theorems?CoRR, abs/2603.02668, 2026

    Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, and Lenny Taelman. Sorrydb: Can AI provers complete real-world lean theorems?CoRR, abs/2603.02668, 2026. doi: 10 .48550/ARXIV.2603.02668. URL https://doi.org/10.48550/ arXiv.2603.02668

  41. [41]

    Taobench: Do automated theorem prover llms generalize beyond mathlib?CoRR, abs/2603.12744, 2026

    Alexander K. Taylor, Junyi Zhang, Ethan Ji, Vigyan Sahai, Haikang Deng, Yuanzhou Chen, Yifan Yuan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Amit Sahai, and Wei Wang. Taobench: Do automated theorem prover llms generalize beyond mathlib?CoRR, abs/2603.12744, 2026. doi: 10 .48550/ARXIV.2603.12744. URL https://doi.org/10.48550/ arXiv.2603.12744

  42. [42]

    Proofoptimizer: Training language models to simplify proofs without human demonstrations.CoRR, abs/2510.15700, 2025

    Alex Gu, Bartosz Piotrowski, Fabian Gloeckle, Kaiyu Yang, and Aram H. Markosyan. Proofoptimizer: Training language models to simplify proofs without human demonstra- tions.CoRR, abs/2510.15700, 2025. doi: 10 .48550/ARXIV.2510.15700. URL https: //doi.org/10.48550/arXiv.2510.15700

  43. [43]

    Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, edi- tors,Advances in Neural ...

  44. [44]

    Wildbench: Benchmarking llms with challenging tasks from real users in the wild

    Bill Yuchen Lin, Yuntian Deng, Khyathi Raghavi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. U...

  45. [45]

    Justice or prejudice? quantifying biases in llm-as-a-judge, 2025

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. ...

  46. [46]

    Self-preference bias in llm-as-a-judge.CoRR, abs/2410.21819, 2024

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. CoRR, abs/2410.21819, 2024. doi: 10 .48550/ARXIV.2410.21819. URL https://doi.org/ 10.48550/arXiv.2410.21819

  47. [47]

    José Pombal, Ricardo Rei, and André F. T. Martins. Self-preference bias in rubric-based evaluation of large language models.CoRR, abs/2604.06996, 2026. doi: 10 .48550/ ARXIV.2604.06996. URLhttps://doi.org/10.48550/arXiv.2604.06996

  48. [48]

    Verbosity bias in preference labeling by large language models.CoRR, abs/2310.10076, 2023

    Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in pref- erence labeling by large language models.CoRR, abs/2310.10076, 2023. doi: 10 .48550/ ARXIV.2310.10076. URLhttps://doi.org/10.48550/arXiv.2310.10076

  49. [49]

    Judging the judges: A systematic study of position bias in llm-as-a-judge

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush V osoughi. Judging the judges: A systematic study of position bias in llm-as-a-judge. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors,Proceedings of the 14th Inter...

  50. [50]

    Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovic, and Martin T. Vechev. Math- arena: Evaluating llms on uncontaminated math competitions.CoRR, abs/2505.23281, 2025. doi: 10.48550/ARXIV.2505.23281. URLhttps://doi.org/10.48550/arXiv.2505.23281

  51. [51]

    Gpt-5.4 thinking system card

    OpenAI. Gpt-5.4 thinking system card. Technical report, OpenAI, March 2026. URL https: //openai.com/index/gpt-5-4-thinking-system-card/. Accessed: 2026-04-09

  52. [52]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI. gpt-oss-120b & gpt-oss-20b model card.CoRR, abs/2508.10925, 2025. doi: 10 .48550/ ARXIV.2508.10925. URLhttps://doi.org/10.48550/arXiv.2508.10925

  53. [53]

    Gemini 3.1 pro model card

    Google. Gemini 3.1 pro model card. Technical report, Google, February 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro- Model-Card.pdf. Accessed: 2026-04-09

  54. [54]

    Gemini 3 flash model card

    Google. Gemini 3 flash model card. Technical report, Google, December 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash- Model-Card.pdf. Accessed: 2026-04-09

  55. [55]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  56. [56]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM. GLM-5: from vibe coding to agentic engineering.CoRR, abs/2602.15763, 2026. doi: 10.48550/ARXIV.2602.15763. URLhttps://doi.org/10.48550/arXiv.2602.15763

  57. [57]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team. Kimi K2.5: visual agentic intelligence.CoRR, abs/2602.02276, 2026. doi: 10.48550/ARXIV.2602.02276. URLhttps://doi.org/10.48550/arXiv.2602.02276

  58. [58]

    Grok 4.1 fast and agent tools api.xAI News, November 2025

    xAI. Grok 4.1 fast and agent tools api.xAI News, November 2025. URL https://x.ai/news/ grok-4-1-fast. Accessed: 2026-04-09

  59. [59]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models.CoRR, abs/2512.02556, 2025. doi: 10 .48550/ARXIV.2512.02556. URL https://doi.org/10.48550/ arXiv.2512.02556

  60. [60]

    Step 3.5 flash: Open frontier-level intelligence with 11b active parameters

    StepFun Team. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters. CoRR, abs/2602.10604, 2026. doi: 10.48550/ARXIV.2602.10604. URL https://doi.org/10.48550/arXiv.2602.10604
