pith. sign in

arxiv: 2605.20531 · v1 · pith:QOTC7V2Xnew · submitted 2026-05-19 · 💻 cs.LO · cs.LG

Pseudo-Formalization for Automatic Proof Verification

Pith reviewed 2026-05-21 06:21 UTC · model grok-4.3

classification 💻 cs.LO cs.LG
keywords Pseudo-FormalizationBlock Verificationproof verificationnatural language proofslarge language modelsmathematical reasoningerror detectionautomatic verification
0
0 comments X

The pith

Natural language proofs can be verified automatically by translating them into self-contained Pseudo-Formal modules checked independently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Pseudo-Formalization to address the difficulty of verifying proofs written in natural language, especially those generated by AI systems. A proof is broken into modules, each containing its own premises, conclusion, and natural language argument. An LLM first converts the original proof into this modular format and then confirms each module on its own through Block Verification. This is tested on olympiad problems and research-level mathematics, where it detects errors with higher precision and recall than methods that judge an entire proof at once. Readers interested in AI-assisted mathematics would care because it offers a middle path between rigid formal languages and unreliable informal checking.

Core claim

The paper claims that decomposing a natural language proof into Pseudo-Formal modules—each stating premises, conclusion, and a natural language proof—allows an LLM to translate the original text and then verify the modules independently, establishing overall correctness more effectively than direct whole-proof judgments by an LLM.

What carries the argument

Pseudo-Formal modules, each a self-contained unit with premises, conclusion, and natural language proof that supports independent verification.

If this is right

  • Error detection improves because issues can be isolated to individual modules instead of requiring re-assessment of the full proof.
  • Verification extends to research-level mathematics where complete formalization remains impractical.
  • Precision and recall for finding proof errors exceed those of direct LLM-as-judge baselines on the tested benchmarks.
  • The released ArxivMathGradingBench supplies a shared test set for comparing future verification approaches on research-level problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Granular module-level feedback could support iterative refinement when training AI systems to generate or correct proofs.
  • The same decomposition strategy might transfer to other structured reasoning tasks such as checking long chains of code or logical arguments.
  • Hybrid systems could combine Pseudo-Formal modules with partial formalization in targeted sections to increase reliability further.

Load-bearing premise

An LLM can translate any natural language proof into accurate Pseudo-Formal modules without omitting or adding critical logical steps.

What would settle it

A natural language proof containing a logical error that the LLM translation into Pseudo-Formal form fails to capture, so that independent module verification still approves the flawed proof.

Figures

Figures reproduced from arXiv: 2605.20531 by Kaiyue Wen, Luke Bailey, Mohammed Abouzaid, Slim Barkallah, Tengyu Ma.

Figure 1
Figure 1. Figure 1: Left: Diagram indicating how Pseudo-Formalization can be used to verify proofs. We translate a natural language proof to Pseudo-Formal representation (Definition 1), and then verify each block (see [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An example Pseudo-Formalized Proof on IMO 2024 P6. The block verifier correctly assigns Lemma 5.1 score 0 because the proof incorrectly strengthens P2 from Lemma 4.2 “S cannot contain both d and −d” to “all nonzero elements of S have the same sign”. 3.1 Step 1: Translation to Pseudo-Formal The translation LLM receives the natural-language proof Π, and is prompted to produce a Pseudo￾Formal proof Πˆ in the … view at source ↗
Figure 3
Figure 3. Figure 3: Per-proof/paper and per-error metrics on [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Precision-recall on Hard2Verify at the whole-proof label level (positive class: proof contains 1 or more incorrect steps). For each k, the verifier’s per-rollout proof-level verdicts are aggregated with pessimistic-min (the proof is pre￾dicted incorrect iff at least one 1 or more steps over the k rollouts is predicted incorrect). Quantitative Results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Pseudo-Formalization (PF), a modular proof representation in which natural-language proofs are decomposed into self-contained modules each stating premises, conclusion, and proof steps. It defines Block Verification (BV) as the procedure of using an LLM to translate a natural-language proof into PF modules and then verifying each module independently. The central empirical claim is that PF+BV Pareto-dominates standard LLM-as-judge baselines on error-finding precision and recall across two benchmarks (olympiad and research-level mathematics), and the authors release the ArxivMathGradingBench dataset to support further work.

Significance. If the soundness of independent module verification for overall proof correctness can be established, the approach would supply a useful intermediate point between fully formal verification and unstructured natural-language judging, potentially easing the bottleneck in training and evaluating AI mathematical reasoners. The public release of a research-level proof-verification benchmark is a concrete positive contribution that enables reproducible follow-up research.

major comments (2)
  1. [Block Verification algorithm description] The description of Block Verification states that each PF module is verified independently, yet provides no mechanism (entailment check, dependency graph validation, or global consistency step) to ensure that the conclusion of module i entails the premises of module i+1. Because the central claim is that local verification suffices to certify the entire proof, the absence of any cross-module chaining verification is load-bearing; an LLM translation error that misaligns adjacent modules would produce locally valid but globally invalid arguments.
  2. [Evaluation and experimental results] The evaluation reports Pareto dominance on precision and recall but supplies no information on benchmark construction, statistical tests, error-bar reporting, or controls for data leakage between training data and the released ArxivMathGradingBench. Without these details the empirical claim cannot be assessed at the level required to support the superiority statement.
minor comments (2)
  1. [Abstract] The abstract uses the term 'pareto-dominates' without specifying the exact set of metrics or the baseline configurations; a short table or explicit statement of the comparison points would improve clarity.
  2. [Pseudo-Formalization definition] Notation for the PF module components (premises, conclusion, proof text) is introduced informally; a compact definition or diagram would reduce ambiguity when readers compare PF to existing intermediate representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments in detail below, indicating the revisions we plan to make.

read point-by-point responses
  1. Referee: [Block Verification algorithm description] The description of Block Verification states that each PF module is verified independently, yet provides no mechanism (entailment check, dependency graph validation, or global consistency step) to ensure that the conclusion of module i entails the premises of module i+1. Because the central claim is that local verification suffices to certify the entire proof, the absence of any cross-module chaining verification is load-bearing; an LLM translation error that misaligns adjacent modules would produce locally valid but globally invalid arguments.

    Authors: We agree that explicitly addressing the chaining between modules strengthens the presentation. In the Pseudo-Formalization approach, the translation to PF modules is designed such that the premises of each subsequent module are directly based on the conclusions of the prior ones, as part of the decomposition process. The independent verification then confirms that within each module, the proof steps correctly entail the conclusion from the stated premises. While we do not include an automated cross-module entailment check in the current BV algorithm, this is because the modularity assumes faithful translation; any misalignment would be a failure of the translation step rather than the verification per se. To clarify this, we will revise the manuscript to include a dedicated subsection explaining the assumptions underlying independent verification and the role of the translation in maintaining global structure. We will also add a discussion of potential failure modes due to translation errors and how they might be mitigated. revision: partial

  2. Referee: [Evaluation and experimental results] The evaluation reports Pareto dominance on precision and recall but supplies no information on benchmark construction, statistical tests, error-bar reporting, or controls for data leakage between training data and the released ArxivMathGradingBench. Without these details the empirical claim cannot be assessed at the level required to support the superiority statement.

    Authors: We acknowledge the need for greater transparency in the experimental setup to allow proper assessment of the results. In the revised manuscript, we will expand the evaluation section to provide: (1) a detailed description of how the ArxivMathGradingBench was constructed, including the selection criteria for proofs and error annotations; (2) statistical tests (e.g., paired t-tests or Wilcoxon tests) comparing PF+BV to baselines, along with p-values; (3) error bars or confidence intervals on the precision and recall metrics, computed over multiple runs or bootstrap samples; and (4) explicit controls for data leakage, such as verification that the benchmark proofs were not part of the LLM training data and use of held-out splits where applicable. These additions will support the empirical claims more rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of new format on external benchmarks

full rationale

The paper defines Pseudo-Formalization (PF) as a modular natural-language proof format and Block Verification (BV) as independent per-module checking after LLM translation. It then reports empirical results on two benchmarks (including the newly released ArxivMathGradingBench), showing Pareto dominance over LLM-as-judge baselines on error-finding precision and recall. No equations, fitted parameters, or self-citations appear in the derivation; the central claim rests on external LLM capabilities and released test data rather than reducing to self-definition or internal fitting. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the approach implicitly assumes reliable LLM translation and modular independence without detailing supporting evidence or assumptions from prior literature.

pith-pipeline@v0.9.0 · 5721 in / 1137 out tokens · 36208 ms · 2026-05-21T06:21:24.419987+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 1 internal anchor

  1. [1]

    Learning to give checkable answers with prover-verifier games.arXiv preprint arXiv:2108.12099, 2021

    Cem Anil, Guodong Zhang, Yuhuai Wu, and Roger Grosse. Learning to give checkable answers with prover-verifier games.arXiv preprint arXiv:2108.12099, 2021

  2. [2]

    Towards a Lean proof of Fermat’s Last Theorem.https: //imperialcollegelondon.github.io/FLT/blueprint.pdf, 2024

    Kevin Buzzard and Richard Taylor. Towards a Lean proof of Fermat’s Last Theorem.https: //imperialcollegelondon.github.io/FLT/blueprint.pdf, 2024

  3. [3]

    Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K. Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning.arXiv preprint arXiv:2504.19162, 2025

  4. [4]

    Power-free palindromes and reversed primes: S

    Shashi Chourasiya and Daniel R Johnston. Power-free palindromes and reversed primes: S. chourasiya, dr johnston.Monatshefte für Mathematik, pages 1–17, 2025

  5. [5]

    Escaping the cognitive well: Efficient competition math with off-the-shelf models,

    Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, and Sanjeev Arora. Escaping the cognitive well: Efficient competition math with off-the-shelf models,

  6. [6]

    URLhttps://arxiv.org/abs/2602.16793

  7. [7]

    The lean theorem prover (system description)

    Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InInternational Conference on Automated Deduction, pages 378–388. Springer, 2015

  8. [8]

    Finite codimension stability of invariant surfaces.arXiv preprint arXiv:2502.00898, 2025

    Giovanni Forni. Finite codimension stability of invariant surfaces.arXiv preprint arXiv:2502.00898, 2025

  9. [9]

    Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback

    Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Tianyu Liu, and Baobao Chang. Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. InFindings of the Association for Computational Linguistics: ACL 2025, pages 14588–14604. Associati...

  10. [10]

    Formal proof – theory and practice.Notices of the American Mathematical Society, 55:1395–1406, 2008

    John Harrison. Formal proof – theory and practice.Notices of the American Mathematical Society, 55:1395–1406, 2008

  11. [11]

    Context rot: How increasing input tokens impacts llm performance.URL https://research

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance.URL https://research. trychroma. com/context-rot, retrieved October, 20: 2025, 2025

  12. [12]

    Pessimistic verification for open ended math questions.arXiv preprint arXiv:2511.21522, 2025

    Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, and Yang Liu. Pessimistic verification for open ended math questions.arXiv preprint arXiv:2511.21522, 2025

  13. [13]

    Winning gold at imo 2025 with a model-agnostic verification- and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

    Yichen Huang and Lin F Yang. Winning gold at imo 2025 with a model-agnostic verification- and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

  14. [14]

    Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

    Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

  15. [15]

    PrimeNumberTheoremAnd: A formalization of the prime number theorem and related results in Lean 4

    Alex Kontorovich and Terence Tao. PrimeNumberTheoremAnd: A formalization of the prime number theorem and related results in Lean 4. https://github.com/ AlexKontorovich/PrimeNumberTheoremAnd, 2024

  16. [16]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024. 12

  17. [17]

    Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, et al. Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

  18. [18]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

  19. [19]

    Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards

    Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, and Dong Yu. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. InProceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025

  20. [20]

    Towards robust mathematical reasoning

    Minh-Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, et al. Towards robust mathematical reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, 2025

  21. [21]

    Scaling generative verifiers for natural language mathematical proof verification and selection.arXiv preprint arXiv:2511.13027,

    Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, and Igor Gitman. Scaling generative ver- ifiers for natural language mathematical proof verification and selection.arXiv preprint arXiv:2511.13027, 2025

  22. [22]

    The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023

    William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023

  23. [23]

    The lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

  24. [24]

    Springer, 2002

    Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson.Isabelle/HOL: a proof assistant for higher-order logic. Springer, 2002

  25. [25]

    Hard2verify: A step-level verification benchmark for open-ended frontier math.arXiv preprint arXiv:2510.13744, 2025

    Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, and Shafiq Joty. Hard2verify: A step-level verification benchmark for open-ended frontier math.arXiv preprint arXiv:2510.13744, 2025

  26. [26]

    Real algebraic surfaces biholomorphically equivalent but not algebraically equivalent.arXiv preprint arXiv:2503.08957, 2025

    Guillaume Rond. Real algebraic surfaces biholomorphically equivalent but not algebraically equivalent.arXiv preprint arXiv:2503.08957, 2025

  27. [27]

    Liquid tensor experiment.Experimental Mathematics, 31(2):349–354, 2022

    Peter Scholze. Liquid tensor experiment.Experimental Mathematics, 31(2):349–354, 2022

  28. [28]

    Deepseekmath-v2: Towards self-verifiable mathematical reasoning

    Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025

  29. [29]

    Heimdall: test-time scaling on the generative verification.arXiv preprint arXiv:2504.10337, 2025

    Wenlei Shi and Xing Jin. Heimdall: test-time scaling on the generative verification.arXiv preprint arXiv:2504.10337, 2025

  30. [30]

    A promising path towards autoformalization and general artificial intelli- gence

    Christian Szegedy. A promising path towards autoformalization and general artificial intelli- gence. InInternational Conference on Intelligent Computer Mathematics, pages 3–20. Springer, 2020

  31. [31]

    From solving to verifying: A unified objective for robust reasoning in llms, 2025

    Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi, Xia Chen, and Baosheng He. From solving to verifying: A unified objective for robust reasoning in llms, 2025. URL https://arxiv.org/abs/2511.15137. 13

  32. [32]

    Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

    Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

  33. [33]

    Pencil: Long thoughts with short memory, 2025

    Chenxiao Yang, Nathan Srebro, David McAllester, and Zhiyuan Li. Pencil: Long thoughts with short memory, 2025. URLhttps://arxiv.org/abs/2503.14337

  34. [34]

    Recursive models for long-horizon reasoning

    Chenxiao Yang, Nathan Srebro, and Zhiyuan Li. Recursive models for long-horizon reasoning. arXiv preprint arXiv:2603.02112, 2026

  35. [35]

    Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

    Weihao Zeng, Yuzhen Huang, and Junxian He. Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

  36. [36]

    arXiv:2408.15240

    Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. InProceedings of the International Conference on Learning Representations (ICLR), 2025. URL https://arxiv. org/abs/2408.15240

  37. [37]

    Sample, scrutinize and scale: Effective inference-time search by scaling verification.arXiv preprint arXiv:2502.01839, 2025

    Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. Sample, scrutinize and scale: Effective inference-time search by scaling verification.arXiv preprint arXiv:2502.01839, 2025

  38. [38]

    Processbench: Identifying process errors in mathematical reasoning

    Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. InThe 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  39. [39]

    Leanarchitect: Automating blueprint generation for humans and ai, 2026

    Thomas Zhu, Pietro Monticone, Jeremy Avigad, and Sean Welleck. Leanarchitect: Automating blueprint generation for humans and ai, 2026. URL https://arxiv.org/abs/2601. 22554. 14 A Context cost of verification: formal statement and proof Definition 2(Good Pseudo-Formal proof).Agood Pseudo-Formal proofis a Pseudo-Formal proof (Definition 1) with the followin...

  40. [40]

    The depth of T is D<n

  41. [41]

    Definition 3(Block-wise verification algorithm).Let P= (G , T) be a Pseudo-Formal proof with node set V={v 1,

    The lengths of the conclusion c, the premises, and the proof are bounded above by L. Definition 3(Block-wise verification algorithm).Let P= (G , T) be a Pseudo-Formal proof with node set V={v 1, . . . ,vn}, where each node is written as v= (P v,c v,π v). For a node v∈V , let AncT(v) denote its ancestors in the scope-inheritance forest T, and define its de...

  42. [42]

    1"> Assumptions / Conditions / Definitions. - ... Statement : ... </PROPOSITION_STATEMENT> <LEMMA_STATEMENT id=

    Lemmas - Do not introduce any deeper hierarchy. - If a deeper tree seems natural, flatten it into a sequential list of lemmas inside the relevant proposition. - The theorem proof may only cite propositions; a proposition proof may only cite its own lemmas and the statements of earlier propositions. - No trivial decompositions: do not decompose the theorem...

  43. [43]

    These are often the parent or ancestor statements of the Assertion, and can be the same as the global theorem

    **Contexts**: A sequence of statements from which the Assertion may or may not inherit definitions, assumptions, or conditions. These are often the parent or ancestor statements of the Assertion, and can be the same as the global theorem. They are provided solely so you can understand the definitions and assumptions of the Assertion. They have NOT been ve...

  44. [44]

    You may assume all established results are correct and use them freely - do NOT re-verify them

    **Established Results **: Statements that have already been verified or can be assumed to be correct. You may assume all established results are correct and use them freely - do NOT re-verify them. The proof of the Assertion can invoke these results as long as the assumptions are properly justified and the definitions are consistent

  45. [45]

    **Assertion**: The specific statement whose proof you must verify

  46. [46]

    verdict":

    **Proposed Proof **: The proof of the Assertion to verify. Instructions: - Verify ONLY the proposed proof of the Assertion. - Read the Assertion carefully and analyze the proof step by step. - Identify any incorrect, unjustified, or logically invalid reasoning. - Pay close attention to potentially confusing or ambiguous interpretations of concepts. - When...

  47. [47]

    Math Problem: The original problem statement. 20

  48. [48]

    Original Solution Steps: The solver’s original solution split into indexed steps, written as: <step>[0] ...</step> <step>[1] ...</step> and so on

  49. [49]

    This rewrite was produced automatically and may contain artifacts that are not present in the original solution

    Rewritten Proof: A structured rewrite of the original solution, decomposed into theorem / proposition / lemma / claim / fact blocks. This rewrite was produced automatically and may contain artifacts that are not present in the original solution

  50. [50]

    These may or may not be genuine errors in the original solution

    Potential Errors: A list of potential errors flagged by automated verification of the rewritten proof. These may or may not be genuine errors in the original solution. Evaluation process:

  51. [51]

    For each flagged issue, decide whether it is a genuine mathematical error in the ORIGINAL solution or a false alarm caused by the rewrite

    Validate each potential error. For each flagged issue, decide whether it is a genuine mathematical error in the ORIGINAL solution or a false alarm caused by the rewrite. Cross-check against the original solution steps. If the original solution contains the missing reasoning, a stronger equivalent argument, or a correct alternative route, do not count the ...

  52. [52]

    The potential errors are useful leads, but your final output must be based on the original solution steps

    Inspect the original steps as needed. The potential errors are useful leads, but your final output must be based on the original solution steps. Check any original steps needed to identify the first genuine incorrect step. If you find that the first incorrect step was not covered by any potential error, report it in <additional_errors>

  53. [53]

    For each original step, determine whether that step is correct or incorrect

    Grade the original steps, not the rewrite. For each original step, determine whether that step is correct or incorrect. - A correct step is one where all mathematical content is correct and logically consistent with the problem and all previous correct steps. - An incorrect step is one that contains a mathematical error, an unjustified load- bearing claim...

  54. [54]

    corrected typos in the proof of Theorem 3

    Identify the first incorrect step. The first incorrect step is the smallest original step index whose content is mathematically incorrect or whose reasoning first depends on an error. If every original step is correct, output -1. Important calibration rules: - The original solution steps are authoritative for the final verdict. - The rewritten proof and p...

  55. [55]

    Use this to read the paper as a human reader would, with all theorem/lemma/proposition numbers rendered

    The rendered PDF (attached as a file). Use this to read the paper as a human reader would, with all theorem/lemma/proposition numbers rendered

  56. [56]

    Theorem 19

    The raw LaTeX source (included below). Use this to inspect precise notation, equations, and label names if needed. Your task is to identify any mathematical errors in the paper that would require revision before publication. Focus on errors of mathematical substance: incorrect proofs, unjustified steps, false claims, gaps in reasoning, miscomputed quantit...

  57. [57]

    Use it to read the paper as a human would and to identify the rendered numerical or letter labels of every theorem/lemma/proposition/corollary/claim

    The rendered PDF of the paper (attached as a file). Use it to read the paper as a human would and to identify the rendered numerical or letter labels of every theorem/lemma/proposition/corollary/claim

  58. [58]

    Use it to inspect precise notation, equations, and the exact wording of each statement and proof

    The raw LaTeX source of the paper (included below). Use it to inspect precise notation, equations, and the exact wording of each statement and proof. Structure: - The top level of the rewrite is a list of **theorems**. Use one <THEOREM_STATEMENT id="N"> block per top-level result the paper proves. Top-level results are the paper’s flagship theorem(s) plus...

  59. [59]

    Propositions (shared pool, numbered globally across all theorems)

  60. [60]

    N"> may cite any of the propositions; a <PROPOSITION_PROOF id=

    Lemmas (numbered within each proposition) - Do not introduce any deeper hierarchy. If a deeper tree seems natural, flatten it into a sequential list of lemmas inside the relevant proposition. - Each <THEOREM_PROOF id="N"> may cite any of the propositions; a <PROPOSITION_PROOF id="K"> may cite its own lemmas and the statements of earlier propositions. A th...

  61. [61]

    The rendered PDF of the original paper (attached as a file)

  62. [62]

    The raw LaTeX source of the paper (included below -- ground truth -- your rewrite must faithfully represent this)

  63. [63]

    Your previous rewritten paper, which contained faithfulness errors

  64. [64]

    Requirements for the new rewrite: - Fix every issue listed in the Identified Errors

    The specific discrepancies flagged by the checker. Requirements for the new rewrite: - Fix every issue listed in the Identified Errors. For each issue, make sure the new rewrite no longer deviates from the original paper in that way. - Do NOT introduce new discrepancies: do not strengthen/weaken claims, omit steps, add arguments, drift in notation, or mis...

  65. [65]

    **Original Paper (LaTeX) **: The full LaTeX source of the original paper

  66. [66]

    These are provided so you can understand the scope of the Assertion

    **Contexts**: Statements from the rewritten paper that the Assertion inherits definitions or assumptions from (e.g., the enclosing theorem statement, the enclosing proposition statement). These are provided so you can understand the scope of the Assertion

  67. [67]

    You may use them as reference points

    **Established Results **: Statements from the rewritten paper that have already been checked and can be assumed to be faithfully rewritten. You may use them as reference points. 28

  68. [68]

    **Assertion**: The specific rewritten statement whose faithfulness you must verify

  69. [69]

    Your task is to determine whether the Assertion and its Proposed Proof **faithfully represent** the corresponding part of the Original Paper

    **Proposed Proof **: The rewritten proof of the Assertion (may be empty for a top- level theorem statement). Your task is to determine whether the Assertion and its Proposed Proof **faithfully represent** the corresponding part of the Original Paper. Check for:

  70. [70]

    **Strengthened or weakened claims **: Does the Assertion claim more or less than the original paper establishes at the corresponding point?

  71. [71]

    **Omitted content **: Does the Proposed Proof drop a non-trivial argument that appears in the original paper?

  72. [72]

    **Added content **: Does the Proposed Proof introduce new arguments, repairs, or proof ideas not present in the original paper?

  73. [73]

    **Notation drift **: Are variables, functions, or definitions used differently than in the original paper?

  74. [74]

    **Misinterpretation**: Does the Assertion or Proposed Proof misunderstand the original’s reasoning or logical structure?

  75. [75]

    verdict":

    **Scope errors **: Are assumptions incorrectly inherited, dropped, or added compared to the original? Instructions: - Do NOT judge whether the original paper is mathematically correct. Your sole task is faithfulness. - Only flag changes that alter mathematical meaning. Cosmetic rephrasing is fine. - Use the Contexts to understand what the Assertion is all...