Pseudo-Formalization for Automatic Proof Verification

Kaiyue Wen; Luke Bailey; Mohammed Abouzaid; Slim Barkallah; Tengyu Ma

arxiv: 2605.20531 · v1 · pith:QOTC7V2Xnew · submitted 2026-05-19 · 💻 cs.LO · cs.LG

Pseudo-Formalization for Automatic Proof Verification

Slim Barkallah , Luke Bailey , Kaiyue Wen , Mohammed Abouzaid , Tengyu Ma This is my paper

Pith reviewed 2026-05-21 06:21 UTC · model grok-4.3

classification 💻 cs.LO cs.LG

keywords Pseudo-FormalizationBlock Verificationproof verificationnatural language proofslarge language modelsmathematical reasoningerror detectionautomatic verification

0 comments

The pith

Natural language proofs can be verified automatically by translating them into self-contained Pseudo-Formal modules checked independently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Pseudo-Formalization to address the difficulty of verifying proofs written in natural language, especially those generated by AI systems. A proof is broken into modules, each containing its own premises, conclusion, and natural language argument. An LLM first converts the original proof into this modular format and then confirms each module on its own through Block Verification. This is tested on olympiad problems and research-level mathematics, where it detects errors with higher precision and recall than methods that judge an entire proof at once. Readers interested in AI-assisted mathematics would care because it offers a middle path between rigid formal languages and unreliable informal checking.

Core claim

The paper claims that decomposing a natural language proof into Pseudo-Formal modules—each stating premises, conclusion, and a natural language proof—allows an LLM to translate the original text and then verify the modules independently, establishing overall correctness more effectively than direct whole-proof judgments by an LLM.

What carries the argument

Pseudo-Formal modules, each a self-contained unit with premises, conclusion, and natural language proof that supports independent verification.

If this is right

Error detection improves because issues can be isolated to individual modules instead of requiring re-assessment of the full proof.
Verification extends to research-level mathematics where complete formalization remains impractical.
Precision and recall for finding proof errors exceed those of direct LLM-as-judge baselines on the tested benchmarks.
The released ArxivMathGradingBench supplies a shared test set for comparing future verification approaches on research-level problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Granular module-level feedback could support iterative refinement when training AI systems to generate or correct proofs.
The same decomposition strategy might transfer to other structured reasoning tasks such as checking long chains of code or logical arguments.
Hybrid systems could combine Pseudo-Formal modules with partial formalization in targeted sections to increase reliability further.

Load-bearing premise

An LLM can translate any natural language proof into accurate Pseudo-Formal modules without omitting or adding critical logical steps.

What would settle it

A natural language proof containing a logical error that the LLM translation into Pseudo-Formal form fails to capture, so that independent module verification still approves the flawed proof.

Figures

Figures reproduced from arXiv: 2605.20531 by Kaiyue Wen, Luke Bailey, Mohammed Abouzaid, Slim Barkallah, Tengyu Ma.

**Figure 1.** Figure 1: Left: Diagram indicating how Pseudo-Formalization can be used to verify proofs. We translate a natural language proof to Pseudo-Formal representation (Definition 1), and then verify each block (see [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: An example Pseudo-Formalized Proof on IMO 2024 P6. The block verifier correctly assigns Lemma 5.1 score 0 because the proof incorrectly strengthens P2 from Lemma 4.2 “S cannot contain both d and −d” to “all nonzero elements of S have the same sign”. 3.1 Step 1: Translation to Pseudo-Formal The translation LLM receives the natural-language proof Π, and is prompted to produce a PseudoFormal proof Πˆ in the … view at source ↗

**Figure 3.** Figure 3: Per-proof/paper and per-error metrics on [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Precision-recall on Hard2Verify at the whole-proof label level (positive class: proof contains 1 or more incorrect steps). For each k, the verifier’s per-rollout proof-level verdicts are aggregated with pessimistic-min (the proof is predicted incorrect iff at least one 1 or more steps over the k rollouts is predicted incorrect). Quantitative Results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PF+BV gives a workable middle step between raw NL proofs and full formalization, with better error spotting than plain LLM judges on the reported benchmarks, but the chaining between modules looks like an unaddressed weak point.

read the letter

The main takeaway is that this paper gives a concrete way to split a natural-language proof into self-contained modules, each with its own premises, conclusion, and sketch, then lets an LLM check the modules one by one. They call the format Pseudo-Formalization and the checking procedure Block Verification. On the two benchmarks they describe, the combination beats direct LLM-as-judge baselines on precision and recall for spotting errors, and they release the research-level ArxivMathGradingBench to help others test similar ideas. That release and the explicit modular format are the clearest new pieces; the rest builds on existing LLM translation and verification work but packages it differently for this use case. The approach is practical for settings where full Lean formalization is still too expensive or brittle, and the benchmarks span both olympiad problems and harder research-level statements, which is a reasonable scope. The central assumption is that an accurate translation into these modules plus independent local checks is enough to certify the whole proof. The stress-test note is right to flag that this leaves inter-module links unchecked: nothing in the description shows an explicit step that verifies whether the conclusion of one module actually supports the premises of the next, or that the overall dependency graph is sound. If the initial LLM translation drops or misaligns a logical connection, locally valid modules could still add up to an invalid argument. The abstract also gives no numbers on benchmark construction, variance, or leakage checks, so the pareto-dominance claim is hard to weigh without those details. Readers working on scalable verification for AI math systems or on hybrid formal-informal pipelines would find the benchmark and the modular idea useful to build on. The work shows clear engagement with the verification bottleneck and ships a reusable resource, so it is worth sending to referees even though the chaining issue and evaluation transparency will need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces Pseudo-Formalization (PF), a modular proof representation in which natural-language proofs are decomposed into self-contained modules each stating premises, conclusion, and proof steps. It defines Block Verification (BV) as the procedure of using an LLM to translate a natural-language proof into PF modules and then verifying each module independently. The central empirical claim is that PF+BV Pareto-dominates standard LLM-as-judge baselines on error-finding precision and recall across two benchmarks (olympiad and research-level mathematics), and the authors release the ArxivMathGradingBench dataset to support further work.

Significance. If the soundness of independent module verification for overall proof correctness can be established, the approach would supply a useful intermediate point between fully formal verification and unstructured natural-language judging, potentially easing the bottleneck in training and evaluating AI mathematical reasoners. The public release of a research-level proof-verification benchmark is a concrete positive contribution that enables reproducible follow-up research.

major comments (2)

[Block Verification algorithm description] The description of Block Verification states that each PF module is verified independently, yet provides no mechanism (entailment check, dependency graph validation, or global consistency step) to ensure that the conclusion of module i entails the premises of module i+1. Because the central claim is that local verification suffices to certify the entire proof, the absence of any cross-module chaining verification is load-bearing; an LLM translation error that misaligns adjacent modules would produce locally valid but globally invalid arguments.
[Evaluation and experimental results] The evaluation reports Pareto dominance on precision and recall but supplies no information on benchmark construction, statistical tests, error-bar reporting, or controls for data leakage between training data and the released ArxivMathGradingBench. Without these details the empirical claim cannot be assessed at the level required to support the superiority statement.

minor comments (2)

[Abstract] The abstract uses the term 'pareto-dominates' without specifying the exact set of metrics or the baseline configurations; a short table or explicit statement of the comparison points would improve clarity.
[Pseudo-Formalization definition] Notation for the PF module components (premises, conclusion, proof text) is introduced informally; a compact definition or diagram would reduce ambiguity when readers compare PF to existing intermediate representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments in detail below, indicating the revisions we plan to make.

read point-by-point responses

Referee: [Block Verification algorithm description] The description of Block Verification states that each PF module is verified independently, yet provides no mechanism (entailment check, dependency graph validation, or global consistency step) to ensure that the conclusion of module i entails the premises of module i+1. Because the central claim is that local verification suffices to certify the entire proof, the absence of any cross-module chaining verification is load-bearing; an LLM translation error that misaligns adjacent modules would produce locally valid but globally invalid arguments.

Authors: We agree that explicitly addressing the chaining between modules strengthens the presentation. In the Pseudo-Formalization approach, the translation to PF modules is designed such that the premises of each subsequent module are directly based on the conclusions of the prior ones, as part of the decomposition process. The independent verification then confirms that within each module, the proof steps correctly entail the conclusion from the stated premises. While we do not include an automated cross-module entailment check in the current BV algorithm, this is because the modularity assumes faithful translation; any misalignment would be a failure of the translation step rather than the verification per se. To clarify this, we will revise the manuscript to include a dedicated subsection explaining the assumptions underlying independent verification and the role of the translation in maintaining global structure. We will also add a discussion of potential failure modes due to translation errors and how they might be mitigated. revision: partial
Referee: [Evaluation and experimental results] The evaluation reports Pareto dominance on precision and recall but supplies no information on benchmark construction, statistical tests, error-bar reporting, or controls for data leakage between training data and the released ArxivMathGradingBench. Without these details the empirical claim cannot be assessed at the level required to support the superiority statement.

Authors: We acknowledge the need for greater transparency in the experimental setup to allow proper assessment of the results. In the revised manuscript, we will expand the evaluation section to provide: (1) a detailed description of how the ArxivMathGradingBench was constructed, including the selection criteria for proofs and error annotations; (2) statistical tests (e.g., paired t-tests or Wilcoxon tests) comparing PF+BV to baselines, along with p-values; (3) error bars or confidence intervals on the precision and recall metrics, computed over multiple runs or bootstrap samples; and (4) explicit controls for data leakage, such as verification that the benchmark proofs were not part of the LLM training data and use of held-out splits where applicable. These additions will support the empirical claims more rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of new format on external benchmarks

full rationale

The paper defines Pseudo-Formalization (PF) as a modular natural-language proof format and Block Verification (BV) as independent per-module checking after LLM translation. It then reports empirical results on two benchmarks (including the newly released ArxivMathGradingBench), showing Pareto dominance over LLM-as-judge baselines on error-finding precision and recall. No equations, fitted parameters, or self-citations appear in the derivation; the central claim rests on external LLM capabilities and released test data rather than reducing to self-definition or internal fitting. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the approach implicitly assumes reliable LLM translation and modular independence without detailing supporting evidence or assumptions from prior literature.

pith-pipeline@v0.9.0 · 5721 in / 1137 out tokens · 36208 ms · 2026-05-21T06:21:24.419987+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A Pseudo-Formal proof is a sequence of modules V={v_i = (P_i, c_i, π_i)} equipped with a directed acyclic graph (DAG) G for the invocation dependency and a forest T for scope inheritance
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (Context cost of block-wise verification): ... O(n) calls to T1 with context length independent of n

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 1 internal anchor

[1]

Learning to give checkable answers with prover-verifier games.arXiv preprint arXiv:2108.12099, 2021

Cem Anil, Guodong Zhang, Yuhuai Wu, and Roger Grosse. Learning to give checkable answers with prover-verifier games.arXiv preprint arXiv:2108.12099, 2021

work page arXiv 2021
[2]

Towards a Lean proof of Fermat’s Last Theorem.https: //imperialcollegelondon.github.io/FLT/blueprint.pdf, 2024

Kevin Buzzard and Richard Taylor. Towards a Lean proof of Fermat’s Last Theorem.https: //imperialcollegelondon.github.io/FLT/blueprint.pdf, 2024

work page 2024
[3]

Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K. Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning.arXiv preprint arXiv:2504.19162, 2025

work page arXiv 2025
[4]

Power-free palindromes and reversed primes: S

Shashi Chourasiya and Daniel R Johnston. Power-free palindromes and reversed primes: S. chourasiya, dr johnston.Monatshefte für Mathematik, pages 1–17, 2025

work page 2025
[5]

Escaping the cognitive well: Efficient competition math with off-the-shelf models,

Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, and Sanjeev Arora. Escaping the cognitive well: Efficient competition math with off-the-shelf models,

work page
[6]

URLhttps://arxiv.org/abs/2602.16793

work page arXiv
[7]

The lean theorem prover (system description)

Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InInternational Conference on Automated Deduction, pages 378–388. Springer, 2015

work page 2015
[8]

Finite codimension stability of invariant surfaces.arXiv preprint arXiv:2502.00898, 2025

Giovanni Forni. Finite codimension stability of invariant surfaces.arXiv preprint arXiv:2502.00898, 2025

work page arXiv 2025
[9]

Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback

Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Tianyu Liu, and Baobao Chang. Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. InFindings of the Association for Computational Linguistics: ACL 2025, pages 14588–14604. Associati...

work page 2025
[10]

Formal proof – theory and practice.Notices of the American Mathematical Society, 55:1395–1406, 2008

John Harrison. Formal proof – theory and practice.Notices of the American Mathematical Society, 55:1395–1406, 2008

work page 2008
[11]

Context rot: How increasing input tokens impacts llm performance.URL https://research

Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance.URL https://research. trychroma. com/context-rot, retrieved October, 20: 2025, 2025

work page 2025
[12]

Pessimistic verification for open ended math questions.arXiv preprint arXiv:2511.21522, 2025

Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, and Yang Liu. Pessimistic verification for open ended math questions.arXiv preprint arXiv:2511.21522, 2025

work page arXiv 2025
[13]

Winning gold at imo 2025 with a model-agnostic verification- and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

Yichen Huang and Lin F Yang. Winning gold at imo 2025 with a model-agnostic verification- and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

work page arXiv 2025
[14]

Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

work page arXiv 2024
[15]

PrimeNumberTheoremAnd: A formalization of the prime number theorem and related results in Lean 4

Alex Kontorovich and Terence Tao. PrimeNumberTheoremAnd: A formalization of the prime number theorem and related results in Lean 4. https://github.com/ AlexKontorovich/PrimeNumberTheoremAnd, 2024

work page 2024
[16]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024. 12

work page 2024
[17]

Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, et al. Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

work page arXiv 2025
[18]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

work page internal anchor Pith review arXiv 2025
[19]

Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, and Dong Yu. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. InProceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025

work page 2025
[20]

Towards robust mathematical reasoning

Minh-Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, et al. Towards robust mathematical reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, 2025

work page 2025
[21]

Scaling generative verifiers for natural language mathematical proof verification and selection.arXiv preprint arXiv:2511.13027,

Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, and Igor Gitman. Scaling generative ver- ifiers for natural language mathematical proof verification and selection.arXiv preprint arXiv:2511.13027, 2025

work page arXiv 2025
[22]

The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023

work page arXiv 2023
[23]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

work page 2021
[24]

Springer, 2002

Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson.Isabelle/HOL: a proof assistant for higher-order logic. Springer, 2002

work page 2002
[25]

Hard2verify: A step-level verification benchmark for open-ended frontier math.arXiv preprint arXiv:2510.13744, 2025

Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, and Shafiq Joty. Hard2verify: A step-level verification benchmark for open-ended frontier math.arXiv preprint arXiv:2510.13744, 2025

work page arXiv 2025
[26]

Real algebraic surfaces biholomorphically equivalent but not algebraically equivalent.arXiv preprint arXiv:2503.08957, 2025

Guillaume Rond. Real algebraic surfaces biholomorphically equivalent but not algebraically equivalent.arXiv preprint arXiv:2503.08957, 2025

work page arXiv 2025
[27]

Liquid tensor experiment.Experimental Mathematics, 31(2):349–354, 2022

Peter Scholze. Liquid tensor experiment.Experimental Mathematics, 31(2):349–354, 2022

work page 2022
[28]

Deepseekmath-v2: Towards self-verifiable mathematical reasoning

Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025

work page arXiv 2025
[29]

Heimdall: test-time scaling on the generative verification.arXiv preprint arXiv:2504.10337, 2025

Wenlei Shi and Xing Jin. Heimdall: test-time scaling on the generative verification.arXiv preprint arXiv:2504.10337, 2025

work page arXiv 2025
[30]

A promising path towards autoformalization and general artificial intelli- gence

Christian Szegedy. A promising path towards autoformalization and general artificial intelli- gence. InInternational Conference on Intelligent Computer Mathematics, pages 3–20. Springer, 2020

work page 2020
[31]

From solving to verifying: A unified objective for robust reasoning in llms, 2025

Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi, Xia Chen, and Baosheng He. From solving to verifying: A unified objective for robust reasoning in llms, 2025. URL https://arxiv.org/abs/2511.15137. 13

work page arXiv 2025
[32]

Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

work page 2022
[33]

Pencil: Long thoughts with short memory, 2025

Chenxiao Yang, Nathan Srebro, David McAllester, and Zhiyuan Li. Pencil: Long thoughts with short memory, 2025. URLhttps://arxiv.org/abs/2503.14337

work page arXiv 2025
[34]

Recursive models for long-horizon reasoning

Chenxiao Yang, Nathan Srebro, and Zhiyuan Li. Recursive models for long-horizon reasoning. arXiv preprint arXiv:2603.02112, 2026

work page arXiv 2026
[35]

Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

Weihao Zeng, Yuzhen Huang, and Junxian He. Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

work page arXiv 2026
[36]

arXiv:2408.15240

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. InProceedings of the International Conference on Learning Representations (ICLR), 2025. URL https://arxiv. org/abs/2408.15240

work page arXiv 2025
[37]

Sample, scrutinize and scale: Effective inference-time search by scaling verification.arXiv preprint arXiv:2502.01839, 2025

Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. Sample, scrutinize and scale: Effective inference-time search by scaling verification.arXiv preprint arXiv:2502.01839, 2025

work page arXiv 2025
[38]

Processbench: Identifying process errors in mathematical reasoning

Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. InThe 63rd Annual Meeting of the Association for Computational Linguistics, 2025

work page 2025
[39]

Leanarchitect: Automating blueprint generation for humans and ai, 2026

Thomas Zhu, Pietro Monticone, Jeremy Avigad, and Sean Welleck. Leanarchitect: Automating blueprint generation for humans and ai, 2026. URL https://arxiv.org/abs/2601. 22554. 14 A Context cost of verification: formal statement and proof Definition 2(Good Pseudo-Formal proof).Agood Pseudo-Formal proofis a Pseudo-Formal proof (Definition 1) with the followin...

work page 2026
[40]

The depth of T is D<n

work page
[41]

Definition 3(Block-wise verification algorithm).Let P= (G , T) be a Pseudo-Formal proof with node set V={v 1,

The lengths of the conclusion c, the premises, and the proof are bounded above by L. Definition 3(Block-wise verification algorithm).Let P= (G , T) be a Pseudo-Formal proof with node set V={v 1, . . . ,vn}, where each node is written as v= (P v,c v,π v). For a node v∈V , let AncT(v) denote its ancestors in the scope-inheritance forest T, and define its de...

work page 2026
[42]

1"> Assumptions / Conditions / Definitions. - ... Statement : ... </PROPOSITION_STATEMENT> <LEMMA_STATEMENT id=

Lemmas - Do not introduce any deeper hierarchy. - If a deeper tree seems natural, flatten it into a sequential list of lemmas inside the relevant proposition. - The theorem proof may only cite propositions; a proposition proof may only cite its own lemmas and the statements of earlier propositions. - No trivial decompositions: do not decompose the theorem...

work page
[43]

These are often the parent or ancestor statements of the Assertion, and can be the same as the global theorem

**Contexts**: A sequence of statements from which the Assertion may or may not inherit definitions, assumptions, or conditions. These are often the parent or ancestor statements of the Assertion, and can be the same as the global theorem. They are provided solely so you can understand the definitions and assumptions of the Assertion. They have NOT been ve...

work page
[44]

You may assume all established results are correct and use them freely - do NOT re-verify them

**Established Results **: Statements that have already been verified or can be assumed to be correct. You may assume all established results are correct and use them freely - do NOT re-verify them. The proof of the Assertion can invoke these results as long as the assumptions are properly justified and the definitions are consistent

work page
[45]

**Assertion**: The specific statement whose proof you must verify

work page
[46]

verdict":

**Proposed Proof **: The proof of the Assertion to verify. Instructions: - Verify ONLY the proposed proof of the Assertion. - Read the Assertion carefully and analyze the proof step by step. - Identify any incorrect, unjustified, or logically invalid reasoning. - Pay close attention to potentially confusing or ambiguous interpretations of concepts. - When...

work page
[47]

Math Problem: The original problem statement. 20

work page
[48]

Original Solution Steps: The solver’s original solution split into indexed steps, written as: <step>[0] ...</step> <step>[1] ...</step> and so on

work page
[49]

This rewrite was produced automatically and may contain artifacts that are not present in the original solution

Rewritten Proof: A structured rewrite of the original solution, decomposed into theorem / proposition / lemma / claim / fact blocks. This rewrite was produced automatically and may contain artifacts that are not present in the original solution

work page
[50]

These may or may not be genuine errors in the original solution

Potential Errors: A list of potential errors flagged by automated verification of the rewritten proof. These may or may not be genuine errors in the original solution. Evaluation process:

work page
[51]

For each flagged issue, decide whether it is a genuine mathematical error in the ORIGINAL solution or a false alarm caused by the rewrite

Validate each potential error. For each flagged issue, decide whether it is a genuine mathematical error in the ORIGINAL solution or a false alarm caused by the rewrite. Cross-check against the original solution steps. If the original solution contains the missing reasoning, a stronger equivalent argument, or a correct alternative route, do not count the ...

work page
[52]

The potential errors are useful leads, but your final output must be based on the original solution steps

Inspect the original steps as needed. The potential errors are useful leads, but your final output must be based on the original solution steps. Check any original steps needed to identify the first genuine incorrect step. If you find that the first incorrect step was not covered by any potential error, report it in <additional_errors>

work page
[53]

For each original step, determine whether that step is correct or incorrect

Grade the original steps, not the rewrite. For each original step, determine whether that step is correct or incorrect. - A correct step is one where all mathematical content is correct and logically consistent with the problem and all previous correct steps. - An incorrect step is one that contains a mathematical error, an unjustified load- bearing claim...

work page
[54]

corrected typos in the proof of Theorem 3

Identify the first incorrect step. The first incorrect step is the smallest original step index whose content is mathematically incorrect or whose reasoning first depends on an error. If every original step is correct, output -1. Important calibration rules: - The original solution steps are authoritative for the final verdict. - The rewritten proof and p...

work page 2025
[55]

Use this to read the paper as a human reader would, with all theorem/lemma/proposition numbers rendered

The rendered PDF (attached as a file). Use this to read the paper as a human reader would, with all theorem/lemma/proposition numbers rendered

work page
[56]

Theorem 19

The raw LaTeX source (included below). Use this to inspect precise notation, equations, and label names if needed. Your task is to identify any mathematical errors in the paper that would require revision before publication. Focus on errors of mathematical substance: incorrect proofs, unjustified steps, false claims, gaps in reasoning, miscomputed quantit...

work page
[57]

Use it to read the paper as a human would and to identify the rendered numerical or letter labels of every theorem/lemma/proposition/corollary/claim

The rendered PDF of the paper (attached as a file). Use it to read the paper as a human would and to identify the rendered numerical or letter labels of every theorem/lemma/proposition/corollary/claim

work page
[58]

Use it to inspect precise notation, equations, and the exact wording of each statement and proof

The raw LaTeX source of the paper (included below). Use it to inspect precise notation, equations, and the exact wording of each statement and proof. Structure: - The top level of the rewrite is a list of **theorems**. Use one <THEOREM_STATEMENT id="N"> block per top-level result the paper proves. Top-level results are the paper’s flagship theorem(s) plus...

work page
[59]

Propositions (shared pool, numbered globally across all theorems)

work page
[60]

N"> may cite any of the propositions; a <PROPOSITION_PROOF id=

Lemmas (numbered within each proposition) - Do not introduce any deeper hierarchy. If a deeper tree seems natural, flatten it into a sequential list of lemmas inside the relevant proposition. - Each <THEOREM_PROOF id="N"> may cite any of the propositions; a <PROPOSITION_PROOF id="K"> may cite its own lemmas and the statements of earlier propositions. A th...

work page
[61]

The rendered PDF of the original paper (attached as a file)

work page
[62]

The raw LaTeX source of the paper (included below -- ground truth -- your rewrite must faithfully represent this)

work page
[63]

Your previous rewritten paper, which contained faithfulness errors

work page
[64]

Requirements for the new rewrite: - Fix every issue listed in the Identified Errors

The specific discrepancies flagged by the checker. Requirements for the new rewrite: - Fix every issue listed in the Identified Errors. For each issue, make sure the new rewrite no longer deviates from the original paper in that way. - Do NOT introduce new discrepancies: do not strengthen/weaken claims, omit steps, add arguments, drift in notation, or mis...

work page
[65]

**Original Paper (LaTeX) **: The full LaTeX source of the original paper

work page
[66]

These are provided so you can understand the scope of the Assertion

**Contexts**: Statements from the rewritten paper that the Assertion inherits definitions or assumptions from (e.g., the enclosing theorem statement, the enclosing proposition statement). These are provided so you can understand the scope of the Assertion

work page
[67]

You may use them as reference points

**Established Results **: Statements from the rewritten paper that have already been checked and can be assumed to be faithfully rewritten. You may use them as reference points. 28

work page
[68]

**Assertion**: The specific rewritten statement whose faithfulness you must verify

work page
[69]

Your task is to determine whether the Assertion and its Proposed Proof **faithfully represent** the corresponding part of the Original Paper

**Proposed Proof **: The rewritten proof of the Assertion (may be empty for a top- level theorem statement). Your task is to determine whether the Assertion and its Proposed Proof **faithfully represent** the corresponding part of the Original Paper. Check for:

work page
[70]

**Strengthened or weakened claims **: Does the Assertion claim more or less than the original paper establishes at the corresponding point?

work page
[71]

**Omitted content **: Does the Proposed Proof drop a non-trivial argument that appears in the original paper?

work page
[72]

**Added content **: Does the Proposed Proof introduce new arguments, repairs, or proof ideas not present in the original paper?

work page
[73]

**Notation drift **: Are variables, functions, or definitions used differently than in the original paper?

work page
[74]

**Misinterpretation**: Does the Assertion or Proposed Proof misunderstand the original’s reasoning or logical structure?

work page
[75]

verdict":

**Scope errors **: Are assumptions incorrectly inherited, dropped, or added compared to the original? Instructions: - Do NOT judge whether the original paper is mathematically correct. Your sole task is faithfulness. - Only flag changes that alter mathematical meaning. Cosmetic rephrasing is fine. - Use the Contexts to understand what the Assertion is all...

work page

[1] [1]

Learning to give checkable answers with prover-verifier games.arXiv preprint arXiv:2108.12099, 2021

Cem Anil, Guodong Zhang, Yuhuai Wu, and Roger Grosse. Learning to give checkable answers with prover-verifier games.arXiv preprint arXiv:2108.12099, 2021

work page arXiv 2021

[2] [2]

Towards a Lean proof of Fermat’s Last Theorem.https: //imperialcollegelondon.github.io/FLT/blueprint.pdf, 2024

Kevin Buzzard and Richard Taylor. Towards a Lean proof of Fermat’s Last Theorem.https: //imperialcollegelondon.github.io/FLT/blueprint.pdf, 2024

work page 2024

[3] [3]

Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, and Kwan-Yee K. Wong. Spc: Evolving self-play critic via adversarial games for llm reasoning.arXiv preprint arXiv:2504.19162, 2025

work page arXiv 2025

[4] [4]

Power-free palindromes and reversed primes: S

Shashi Chourasiya and Daniel R Johnston. Power-free palindromes and reversed primes: S. chourasiya, dr johnston.Monatshefte für Mathematik, pages 1–17, 2025

work page 2025

[5] [5]

Escaping the cognitive well: Efficient competition math with off-the-shelf models,

Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, and Sanjeev Arora. Escaping the cognitive well: Efficient competition math with off-the-shelf models,

work page

[6] [6]

URLhttps://arxiv.org/abs/2602.16793

work page arXiv

[7] [7]

The lean theorem prover (system description)

Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). InInternational Conference on Automated Deduction, pages 378–388. Springer, 2015

work page 2015

[8] [8]

Finite codimension stability of invariant surfaces.arXiv preprint arXiv:2502.00898, 2025

Giovanni Forni. Finite codimension stability of invariant surfaces.arXiv preprint arXiv:2502.00898, 2025

work page arXiv 2025

[9] [9]

Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback

Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Dayiheng Liu, Chang Zhou, Wen Xiao, Tianyu Liu, and Baobao Chang. Llm critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. InFindings of the Association for Computational Linguistics: ACL 2025, pages 14588–14604. Associati...

work page 2025

[10] [10]

Formal proof – theory and practice.Notices of the American Mathematical Society, 55:1395–1406, 2008

John Harrison. Formal proof – theory and practice.Notices of the American Mathematical Society, 55:1395–1406, 2008

work page 2008

[11] [11]

Context rot: How increasing input tokens impacts llm performance.URL https://research

Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance.URL https://research. trychroma. com/context-rot, retrieved October, 20: 2025, 2025

work page 2025

[12] [12]

Pessimistic verification for open ended math questions.arXiv preprint arXiv:2511.21522, 2025

Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, and Yang Liu. Pessimistic verification for open ended math questions.arXiv preprint arXiv:2511.21522, 2025

work page arXiv 2025

[13] [13]

Winning gold at imo 2025 with a model-agnostic verification- and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

Yichen Huang and Lin F Yang. Winning gold at imo 2025 with a model-agnostic verification- and-refinement pipeline.arXiv preprint arXiv:2507.15855, 2025

work page arXiv 2025

[14] [14]

Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

work page arXiv 2024

[15] [15]

PrimeNumberTheoremAnd: A formalization of the prime number theorem and related results in Lean 4

Alex Kontorovich and Terence Tao. PrimeNumberTheoremAnd: A formalization of the prime number theorem and related results in Lean 4. https://github.com/ AlexKontorovich/PrimeNumberTheoremAnd, 2024

work page 2024

[16] [16]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024. 12

work page 2024

[17] [17]

Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, et al. Goedel-prover: A frontier model for open-source automated theorem proving.arXiv preprint arXiv:2502.07640, 2025

work page arXiv 2025

[18] [18]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

work page internal anchor Pith review arXiv 2025

[19] [19]

Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, and Dong Yu. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. InProceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025

work page 2025

[20] [20]

Towards robust mathematical reasoning

Minh-Thang Luong, Dawsen Hwang, Hoang H Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, et al. Towards robust mathematical reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, 2025

work page 2025

[21] [21]

Scaling generative verifiers for natural language mathematical proof verification and selection.arXiv preprint arXiv:2511.13027,

Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, and Igor Gitman. Scaling generative ver- ifiers for natural language mathematical proof verification and selection.arXiv preprint arXiv:2511.13027, 2025

work page arXiv 2025

[22] [22]

The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023

work page arXiv 2023

[23] [23]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

work page 2021

[24] [24]

Springer, 2002

Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson.Isabelle/HOL: a proof assistant for higher-order logic. Springer, 2002

work page 2002

[25] [25]

Hard2verify: A step-level verification benchmark for open-ended frontier math.arXiv preprint arXiv:2510.13744, 2025

Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, and Shafiq Joty. Hard2verify: A step-level verification benchmark for open-ended frontier math.arXiv preprint arXiv:2510.13744, 2025

work page arXiv 2025

[26] [26]

Real algebraic surfaces biholomorphically equivalent but not algebraically equivalent.arXiv preprint arXiv:2503.08957, 2025

Guillaume Rond. Real algebraic surfaces biholomorphically equivalent but not algebraically equivalent.arXiv preprint arXiv:2503.08957, 2025

work page arXiv 2025

[27] [27]

Liquid tensor experiment.Experimental Mathematics, 31(2):349–354, 2022

Peter Scholze. Liquid tensor experiment.Experimental Mathematics, 31(2):349–354, 2022

work page 2022

[28] [28]

Deepseekmath-v2: Towards self-verifiable mathematical reasoning

Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025

work page arXiv 2025

[29] [29]

Heimdall: test-time scaling on the generative verification.arXiv preprint arXiv:2504.10337, 2025

Wenlei Shi and Xing Jin. Heimdall: test-time scaling on the generative verification.arXiv preprint arXiv:2504.10337, 2025

work page arXiv 2025

[30] [30]

A promising path towards autoformalization and general artificial intelli- gence

Christian Szegedy. A promising path towards autoformalization and general artificial intelli- gence. InInternational Conference on Intelligent Computer Mathematics, pages 3–20. Springer, 2020

work page 2020

[31] [31]

From solving to verifying: A unified objective for robust reasoning in llms, 2025

Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi, Xia Chen, and Baosheng He. From solving to verifying: A unified objective for robust reasoning in llms, 2025. URL https://arxiv.org/abs/2511.15137. 13

work page arXiv 2025

[32] [32]

Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models.Advances in neural information processing systems, 35:32353–32368, 2022

work page 2022

[33] [33]

Pencil: Long thoughts with short memory, 2025

Chenxiao Yang, Nathan Srebro, David McAllester, and Zhiyuan Li. Pencil: Long thoughts with short memory, 2025. URLhttps://arxiv.org/abs/2503.14337

work page arXiv 2025

[34] [34]

Recursive models for long-horizon reasoning

Chenxiao Yang, Nathan Srebro, and Zhiyuan Li. Recursive models for long-horizon reasoning. arXiv preprint arXiv:2603.02112, 2026

work page arXiv 2026

[35] [35]

Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

Weihao Zeng, Yuzhen Huang, and Junxian He. Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

work page arXiv 2026

[36] [36]

arXiv:2408.15240

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. InProceedings of the International Conference on Learning Representations (ICLR), 2025. URL https://arxiv. org/abs/2408.15240

work page arXiv 2025

[37] [37]

Sample, scrutinize and scale: Effective inference-time search by scaling verification.arXiv preprint arXiv:2502.01839, 2025

Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. Sample, scrutinize and scale: Effective inference-time search by scaling verification.arXiv preprint arXiv:2502.01839, 2025

work page arXiv 2025

[38] [38]

Processbench: Identifying process errors in mathematical reasoning

Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. InThe 63rd Annual Meeting of the Association for Computational Linguistics, 2025

work page 2025

[39] [39]

Leanarchitect: Automating blueprint generation for humans and ai, 2026

Thomas Zhu, Pietro Monticone, Jeremy Avigad, and Sean Welleck. Leanarchitect: Automating blueprint generation for humans and ai, 2026. URL https://arxiv.org/abs/2601. 22554. 14 A Context cost of verification: formal statement and proof Definition 2(Good Pseudo-Formal proof).Agood Pseudo-Formal proofis a Pseudo-Formal proof (Definition 1) with the followin...

work page 2026

[40] [40]

The depth of T is D<n

work page

[41] [41]

Definition 3(Block-wise verification algorithm).Let P= (G , T) be a Pseudo-Formal proof with node set V={v 1,

The lengths of the conclusion c, the premises, and the proof are bounded above by L. Definition 3(Block-wise verification algorithm).Let P= (G , T) be a Pseudo-Formal proof with node set V={v 1, . . . ,vn}, where each node is written as v= (P v,c v,π v). For a node v∈V , let AncT(v) denote its ancestors in the scope-inheritance forest T, and define its de...

work page 2026

[42] [42]

1"> Assumptions / Conditions / Definitions. - ... Statement : ... </PROPOSITION_STATEMENT> <LEMMA_STATEMENT id=

Lemmas - Do not introduce any deeper hierarchy. - If a deeper tree seems natural, flatten it into a sequential list of lemmas inside the relevant proposition. - The theorem proof may only cite propositions; a proposition proof may only cite its own lemmas and the statements of earlier propositions. - No trivial decompositions: do not decompose the theorem...

work page

[43] [43]

These are often the parent or ancestor statements of the Assertion, and can be the same as the global theorem

**Contexts**: A sequence of statements from which the Assertion may or may not inherit definitions, assumptions, or conditions. These are often the parent or ancestor statements of the Assertion, and can be the same as the global theorem. They are provided solely so you can understand the definitions and assumptions of the Assertion. They have NOT been ve...

work page

[44] [44]

You may assume all established results are correct and use them freely - do NOT re-verify them

**Established Results **: Statements that have already been verified or can be assumed to be correct. You may assume all established results are correct and use them freely - do NOT re-verify them. The proof of the Assertion can invoke these results as long as the assumptions are properly justified and the definitions are consistent

work page

[45] [45]

**Assertion**: The specific statement whose proof you must verify

work page

[46] [46]

verdict":

**Proposed Proof **: The proof of the Assertion to verify. Instructions: - Verify ONLY the proposed proof of the Assertion. - Read the Assertion carefully and analyze the proof step by step. - Identify any incorrect, unjustified, or logically invalid reasoning. - Pay close attention to potentially confusing or ambiguous interpretations of concepts. - When...

work page

[47] [47]

Math Problem: The original problem statement. 20

work page

[48] [48]

Original Solution Steps: The solver’s original solution split into indexed steps, written as: <step>[0] ...</step> <step>[1] ...</step> and so on

work page

[49] [49]

This rewrite was produced automatically and may contain artifacts that are not present in the original solution

Rewritten Proof: A structured rewrite of the original solution, decomposed into theorem / proposition / lemma / claim / fact blocks. This rewrite was produced automatically and may contain artifacts that are not present in the original solution

work page

[50] [50]

These may or may not be genuine errors in the original solution

Potential Errors: A list of potential errors flagged by automated verification of the rewritten proof. These may or may not be genuine errors in the original solution. Evaluation process:

work page

[51] [51]

For each flagged issue, decide whether it is a genuine mathematical error in the ORIGINAL solution or a false alarm caused by the rewrite

Validate each potential error. For each flagged issue, decide whether it is a genuine mathematical error in the ORIGINAL solution or a false alarm caused by the rewrite. Cross-check against the original solution steps. If the original solution contains the missing reasoning, a stronger equivalent argument, or a correct alternative route, do not count the ...

work page

[52] [52]

The potential errors are useful leads, but your final output must be based on the original solution steps

Inspect the original steps as needed. The potential errors are useful leads, but your final output must be based on the original solution steps. Check any original steps needed to identify the first genuine incorrect step. If you find that the first incorrect step was not covered by any potential error, report it in <additional_errors>

work page

[53] [53]

For each original step, determine whether that step is correct or incorrect

Grade the original steps, not the rewrite. For each original step, determine whether that step is correct or incorrect. - A correct step is one where all mathematical content is correct and logically consistent with the problem and all previous correct steps. - An incorrect step is one that contains a mathematical error, an unjustified load- bearing claim...

work page

[54] [54]

corrected typos in the proof of Theorem 3

Identify the first incorrect step. The first incorrect step is the smallest original step index whose content is mathematically incorrect or whose reasoning first depends on an error. If every original step is correct, output -1. Important calibration rules: - The original solution steps are authoritative for the final verdict. - The rewritten proof and p...

work page 2025

[55] [55]

Use this to read the paper as a human reader would, with all theorem/lemma/proposition numbers rendered

The rendered PDF (attached as a file). Use this to read the paper as a human reader would, with all theorem/lemma/proposition numbers rendered

work page

[56] [56]

Theorem 19

The raw LaTeX source (included below). Use this to inspect precise notation, equations, and label names if needed. Your task is to identify any mathematical errors in the paper that would require revision before publication. Focus on errors of mathematical substance: incorrect proofs, unjustified steps, false claims, gaps in reasoning, miscomputed quantit...

work page

[57] [57]

Use it to read the paper as a human would and to identify the rendered numerical or letter labels of every theorem/lemma/proposition/corollary/claim

The rendered PDF of the paper (attached as a file). Use it to read the paper as a human would and to identify the rendered numerical or letter labels of every theorem/lemma/proposition/corollary/claim

work page

[58] [58]

Use it to inspect precise notation, equations, and the exact wording of each statement and proof

The raw LaTeX source of the paper (included below). Use it to inspect precise notation, equations, and the exact wording of each statement and proof. Structure: - The top level of the rewrite is a list of **theorems**. Use one <THEOREM_STATEMENT id="N"> block per top-level result the paper proves. Top-level results are the paper’s flagship theorem(s) plus...

work page

[59] [59]

Propositions (shared pool, numbered globally across all theorems)

work page

[60] [60]

N"> may cite any of the propositions; a <PROPOSITION_PROOF id=

Lemmas (numbered within each proposition) - Do not introduce any deeper hierarchy. If a deeper tree seems natural, flatten it into a sequential list of lemmas inside the relevant proposition. - Each <THEOREM_PROOF id="N"> may cite any of the propositions; a <PROPOSITION_PROOF id="K"> may cite its own lemmas and the statements of earlier propositions. A th...

work page

[61] [61]

The rendered PDF of the original paper (attached as a file)

work page

[62] [62]

The raw LaTeX source of the paper (included below -- ground truth -- your rewrite must faithfully represent this)

work page

[63] [63]

Your previous rewritten paper, which contained faithfulness errors

work page

[64] [64]

Requirements for the new rewrite: - Fix every issue listed in the Identified Errors

The specific discrepancies flagged by the checker. Requirements for the new rewrite: - Fix every issue listed in the Identified Errors. For each issue, make sure the new rewrite no longer deviates from the original paper in that way. - Do NOT introduce new discrepancies: do not strengthen/weaken claims, omit steps, add arguments, drift in notation, or mis...

work page

[65] [65]

**Original Paper (LaTeX) **: The full LaTeX source of the original paper

work page

[66] [66]

These are provided so you can understand the scope of the Assertion

**Contexts**: Statements from the rewritten paper that the Assertion inherits definitions or assumptions from (e.g., the enclosing theorem statement, the enclosing proposition statement). These are provided so you can understand the scope of the Assertion

work page

[67] [67]

You may use them as reference points

**Established Results **: Statements from the rewritten paper that have already been checked and can be assumed to be faithfully rewritten. You may use them as reference points. 28

work page

[68] [68]

**Assertion**: The specific rewritten statement whose faithfulness you must verify

work page

[69] [69]

Your task is to determine whether the Assertion and its Proposed Proof **faithfully represent** the corresponding part of the Original Paper

**Proposed Proof **: The rewritten proof of the Assertion (may be empty for a top- level theorem statement). Your task is to determine whether the Assertion and its Proposed Proof **faithfully represent** the corresponding part of the Original Paper. Check for:

work page

[70] [70]

**Strengthened or weakened claims **: Does the Assertion claim more or less than the original paper establishes at the corresponding point?

work page

[71] [71]

**Omitted content **: Does the Proposed Proof drop a non-trivial argument that appears in the original paper?

work page

[72] [72]

**Added content **: Does the Proposed Proof introduce new arguments, repairs, or proof ideas not present in the original paper?

work page

[73] [73]

**Notation drift **: Are variables, functions, or definitions used differently than in the original paper?

work page

[74] [74]

**Misinterpretation**: Does the Assertion or Proposed Proof misunderstand the original’s reasoning or logical structure?

work page

[75] [75]

verdict":

**Scope errors **: Are assumptions incorrectly inherited, dropped, or added compared to the original? Instructions: - Do NOT judge whether the original paper is mathematically correct. Your sole task is faithfulness. - Only flag changes that alter mathematical meaning. Cosmetic rephrasing is fine. - Use the Contexts to understand what the Assertion is all...

work page