Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

Aryan Sharma; Ethan Lou; William Feng

arxiv: 2604.23135 · v2 · pith:R4DJ6UDXnew · submitted 2026-04-25 · 💻 cs.LG

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

William Feng , Ethan Lou , Aryan Sharma This is my paper

Pith reviewed 2026-05-21 00:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords Lean 4autoformalizationparaphrase sensitivitycode generationformal verificationlanguage modelsmathematical theorems

0 comments

The pith

Paraphrase sensitivity in Lean 4 autoformalization stems mainly from code-generation failures that differ by dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how rephrased but semantically equivalent math theorems affect the ability of language models to produce correct Lean 4 formalizations. It applies fixed paraphrase rules to both undergraduate and Olympiad-level problems and tests the outputs from multiple frontier and open-weight systems. The core observation is that most divergences arise during the generation of valid Lean code rather than from misunderstanding the original statement, and the specific error patterns shift depending on the difficulty and style of the source problems. These patterns hold across both closed and open models, indicating that current autoformalizers remain brittle to natural wording changes even when meaning is preserved.

Core claim

Across four frontier models and three open-weight autoformalizers, paraphrase sensitivity is dominated by failures at the code-generation layer, and these failures are typed differently by dataset, with the same patterns appearing in open-weight systems and motivating targeted training interventions for specific compilation errors.

What carries the argument

Deterministic paraphrase rules applied to undergraduate and Olympiad math datasets to expose divergences in generated Lean code.

If this is right

Training for autoformalization should target specific compilation failure modes rather than generic robustness to input variation.
Evaluation benchmarks for Lean 4 translation must incorporate paraphrased versions of problems to measure real reliability.
The same code-generation dominance and dataset-dependent error types appear in both frontier and open-weight models.
Future systems could reduce paraphrase sensitivity by focusing interventions on the code synthesis step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adding varied phrasings during training data creation could reduce the observed sensitivity without changing model scale.
The dataset-dependent error typing suggests separate handling for elementary versus competition-level problems may be useful.
This failure taxonomy could be applied to other formal languages beyond Lean to check for similar patterns.

Load-bearing premise

The paraphrase rules create inputs that carry exactly the same mathematical meaning as the originals, so any robust system should produce equivalent formalizations.

What would settle it

Observing that models generate equivalent valid Lean code for paraphrased inputs at rates comparable to original inputs, with error types showing no systematic difference between undergraduate and Olympiad datasets, would undermine the claim.

Figures

Figures reproduced from arXiv: 2604.23135 by Aryan Sharma, Ethan Lou, William Feng.

**Figure 1.** Figure 1: (a) Per-model surface consistency under paraphrased inputs on ProofNet# (teal) and miniF2F (orange). Pair agreement ranges from 19–56% across the GPT panel. Error bars are 95% bootstrap CIs. (b) Per-category BEq+ surface consistency, pooled across the GPT panel, with ProofNet# categories (teal) and miniF2F categories (orange) shown together. 4.2. Compiling Variance Failures are typed differently by dataset… view at source ↗

**Figure 2.** Figure 2: ProofNet# surface consistency on the 10 rules shared across both panels (95% bootstrap CIs, paired by source, 5,000 iters). 5. Discussion We study paraphrase sensitivity in Lean 4 autoformalization, and find it to primarily be a compile-boundary phenomenon, not a semantic one. Specifically, 602/602 pairs where both outputs compile are semantically equivalent under BEq+ and structurally near-identical under… view at source ↗

**Figure 3.** Figure 3: Per-(rule, model) BEq+ equivalence on ProofNet# (GPT panel). Rows sorted by panel-mean ascending. Dots denote cells with no paired predictions. A.6. Per-Rule Open-Weight Surface Consistency view at source ↗

read the original abstract

Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows paraphrase sensitivity in Lean 4 autoformalization is mostly code-generation failures that differ by dataset difficulty, but the rules may not always keep inputs equivalent.

read the letter

The main takeaway is that deterministic paraphrases expose consistent failure patterns in current Lean 4 systems, with most issues landing at the code-generation layer rather than earlier parsing or understanding steps. This holds across frontier models and open-weight ones on both undergrad and Olympiad problems, and the authors build a simple taxonomy that separates the failure types by dataset hardness. That targeted measurement is the concrete new piece; prior work noted paraphrase divergence but did not break it down this way for Lean 4 with fixed rules. The study is straightforward and useful for anyone trying to harden autoformalizers, since it points to training interventions aimed at specific compilation errors. The data and model coverage look reasonable for an empirical characterization. The soft spot is the assumption that the paraphrase rules always preserve logical equivalence. For Olympiad statements, reordering quantifiers or changing scoping can shift meaning in ways that would legitimately produce different formalizations, so some of the reported divergences might reflect input changes rather than pure robustness gaps. Without the full methods section it is hard to judge how carefully they checked equivalence or whether the taxonomy was built after seeing the outputs. Still, the patterns are consistent enough that the core observation stands. This is the kind of paper that belongs in a formal-methods or AI-for-math venue. Readers working on autoformalization or robustness testing will get direct value from the taxonomy and the dataset split. It is solid enough to deserve referee time rather than a desk reject, even if revisions will need to tighten the equivalence checks and add clearer classification criteria.

Referee Report

2 major / 2 minor

Summary. The manuscript examines paraphrase-induced failures in Lean 4 autoformalization by applying deterministic paraphrase rules to undergraduate and Olympiad-level mathematical statements. Evaluating four frontier models and three open-weight autoformalizers, it concludes that paraphrase sensitivity is primarily due to failures at the code-generation layer, with these failures exhibiting different patterns depending on the dataset. The study offers a taxonomy of failure modes and suggests implications for training interventions.

Significance. This work is significant for the field of automated formalization as it systematically characterizes robustness issues that are often overlooked in standard evaluations relying on single phrasings. By covering multiple models and datasets, and providing a failure taxonomy, it lays groundwork for more reliable autoformalizers. The empirical approach with no ad-hoc parameters enhances the credibility of the observed patterns.

major comments (2)

[§3] §3 (Paraphrase Generation): The central claim that divergences reflect code-generation layer failures rather than input changes rests on the assumption that deterministic paraphrase rules produce semantically equivalent statements. For Olympiad-level problems, rules that reorder quantifiers, alter implicit assumptions, or change variable scoping can yield non-equivalent inputs, confounding the attribution of failures and the claim that they are 'typed differently by dataset'. The manuscript should add explicit equivalence validation (e.g., sample human review or logical checks) to support this load-bearing assumption.
[§5] §5 (Failure Taxonomy and Results): The classification of failures into dataset-specific types lacks detail on the decision criteria or inter-rater reliability used to distinguish code-generation errors from other categories. Without this, the cross-dataset comparison and generalization to open-weight models rest on potentially subjective typing, weakening the taxonomy's reliability.

minor comments (2)

[Figures] Figure captions should explicitly define error categories and model abbreviations to improve readability.
The abstract could briefly note the total number of problems per dataset for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify key methodological assumptions in our study of paraphrase sensitivity in Lean 4 autoformalization. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§3] §3 (Paraphrase Generation): The central claim that divergences reflect code-generation layer failures rather than input changes rests on the assumption that deterministic paraphrase rules produce semantically equivalent statements. For Olympiad-level problems, rules that reorder quantifiers, alter implicit assumptions, or change variable scoping can yield non-equivalent inputs, confounding the attribution of failures and the claim that they are 'typed differently by dataset'. The manuscript should add explicit equivalence validation (e.g., sample human review or logical checks) to support this load-bearing assumption.

Authors: We agree that semantic equivalence is a load-bearing assumption for attributing observed divergences primarily to the code-generation layer rather than to changes in input meaning. The deterministic rules were explicitly constructed to preserve logical structure, including quantifier scope and variable bindings, by limiting transformations to surface-level rephrasing that does not alter the underlying mathematical statement. Nevertheless, we accept that this should be validated more explicitly for Olympiad problems. In the revised manuscript we will add a dedicated subsection describing the rule design principles and report results from a human equivalence review performed on a random sample of 50 paraphrases from the Olympiad dataset, with two domain experts confirming equivalence in all cases. revision: yes
Referee: [§5] §5 (Failure Taxonomy and Results): The classification of failures into dataset-specific types lacks detail on the decision criteria or inter-rater reliability used to distinguish code-generation errors from other categories. Without this, the cross-dataset comparison and generalization to open-weight models rest on potentially subjective typing, weakening the taxonomy's reliability.

Authors: The taxonomy was derived from systematic inspection of Lean compiler error messages and the structure of the generated code across all models and datasets. We will revise §5 to include an expanded description of the decision criteria for each failure category, supported by concrete examples and the specific error signatures used to assign a given output to the code-generation layer. While the original classification was performed by the author team with iterative discussion to resolve edge cases, we did not compute formal inter-rater reliability statistics. We will therefore add a paragraph detailing the classification protocol and the degree of initial agreement, thereby addressing concerns about subjectivity while preserving the observed dataset-specific patterns. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurement study with no derivations or self-referential reductions

full rationale

The paper presents an empirical measurement study that applies a set of deterministic paraphrase rules to existing datasets of undergraduate and Olympiad-level problems, then evaluates the resulting formalization success rates across multiple models. No mathematical derivations, equations, or parameter-fitting procedures are described whose outputs are then relabeled as predictions. The reported patterns in failure modes (e.g., dominance of code-generation-layer failures) are direct observational results from the experimental runs rather than quantities forced by construction from the authors' own definitions or prior self-citations. The assumption that the paraphrase rules preserve semantic equivalence is an explicit experimental premise, not a derived claim that reduces to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the premise that the chosen paraphrase rules preserve mathematical semantics exactly; this is treated as a background domain assumption rather than something tested within the paper.

axioms (1)

domain assumption Deterministic paraphrase rules produce semantically equivalent statements that should map to the same Lean formalization under a robust model.
Invoked when the authors interpret divergent outputs as model failures rather than changes in meaning.

pith-pipeline@v0.9.0 · 5692 in / 1204 out tokens · 52000 ms · 2026-05-21T00:44:34.831610+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

All N=602 paired predictions where both outputs compile are BEq+-equivalent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

[1]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. CoRR, abs/2302.12433,

work page arXiv
[2]

NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,

Chen, J., Li, Z., Hu, X., and Xia, X. NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,

work page arXiv
[3]

Hao, Y ., Wan, X., and Zhai, C. An investigation of robust- ness of LLMs in mathematical reasoning: Benchmark- ing with mathematically-equivalent transformation of ad- vanced mathematical problems.CoRR, abs/2508.08833,

work page arXiv
[4]

Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations

Huang, K., Guo, J., Li, Z., Ji, X., Ge, J., Li, W., Guo, Y ., Cai, T., Yuan, H., Wang, R., Wu, Y ., Yin, M., Tang, S., Huang, Y ., Jin, C., Chen, X., Zhang, C., and Wang, M. MATH- Perturb: Benchmarking LLMs’ math reasoning abilities against hard perturbations.CoRR, abs/2502.06453,

work page arXiv
[5]

Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach

Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach. InThe Thirteenth International Conference on Learning Repre- sentations, 2025a. Liu, X., Bao, K., Zhang, J., Liu, Y ., Chen, Y ., Liu, Y ., Jiao, Y ., and Luo, T. ATLAS: Autoformalizing theo- r...

work page arXiv
[6]

CoRR, abs/2406.01940,

work page arXiv
[7]

and Shah, A

Moore, H. and Shah, A. Evaluating autoformalization ro- bustness via semantically similar paraphrasing.CoRR, abs/2511.12784,

work page arXiv
[8]

Re- liable evaluation and benchmarks for statement autofor- malization

Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Re- liable evaluation and benchmarks for statement autofor- malization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17958–17980,

work page 2025
[9]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.CoRR, abs/2504.21801,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Dos San- tos, M., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxc´e, H., et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learn- ing.arXiv preprint arXiv:2504.11354,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Promptbench: Towards evaluating the robustness of large language models on adversarial prompts

Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y ., Yang, L., Ye, W., Gong, N. Z., Zhang, Y ., and Xie, X. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts.CoRR, abs/2306.04528,

work page arXiv
[12]

ifPthenQ

5 Surface Sensitivity in Lean 4 Autoformalization A. Additional Results A.1. Compile Rates Table 2 reports per-cell compile rates underlying theN=602 compile-both claim in §4.1. Compile rates range from 11.4% to 24.2% per prediction direction. The both-compile column gives the number of pairs where both baseline and perturbed predictions type-check. BEq+ ...

work page 2048

[1] [1]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics

Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. CoRR, abs/2302.12433,

work page arXiv

[2] [2]

NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,

Chen, J., Li, Z., Hu, X., and Xia, X. NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,

work page arXiv

[3] [3]

Hao, Y ., Wan, X., and Zhai, C. An investigation of robust- ness of LLMs in mathematical reasoning: Benchmark- ing with mathematically-equivalent transformation of ad- vanced mathematical problems.CoRR, abs/2508.08833,

work page arXiv

[4] [4]

Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations

Huang, K., Guo, J., Li, Z., Ji, X., Ge, J., Li, W., Guo, Y ., Cai, T., Yuan, H., Wang, R., Wu, Y ., Yin, M., Tang, S., Huang, Y ., Jin, C., Chen, X., Zhang, C., and Wang, M. MATH- Perturb: Benchmarking LLMs’ math reasoning abilities against hard perturbations.CoRR, abs/2502.06453,

work page arXiv

[5] [5]

Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach

Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach. InThe Thirteenth International Conference on Learning Repre- sentations, 2025a. Liu, X., Bao, K., Zhang, J., Liu, Y ., Chen, Y ., Liu, Y ., Jiao, Y ., and Luo, T. ATLAS: Autoformalizing theo- r...

work page arXiv

[6] [6]

CoRR, abs/2406.01940,

work page arXiv

[7] [7]

and Shah, A

Moore, H. and Shah, A. Evaluating autoformalization ro- bustness via semantically similar paraphrasing.CoRR, abs/2511.12784,

work page arXiv

[8] [8]

Re- liable evaluation and benchmarks for statement autofor- malization

Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Re- liable evaluation and benchmarks for statement autofor- malization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17958–17980,

work page 2025

[9] [9]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.CoRR, abs/2504.21801,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Dos San- tos, M., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxc´e, H., et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learn- ing.arXiv preprint arXiv:2504.11354,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Promptbench: Towards evaluating the robustness of large language models on adversarial prompts

Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y ., Yang, L., Ye, W., Gong, N. Z., Zhang, Y ., and Xie, X. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts.CoRR, abs/2306.04528,

work page arXiv

[12] [12]

ifPthenQ

5 Surface Sensitivity in Lean 4 Autoformalization A. Additional Results A.1. Compile Rates Table 2 reports per-cell compile rates underlying theN=602 compile-both claim in §4.1. Compile rates range from 11.4% to 24.2% per prediction direction. The both-compile column gives the number of pairs where both baseline and perturbed predictions type-check. BEq+ ...

work page 2048