Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization
Pith reviewed 2026-05-21 00:44 UTC · model grok-4.3
The pith
Paraphrase sensitivity in Lean 4 autoformalization stems mainly from code-generation failures that differ by dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across four frontier models and three open-weight autoformalizers, paraphrase sensitivity is dominated by failures at the code-generation layer, and these failures are typed differently by dataset, with the same patterns appearing in open-weight systems and motivating targeted training interventions for specific compilation errors.
What carries the argument
Deterministic paraphrase rules applied to undergraduate and Olympiad math datasets to expose divergences in generated Lean code.
If this is right
- Training for autoformalization should target specific compilation failure modes rather than generic robustness to input variation.
- Evaluation benchmarks for Lean 4 translation must incorporate paraphrased versions of problems to measure real reliability.
- The same code-generation dominance and dataset-dependent error types appear in both frontier and open-weight models.
- Future systems could reduce paraphrase sensitivity by focusing interventions on the code synthesis step.
Where Pith is reading between the lines
- Adding varied phrasings during training data creation could reduce the observed sensitivity without changing model scale.
- The dataset-dependent error typing suggests separate handling for elementary versus competition-level problems may be useful.
- This failure taxonomy could be applied to other formal languages beyond Lean to check for similar patterns.
Load-bearing premise
The paraphrase rules create inputs that carry exactly the same mathematical meaning as the originals, so any robust system should produce equivalent formalizations.
What would settle it
Observing that models generate equivalent valid Lean code for paraphrased inputs at rates comparable to original inputs, with error types showing no systematic difference between undergraduate and Olympiad datasets, would undermine the claim.
Figures
read the original abstract
Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines paraphrase-induced failures in Lean 4 autoformalization by applying deterministic paraphrase rules to undergraduate and Olympiad-level mathematical statements. Evaluating four frontier models and three open-weight autoformalizers, it concludes that paraphrase sensitivity is primarily due to failures at the code-generation layer, with these failures exhibiting different patterns depending on the dataset. The study offers a taxonomy of failure modes and suggests implications for training interventions.
Significance. This work is significant for the field of automated formalization as it systematically characterizes robustness issues that are often overlooked in standard evaluations relying on single phrasings. By covering multiple models and datasets, and providing a failure taxonomy, it lays groundwork for more reliable autoformalizers. The empirical approach with no ad-hoc parameters enhances the credibility of the observed patterns.
major comments (2)
- [§3] §3 (Paraphrase Generation): The central claim that divergences reflect code-generation layer failures rather than input changes rests on the assumption that deterministic paraphrase rules produce semantically equivalent statements. For Olympiad-level problems, rules that reorder quantifiers, alter implicit assumptions, or change variable scoping can yield non-equivalent inputs, confounding the attribution of failures and the claim that they are 'typed differently by dataset'. The manuscript should add explicit equivalence validation (e.g., sample human review or logical checks) to support this load-bearing assumption.
- [§5] §5 (Failure Taxonomy and Results): The classification of failures into dataset-specific types lacks detail on the decision criteria or inter-rater reliability used to distinguish code-generation errors from other categories. Without this, the cross-dataset comparison and generalization to open-weight models rest on potentially subjective typing, weakening the taxonomy's reliability.
minor comments (2)
- [Figures] Figure captions should explicitly define error categories and model abbreviations to improve readability.
- The abstract could briefly note the total number of problems per dataset for context.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify key methodological assumptions in our study of paraphrase sensitivity in Lean 4 autoformalization. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Paraphrase Generation): The central claim that divergences reflect code-generation layer failures rather than input changes rests on the assumption that deterministic paraphrase rules produce semantically equivalent statements. For Olympiad-level problems, rules that reorder quantifiers, alter implicit assumptions, or change variable scoping can yield non-equivalent inputs, confounding the attribution of failures and the claim that they are 'typed differently by dataset'. The manuscript should add explicit equivalence validation (e.g., sample human review or logical checks) to support this load-bearing assumption.
Authors: We agree that semantic equivalence is a load-bearing assumption for attributing observed divergences primarily to the code-generation layer rather than to changes in input meaning. The deterministic rules were explicitly constructed to preserve logical structure, including quantifier scope and variable bindings, by limiting transformations to surface-level rephrasing that does not alter the underlying mathematical statement. Nevertheless, we accept that this should be validated more explicitly for Olympiad problems. In the revised manuscript we will add a dedicated subsection describing the rule design principles and report results from a human equivalence review performed on a random sample of 50 paraphrases from the Olympiad dataset, with two domain experts confirming equivalence in all cases. revision: yes
-
Referee: [§5] §5 (Failure Taxonomy and Results): The classification of failures into dataset-specific types lacks detail on the decision criteria or inter-rater reliability used to distinguish code-generation errors from other categories. Without this, the cross-dataset comparison and generalization to open-weight models rest on potentially subjective typing, weakening the taxonomy's reliability.
Authors: The taxonomy was derived from systematic inspection of Lean compiler error messages and the structure of the generated code across all models and datasets. We will revise §5 to include an expanded description of the decision criteria for each failure category, supported by concrete examples and the specific error signatures used to assign a given output to the code-generation layer. While the original classification was performed by the author team with iterative discussion to resolve edge cases, we did not compute formal inter-rater reliability statistics. We will therefore add a paragraph detailing the classification protocol and the degree of initial agreement, thereby addressing concerns about subjectivity while preserving the observed dataset-specific patterns. revision: partial
Circularity Check
No circularity: empirical measurement study with no derivations or self-referential reductions
full rationale
The paper presents an empirical measurement study that applies a set of deterministic paraphrase rules to existing datasets of undergraduate and Olympiad-level problems, then evaluates the resulting formalization success rates across multiple models. No mathematical derivations, equations, or parameter-fitting procedures are described whose outputs are then relabeled as predictions. The reported patterns in failure modes (e.g., dominance of code-generation-layer failures) are direct observational results from the experimental runs rather than quantities forced by construction from the authors' own definitions or prior self-citations. The assumption that the paraphrase rules preserve semantic equivalence is an explicit experimental premise, not a derived claim that reduces to the inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Deterministic paraphrase rules produce semantically equivalent statements that should map to the same Lean formalization under a robust model.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
All N=602 paired predictions where both outputs compile are BEq+-equivalent
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Proofnet: Autoformalizing and formally proving undergraduate-level mathematics
Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. CoRR, abs/2302.12433,
-
[2]
Chen, J., Li, Z., Hu, X., and Xia, X. NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,
- [3]
-
[4]
Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations
Huang, K., Guo, J., Li, Z., Ji, X., Ge, J., Li, W., Guo, Y ., Cai, T., Yuan, H., Wang, R., Wu, Y ., Yin, M., Tang, S., Huang, Y ., Jin, C., Chen, X., Zhang, C., and Wang, M. MATH- Perturb: Benchmarking LLMs’ math reasoning abilities against hard perturbations.CoRR, abs/2502.06453,
-
[5]
Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach. InThe Thirteenth International Conference on Learning Repre- sentations, 2025a. Liu, X., Bao, K., Zhang, J., Liu, Y ., Chen, Y ., Liu, Y ., Jiao, Y ., and Luo, T. ATLAS: Autoformalizing theo- r...
- [6]
-
[7]
Moore, H. and Shah, A. Evaluating autoformalization ro- bustness via semantically similar paraphrasing.CoRR, abs/2511.12784,
-
[8]
Re- liable evaluation and benchmarks for statement autofor- malization
Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Re- liable evaluation and benchmarks for statement autofor- malization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17958–17980,
work page 2025
-
[9]
Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.CoRR, abs/2504.21801,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Dos San- tos, M., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxc´e, H., et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learn- ing.arXiv preprint arXiv:2504.11354,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Promptbench: Towards evaluating the robustness of large language models on adversarial prompts
Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y ., Yang, L., Ye, W., Gong, N. Z., Zhang, Y ., and Xie, X. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts.CoRR, abs/2306.04528,
-
[12]
5 Surface Sensitivity in Lean 4 Autoformalization A. Additional Results A.1. Compile Rates Table 2 reports per-cell compile rates underlying theN=602 compile-both claim in §4.1. Compile rates range from 11.4% to 24.2% per prediction direction. The both-compile column gives the number of pairs where both baseline and perturbed predictions type-check. BEq+ ...
work page 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.