pith. sign in

arxiv: 2605.22257 · v1 · pith:N4Z62QJFnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· cs.LO

What are the Right Symmetries for Formal Theorem Proving?

Pith reviewed 2026-05-22 08:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LO
keywords theorem provinglarge language modelssymmetriesequivariancesuccess invariancerewriting categoriestest-time methodsinductive bias
0
0 comments X

The pith

LLM-based theorem provers fail to maintain consistent success rates across semantically equivalent problem statements because they lack proof equivariance and success invariance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces rewriting categories as a framework to capture how proof tactics transform statements in formal mathematics. It uses this to define proof equivariance, where proof distributions should transform predictably under rewrites, and success invariance, where equivalent statements should have the same probability of being proved. State-of-the-art LLM provers exhibit neither property and show large performance differences on equivalent formulations, while state-based next-tactic provers naturally achieve proof equivariance. Test-time methods that aggregate over multiple equivalent rewritings recover success invariance in the sampling limit and raise performance under fixed inference budgets.

Core claim

We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance, requiring equivalent statements to be solved with the same probability. State-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. Test-time methods that aggregate over equivalent rewritings recover success invariance in the sampling limit and improve robustness and performance under fixed inference budgets.

What carries the argument

Rewriting categories: a category-theoretic framework that captures compositional transformations induced by proof tactics and serves to define proof equivariance and success invariance.

If this is right

  • State-based next-tactic provers already satisfy proof equivariance by operating directly on proof states.
  • Aggregating over equivalent rewritings at test time recovers success invariance in the sampling limit.
  • Test-time aggregation improves robustness and performance under fixed inference budgets without retraining.
  • Symmetry should serve as a key inductive bias when designing future LLM-based theorem provers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same symmetry failures may appear in other AI systems for formal reasoning tasks such as program verification or type checking.
  • Training models to be equivariant by design could reduce the need for test-time computation to enforce invariance.
  • The rewriting category approach could be adapted to measure symmetry gaps in non-formal reasoning domains.
  • Human mathematicians may implicitly rely on similar symmetry awareness when selecting equivalent problem formulations.

Load-bearing premise

The rewriting categories must generate a sufficient set of semantically equivalent statements whose proof difficulty remains comparable so that performance differences can be attributed to missing symmetry rather than changes in inherent hardness.

What would settle it

If applying the proposed test-time aggregation over rewritings does not reduce the observed variation in success rates across equivalent formulations or fails to improve overall proof success on a standard benchmark under fixed budgets, that would show the symmetries are not the primary driver of inconsistency.

Figures

Figures reproduced from arXiv: 2605.22257 by Bernardo Cuenca Grau, \.Ismail \.Ilkan Ceylan, Jinwoo Kim, Krzysztof Olejniczak, Radoslav Dimitrov, Xingyue Huang.

Figure 1
Figure 1. Figure 1: Sensitivity of DeepSeek-Prover-V2 [1] to semantics-preserving rewrites. Each pair of statements is mathematically equivalent, yet exhibits drastically different proof success rates under minor transformations. This highlights a systematic violation of symmetry: the prover’s success probability depends strongly on the representation instead of the underlying mathematical content. ∗Equal advising. Preprint. … view at source ↗
Figure 2
Figure 2. Figure 2: An example of tactic irreversibil￾ity in Lean. Applying a tactic transforms x > 1 into x > 0, losing information. LLM-based provers [9–11] process statements as text, making their behavior depend on surface form instead of the underlying mathematical structure. This discrepancy motivates a central question: what are the right symme￾tries for formal theorem proving? In classical machine learning, symmetries… view at source ↗
Figure 3
Figure 3. Figure 3: An example of a successful proof in Lean. The original statement [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Examples of problems solved by the test-time ensemble framework, where seed failed. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Composition [t1 : t2] has the same effect as t3, despite them being syntactically different. Tactic l1 have h0 : x * 3 ≡ 1 [ZMOD 7] := by norm_num Tactic l2 have h0 : x * 3 ≡ 1 [ZMOD 7] := by simpa [hx] [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Tactics l1 and l2 may act differently on some statements. Consider the tactics t1, t2, and t3 shown in [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Lean proofs do not trans￾fer between equivalent statements. In Lean theorem proving, these assumptions no longer apply. Proof correctness is determined by a syntactically strict formal system, where even minor modifications to a statement may invalidate an existing proof. As a result, semantically equiva￾lent statements do not necessarily admit interchangeable proofs, and transferring proofs between relate… view at source ↗
Figure 8
Figure 8. Figure 8: Process of generation of the miniF2F-rw dataset. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: An example of a Lean certificate, proving [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Sampling process for the test-time ensemble mechanism. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: PASS@k success rates on miniF2F-rw for ensemble sizes 2, 4, and 8 (solid lines), with single seed and random-sample baselines as narrower dashed lines. Ensemble and random curves include ± one estimated standard error. C.4 Ablation on the number of sampled rewrites Proposition 15 predicts that ensemble performance should increase with the number of sampled augmentations. We empirically validate this by ev… view at source ↗
read the original abstract

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces rewriting categories, a category-theoretic framework for the compositional, generally non-invertible transformations induced by proof tactics. It formalizes proof equivariance (how proof distributions transform under rewrites) and success invariance (equivalent statements solved with equal probability). The central claims are that state-based next-tactic provers satisfy proof equivariance while state-of-the-art LLM-based provers satisfy neither, as evidenced by large success-rate variation across equivalent formulations generated by the rewriting categories; test-time aggregation over rewritings is shown to recover success invariance in the sampling limit and to improve robustness and performance under fixed inference budgets.

Significance. If the empirical results are supported by controls confirming that the rewriting categories preserve average proof difficulty, the work would identify a missing inductive bias in LLM theorem provers and supply both a formal language for symmetries and a practical test-time remedy. The category-theoretic framing and the theoretical guarantee for the aggregation method are notable strengths that could influence future architecture and inference design in automated theorem proving.

major comments (2)
  1. [§4] §4 (Empirical Evaluation): The central claim that large performance variation across rewriting categories demonstrates failure of success invariance rests on the assumption that the generated statements have comparable proof difficulty. No controls (e.g., proof-length distributions, tactic-applicability statistics, or search-space-size curves) are reported to confirm that difficulty is preserved on average; without them the observed gaps cannot be cleanly attributed to missing symmetry rather than difficulty shifts induced by the non-invertible rewrites.
  2. [§3.2] §3.2 (Success Invariance): The definition of success invariance requires that equivalent statements are solved with the same probability, yet the paper does not provide a quantitative bound or empirical check showing that the rewriting categories produce statements whose intrinsic difficulty remains statistically matched; this weakens the attribution of performance differences solely to the lack of symmetry.
minor comments (2)
  1. [Introduction] The distinction between state-based next-tactic provers and LLM-based provers is introduced late; an earlier explicit comparison in the introduction would improve readability.
  2. [§3] Notation for the objects and morphisms of the rewriting categories would benefit from one or two concrete examples drawn from a standard formal library (e.g., Lean or Isabelle) to make the category-theoretic constructions more accessible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential impact of the rewriting-category framework and the test-time aggregation approach. The primary concern raised—that performance differences across rewritings may reflect changes in proof difficulty rather than violations of success invariance—is a substantive one. We address it directly below and commit to strengthening the empirical section accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Empirical Evaluation): The central claim that large performance variation across rewriting categories demonstrates failure of success invariance rests on the assumption that the generated statements have comparable proof difficulty. No controls (e.g., proof-length distributions, tactic-applicability statistics, or search-space-size curves) are reported to confirm that difficulty is preserved on average; without them the observed gaps cannot be cleanly attributed to missing symmetry rather than difficulty shifts induced by the non-invertible rewrites.

    Authors: We agree that the absence of explicit difficulty controls leaves open the possibility that non-invertible rewrites alter average proof length, branching factor, or search-space size. In the revised manuscript we will add (i) side-by-side histograms and summary statistics of proof lengths for original versus rewritten statements across the evaluated datasets, (ii) average numbers of applicable tactics per state, and (iii) a simple proxy for search-space size (e.g., number of nodes expanded under a fixed budget). These statistics will be reported both in aggregate and broken down by rewrite category so that readers can judge whether difficulty is preserved on average. Should any systematic shift appear, we will discuss its magnitude relative to the observed performance gaps. revision: yes

  2. Referee: [§3.2] §3.2 (Success Invariance): The definition of success invariance requires that equivalent statements are solved with the same probability, yet the paper does not provide a quantitative bound or empirical check showing that the rewriting categories produce statements whose intrinsic difficulty remains statistically matched; this weakens the attribution of performance differences solely to the lack of symmetry.

    Authors: We concur that a quantitative or statistical check on difficulty matching would make the attribution to symmetry violation more robust. In addition to the controls listed in response to §4, we will include a Kolmogorov–Smirnov test (or equivalent) comparing the empirical distributions of the difficulty proxies between original and rewritten statements, together with a brief discussion of effect sizes. This will be placed in §3.2 or a new subsection of the empirical evaluation so that the theoretical definition of success invariance is immediately followed by the supporting empirical verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new definitions and empirical observations rather than self-referential reductions.

full rationale

The paper defines rewriting categories as a new category-theoretic framework to formalize proof equivariance and success invariance, then reports empirical measurements showing that state-of-the-art LLM provers exhibit performance variation across rewrites while state-based provers satisfy equivariance by construction of their operation. The proposed test-time aggregation methods are justified by a sampling-limit argument that follows directly from the definitions of equivalence and invariance, without reducing any fitted parameter or central claim to prior self-citations or tautological inputs. No load-bearing step equates a prediction to its own fitting procedure or imports uniqueness via overlapping-author citations; the work is self-contained against external benchmarks of prover behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on standard category theory plus the domain assumption that proof tactics induce compositional non-invertible transformations. The central new entity is the rewriting category itself, introduced without external falsifiable evidence beyond the definitions.

axioms (1)
  • domain assumption Proof tactics induce compositional, generally non-invertible transformations on statements.
    Invoked to motivate the definition of rewriting categories.
invented entities (1)
  • rewriting categories no independent evidence
    purpose: Category-theoretic structure capturing compositional transformations induced by proof tactics.
    New modeling tool introduced to formalize proof equivariance and success invariance.

pith-pipeline@v0.9.0 · 5782 in / 1314 out tokens · 39443 ms · 2026-05-22T08:11:59.204356+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 5 internal anchors

  1. [1]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

  2. [2]

    Ineq-comp: Benchmarking human-intuitive compositional reasoning in automated theorem proving on inequalities

    Haoyu Zhao, Yihan Geng, Shange Tang, Yong Lin, Bohan Lyu, Hongzhou Lin, Chi Jin, and Sanjeev Arora. Ineq-comp: Benchmarking human-intuitive compositional reasoning in automated theorem proving on inequalities. InAdvances in Neural Information Processing Systems, 2025

  3. [3]

    Evolprover: Advancing automated theorem proving by evolving formalized problems via symmetry and difficulty

    Yuchen Tian, Ruiyuan Huang, WANG XUANWU, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, and Lun Du. Evolprover: Advancing automated theorem proving by evolving formalized problems via symmetry and difficulty. InThe Fourteenth International Conference on Learning Representations, 2026

  4. [4]

    GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models

    Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. InThe Thirteenth International Conference on Learning Representations, 2025

  5. [5]

    Consistency training helps stop sycophancy and jailbreaks.arXiv preprint arXiv:2510.27062, 2025

    Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K Elson, and Rohin Shah. Consistency training helps stop sycophancy and jailbreaks.arXiv preprint arXiv:2510.27062, 2025

  6. [6]

    a is b” fail to learn “b is a

    Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024

  7. [7]

    rewordbench: Benchmarking and improving the robustness of reward models with transformed inputs

    Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, and Marjan Ghazvininejad. rewordbench: Benchmarking and improving the robustness of reward models with transformed inputs. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  8. [8]

    Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking

    Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago SFX Teixeira, Diyi Yang, Ke Wang, et al. Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  9. [9]

    Generative Language Modeling for Automated Theorem Proving

    Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

  10. [10]

    Leandojo: Theorem proving with retrieval-augmented language models

    Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. InAdvances in Neural Information Processing Systems, 2023

  11. [11]

    Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F

    Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. InThe Thirteenth Internati...

  12. [12]

    Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.arXiv preprint arXiv:2104.13478, 2021. 10

  13. [13]

    The lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, 2021

  14. [14]

    Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

    Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

  15. [15]

    Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

    Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

  16. [16]

    Goedel-prover: A frontier model for open-source automated theorem proving

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving. InSecond Conference on Language Modeling, 2025

  17. [17]

    Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. InThe Fourte...

  18. [18]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...

  19. [19]

    Revisiting random walks for learning on graphs

    Jinwoo Kim, Olga Zaghen, Ayhan Suleymanzade, Youngmin Ryou, and Seunghoon Hong. Revisiting random walks for learning on graphs. InThe Thirteenth International Conference on Learning Representations, 2025

  20. [20]

    Inverting data transformations via diffusion sampling.arXiv preprint arXiv:2602.08267, 2026

    Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, and Siamak Ravanbakhsh. Inverting data transformations via diffusion sampling.arXiv preprint arXiv:2602.08267, 2026

  21. [21]

    Equivariance with learned canonicalization functions

    Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ra- vanbakhsh. Equivariance with learned canonicalization functions. InFortieth International Conference on Machine Learning, 2023

  22. [22]

    Tilt your head: Activating the hidden spatial-invariance of classifiers

    Johann Schmidt and Sebastian Stober. Tilt your head: Activating the hidden spatial-invariance of classifiers. InForty-first International Conference on Machine Learning, 2024

  23. [23]

    Lie algebra canonicalization: Equivariant neural operators under arbitrary lie groups

    Zakhar Shumaylov, Peter Zaika, James Rowbottom, Ferdia Sherry, Melanie Weber, and Carola- Bibiane Schönlieb. Lie algebra canonicalization: Equivariant neural operators under arbitrary lie groups. InThe Thirteenth International Conference on Learning Representations, 2025

  24. [24]

    Yu, and Atul Prakash

    Utkarsh Singhal, Ryan Feng, Stella X. Yu, and Atul Prakash. Test-time canonicalization by foundation models for robust perception. InForty-second International Conference on Machine Learning, 2025

  25. [25]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  26. [26]

    minif2f: a cross-system benchmark for formal olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, 2022. 11

  27. [27]

    Query rewriting in retrieval- augmented large language models

    Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval- augmented large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  28. [28]

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. InThe Eleventh International Conference on Learning Representations, 2023

  29. [29]

    Improving black-box robustness with in-context rewriting.Transactions on Machine Learning Research, 2024

    Kyle O’Brien, Nathan Hoyen Ng, Isha Puri, Jorge Mendez-Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, and Thomas Hartvigsen. Improving black-box robustness with in-context rewriting.Transactions on Machine Learning Research, 2024

  30. [30]

    The Lean Mathematical Library

    The mathlib Community. The Lean Mathematical Library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, 2020

  31. [31]

    Ayers, Dragomir Radev, and Jeremy Avigad

    Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023

  32. [32]

    Formalevolve: Neuro-symbolic evolutionary search for diverse and prover-effective autoformalization.arXiv preprint arXiv:2603.19828, 2026

    Haijian Lu, Wei Wang, and Jing Liu. Formalevolve: Neuro-symbolic evolutionary search for diverse and prover-effective autoformalization.arXiv preprint arXiv:2603.19828, 2026

  33. [33]

    DRIFT: Decompose, retrieve, illustrate, then formalize theorems

    Meiru Zhang, Philipp Borchert, Milan Gritta, and Gerasimos Lampouras. DRIFT: Decompose, retrieve, illustrate, then formalize theorems. InThe Fourteenth International Conference on Learning Representations, 2026

  34. [34]

    Proofflow: A dependency graph approach to faithful proof autoformalization

    Rafael Medeiros Cabral, Tuan Manh Do, Yu Xuejun, Wai Ming Tai, Zijin Feng, and SHEN XIN. Proofflow: A dependency graph approach to faithful proof autoformalization. InThe Fourteenth International Conference on Learning Representations, 2026

  35. [35]

    Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach

    Qi Liu, Xinhao Zheng, Xudong Lu, Qinxiang Cao, and Junchi Yan. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. In The Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    Graph representations for higher-order logic and theorem proving

    Aditya Paliwal, Sarah Loos, Markus Rabe, Kshitij Bansal, and Christian Szegedy. Graph representations for higher-order logic and theorem proving. InProceedings of the AAAI Conference on Artificial Intelligence, 2020

  37. [37]

    Graph contrastive pre-training for effective theorem reasoning

    Zhaoyu Li, Binghong Chen, and Xujie Si. Graph contrastive pre-training for effective theorem reasoning. InProceedings of the 38th International Conference on Machine Learning, 2021

  38. [38]

    Graph2tac: online representation learning of formal math concepts

    Lasse Blaauwbroek, Miroslav Olšák, Jason Rute, Fidel Ivan Schaposnik Massolo, Jelle Piepen- brock, and Vasily Pestun. Graph2tac: online representation learning of formal math concepts. InProceedings of the 41st International Conference on Machine Learning, 2024

  39. [39]

    Algebraic positional encod- ings

    Konstantinos Kogkalidis, Jean-Philippe Bernardy, and Vikas Garg. Algebraic positional encod- ings. InAdvances in Neural Information Processing Systems, 2024

  40. [40]

    Learning structure- aware representations of dependent types

    Konstantinos Kogkalidis, Orestis Melkonian, and Jean-Philippe Bernardy. Learning structure- aware representations of dependent types. InAdvances in Neural Information Processing Systems, 2024

  41. [41]

    Equivariance through parameter- sharing

    Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter- sharing. InProceedings of the 34th International Conference on Machine Learning, 2017

  42. [42]

    On the generalization of equivariance and convolution in neural networks to the action of compact groups

    Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. InProceedings of the 35th International Conference on Machine Learning, 2018

  43. [43]

    A group-theoretic framework for data augmentation.Journal of Machine Learning Research, 2020

    Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation.Journal of Machine Learning Research, 2020

  44. [44]

    On the benefits of invariance in neural networks, 2020

    Clare Lyle, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. On the benefits of invariance in neural networks, 2020. 12

  45. [45]

    Position: Categorical deep learning is an algebraic theory of all architectures

    Bruno Gavranovi ´c, Paul Lessard, Andrew Joseph Dudzik, Tamara von Glehn, João Guil- herme Madeira Araújo, and Petar Veliˇckovi´c. Position: Categorical deep learning is an algebraic theory of all architectures. InForty-first International Conference on Machine Learning, 2024

  46. [46]

    Categorical equivariant deep learning: Category-equivariant neural networks and universal approximation theorems.arXiv preprint arXiv:2511.18417, 2025

    Yoshihiro Maruyama. Categorical equivariant deep learning: Category-equivariant neural networks and universal approximation theorems.arXiv preprint arXiv:2511.18417, 2025

  47. [47]

    Conceptual challenges in modern ai, 2024

    Léon Bottou. Conceptual challenges in modern ai, 2024. URL https://youtu.be/ MEeqneHqukg?si=p46WJqu9aPWwVPTR

  48. [48]

    Mathematical structures of language.Interscience tracts in pure and applied mathematics, 1968

    Zellig Harris. Mathematical structures of language.Interscience tracts in pure and applied mathematics, 1968

  49. [49]

    Oxford University Press, 1991

    Zellig Harris.A theory of language and information: a mathematical approach. Oxford University Press, 1991

  50. [50]

    Exploiting code symmetries for learning program semantics

    Kexin Pei, Weichen Li, Qirui Jin, Shuyang Liu, Scott Geng, Lorenzo Cavallaro, Junfeng Yang, and Suman Jana. Exploiting code symmetries for learning program semantics. InForty-first International Conference on Machine Learning, 2024

  51. [51]

    Dynamics reveals structure: Challenging the linear propagation assumption

    Hoyeon Chang, Bálint Mucsányi, and Seong Joon Oh. Dynamics reveals structure: Challenging the linear propagation assumption. InICLR 2026 Workshop on Unifying Concept Representation Learning, 2026

  52. [52]

    Counterfactual fairness

    Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in Neural Information Processing Systems, 2017

  53. [53]

    Position: It’s time to optimize for self-consistency, 2026

    Itamar Pres, Belinda Z Li, Laura Ruis, Zifan Carl Guo, Keya Hu, Mehul Damani, Isha Puri, Ekdeep Singh Lubana, and Jacob Andreas. Position: It’s time to optimize for self-consistency, 2026

  54. [54]

    Cambridge University Press, 2014

    Tom Leinster.Basic category theory. Cambridge University Press, 2014

  55. [55]

    MIT press, 1993

    Glynn Winskel.The formal semantics of programming languages: an introduction. MIT press, 1993

  56. [56]

    Elsevier, 2001

    Herbert B Enderton.A mathematical introduction to logic. Elsevier, 2001

  57. [57]

    Cambridge university press Cambridge, 1989

    Jean-Yves Girard, Paul Taylor, and Yves Lafont.Proofs and types. Cambridge university press Cambridge, 1989

  58. [58]

    Classes of recursively enumerable sets and their decision problems

    Henry Gordon Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical society, 1953

  59. [59]

    A. M. Turing. On computable numbers, with an application to the entscheidungsproblem. Proceedings of the London Mathematical Society, 1937

  60. [60]

    Springer-Verlag, 1993

    Ronald V Book and Friedrich Otto.String-rewriting systems. Springer-Verlag, 1993

  61. [61]

    Cambridge university press, 1998

    Franz Baader and Tobias Nipkow.Term rewriting and all that. Cambridge university press, 1998

  62. [62]

    Equality saturation: a new ap- proach to optimization

    Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. Equality saturation: a new ap- proach to optimization. InProceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, 2009

  63. [63]

    Equivalence by canonicalization for synthesis-backed refactoring.Proceedings of the ACM on Programming Languages, 2024

    Justin Lubin, Jeremy Ferguson, Kevin Ye, Jacob Yim, and Sarah E Chasins. Equivalence by canonicalization for synthesis-backed refactoring.Proceedings of the ACM on Programming Languages, 2024

  64. [64]

    PhD thesis, Master’s thesis, 2024

    Marcus Rossel.An Equality Saturation Tactic for Lean. PhD thesis, Master’s thesis, 2024

  65. [65]

    Introducing gpt-5.2

    OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-26. 13

  66. [66]

    having proved an auxiliary result inside the main proof, no matter how, the model should continue in the same way

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. A Background on category theory Our formalization uses a smal...

  67. [67]

    Keep the same variables and hypotheses names exactly as in the input

  68. [68]

    in increasing order

    Keep the same theorem name stem and append suffixes`_v2`,`_v3`, ... in increasing order

  69. [69]

    Do not include proofs ; every theorem must end exactly at`:= by`with nothing after (`sorry`, tactics , or extra lines )

  70. [70]

    Each rewrite must remain mathematically equivalent to the original theorem

  71. [71]

    Prefer algebraic / logical rewrites

    Avoid trivial whitespace - only edits . Prefer algebraic / logical rewrites

  72. [72]

    DO NOT substitute hypothesis values / terms into other hypotheses or the goal ( e . g . if`b = 30`, do not replace`b`with`30`)

  73. [73]

    Diversity requirement : - Include several SIMPLE rewrites that only use symmetry / commutativity style transformations

    Do not weaken / strengthen the theorem or add junk conjuncts / disjuncts ( forbidden :`\ and True`,`\ iff False`, extra unrelated assumptions / goals ) . Diversity requirement : - Include several SIMPLE rewrites that only use symmetry / commutativity style transformations . - Examples of SIMPLE rewrites : *`a = b`->`b = a` *`a < b`->`b > a` *`2 * x = y`->...

  74. [74]

    Keep the same theorem name as in the input

  75. [75]

    3)`variable_map`maps old variable names→new names ( e

    Do not change the conclusion or hypothesis ** types ** except where a name appears as a binder ( you may rename bound variables consistently ) . 3)`variable_map`maps old variable names→new names ( e . g .`x`→`u`) . 4)`hypothesis_map`maps old hypothesis names→new names ( e . g .`h`→`h1 `or`h0`)

  76. [76]

    6)`original`must be exactly the input theorem string ( after trimming )

    If you rename nothing in a category , use an empty object`{}`for that map . 6)`original`must be exactly the input theorem string ( after trimming ) . 7)`renamed`must be the full theorem line ( s ) ending with`:= by`only - no`sorry`, tactics , or proof body after`by`. Output format : - Output ** only ** one JSON object , no markdown fences , no commentary ...

  77. [77]

    Keep theorem name and binder structure aligned with the original statement unless the augmented state requires changes

  78. [78]

    Preserve hypothesis order from the augmented state

  79. [79]

    Preserve the augmented state's mathematics exactly ( do not simplify away changes )

  80. [80]

    g .`\ sqrt2`->`Real

    Replace shorthand notation with explicit Lean forms when appropriate ( e . g .`\ sqrt2`->`Real . sqrt 2`,`\pi`->`Real . pi`)

Showing first 80 references.