What are the Right Symmetries for Formal Theorem Proving?

Bernardo Cuenca Grau; \.Ismail \.Ilkan Ceylan; Jinwoo Kim; Krzysztof Olejniczak; Radoslav Dimitrov; Xingyue Huang

arxiv: 2605.22257 · v1 · pith:N4Z62QJFnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· cs.LO

What are the Right Symmetries for Formal Theorem Proving?

Krzysztof Olejniczak , Radoslav Dimitrov , Xingyue Huang , Bernardo Cuenca Grau , Jinwoo Kim , \.Ismail \.Ilkan Ceylan This is my paper

Pith reviewed 2026-05-22 08:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LO

keywords theorem provinglarge language modelssymmetriesequivariancesuccess invariancerewriting categoriestest-time methodsinductive bias

0 comments

The pith

LLM-based theorem provers fail to maintain consistent success rates across semantically equivalent problem statements because they lack proof equivariance and success invariance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces rewriting categories as a framework to capture how proof tactics transform statements in formal mathematics. It uses this to define proof equivariance, where proof distributions should transform predictably under rewrites, and success invariance, where equivalent statements should have the same probability of being proved. State-of-the-art LLM provers exhibit neither property and show large performance differences on equivalent formulations, while state-based next-tactic provers naturally achieve proof equivariance. Test-time methods that aggregate over multiple equivalent rewritings recover success invariance in the sampling limit and raise performance under fixed inference budgets.

Core claim

We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance, requiring equivalent statements to be solved with the same probability. State-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. Test-time methods that aggregate over equivalent rewritings recover success invariance in the sampling limit and improve robustness and performance under fixed inference budgets.

What carries the argument

Rewriting categories: a category-theoretic framework that captures compositional transformations induced by proof tactics and serves to define proof equivariance and success invariance.

If this is right

State-based next-tactic provers already satisfy proof equivariance by operating directly on proof states.
Aggregating over equivalent rewritings at test time recovers success invariance in the sampling limit.
Test-time aggregation improves robustness and performance under fixed inference budgets without retraining.
Symmetry should serve as a key inductive bias when designing future LLM-based theorem provers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same symmetry failures may appear in other AI systems for formal reasoning tasks such as program verification or type checking.
Training models to be equivariant by design could reduce the need for test-time computation to enforce invariance.
The rewriting category approach could be adapted to measure symmetry gaps in non-formal reasoning domains.
Human mathematicians may implicitly rely on similar symmetry awareness when selecting equivalent problem formulations.

Load-bearing premise

The rewriting categories must generate a sufficient set of semantically equivalent statements whose proof difficulty remains comparable so that performance differences can be attributed to missing symmetry rather than changes in inherent hardness.

What would settle it

If applying the proposed test-time aggregation over rewritings does not reduce the observed variation in success rates across equivalent formulations or fails to improve overall proof success on a standard benchmark under fixed budgets, that would show the symmetries are not the primary driver of inconsistency.

Figures

Figures reproduced from arXiv: 2605.22257 by Bernardo Cuenca Grau, \.Ismail \.Ilkan Ceylan, Jinwoo Kim, Krzysztof Olejniczak, Radoslav Dimitrov, Xingyue Huang.

**Figure 1.** Figure 1: Sensitivity of DeepSeek-Prover-V2 [1] to semantics-preserving rewrites. Each pair of statements is mathematically equivalent, yet exhibits drastically different proof success rates under minor transformations. This highlights a systematic violation of symmetry: the prover’s success probability depends strongly on the representation instead of the underlying mathematical content. ∗Equal advising. Preprint. … view at source ↗

**Figure 2.** Figure 2: An example of tactic irreversibility in Lean. Applying a tactic transforms x > 1 into x > 0, losing information. LLM-based provers [9–11] process statements as text, making their behavior depend on surface form instead of the underlying mathematical structure. This discrepancy motivates a central question: what are the right symmetries for formal theorem proving? In classical machine learning, symmetries… view at source ↗

**Figure 3.** Figure 3: An example of a successful proof in Lean. The original statement [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Examples of problems solved by the test-time ensemble framework, where seed failed. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Composition [t1 : t2] has the same effect as t3, despite them being syntactically different. Tactic l1 have h0 : x * 3 ≡ 1 [ZMOD 7] := by norm_num Tactic l2 have h0 : x * 3 ≡ 1 [ZMOD 7] := by simpa [hx] [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Tactics l1 and l2 may act differently on some statements. Consider the tactics t1, t2, and t3 shown in [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Lean proofs do not transfer between equivalent statements. In Lean theorem proving, these assumptions no longer apply. Proof correctness is determined by a syntactically strict formal system, where even minor modifications to a statement may invalidate an existing proof. As a result, semantically equivalent statements do not necessarily admit interchangeable proofs, and transferring proofs between relate… view at source ↗

**Figure 8.** Figure 8: Process of generation of the miniF2F-rw dataset. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: An example of a Lean certificate, proving [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Sampling process for the test-time ensemble mechanism. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

**Figure 11.** Figure 11: PASS@k success rates on miniF2F-rw for ensemble sizes 2, 4, and 8 (solid lines), with single seed and random-sample baselines as narrower dashed lines. Ensemble and random curves include ± one estimated standard error. C.4 Ablation on the number of sampled rewrites Proposition 15 predicts that ensemble performance should increase with the number of sampled augmentations. We empirically validate this by ev… view at source ↗

read the original abstract

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces rewriting categories, a category-theoretic framework for the compositional, generally non-invertible transformations induced by proof tactics. It formalizes proof equivariance (how proof distributions transform under rewrites) and success invariance (equivalent statements solved with equal probability). The central claims are that state-based next-tactic provers satisfy proof equivariance while state-of-the-art LLM-based provers satisfy neither, as evidenced by large success-rate variation across equivalent formulations generated by the rewriting categories; test-time aggregation over rewritings is shown to recover success invariance in the sampling limit and to improve robustness and performance under fixed inference budgets.

Significance. If the empirical results are supported by controls confirming that the rewriting categories preserve average proof difficulty, the work would identify a missing inductive bias in LLM theorem provers and supply both a formal language for symmetries and a practical test-time remedy. The category-theoretic framing and the theoretical guarantee for the aggregation method are notable strengths that could influence future architecture and inference design in automated theorem proving.

major comments (2)

[§4] §4 (Empirical Evaluation): The central claim that large performance variation across rewriting categories demonstrates failure of success invariance rests on the assumption that the generated statements have comparable proof difficulty. No controls (e.g., proof-length distributions, tactic-applicability statistics, or search-space-size curves) are reported to confirm that difficulty is preserved on average; without them the observed gaps cannot be cleanly attributed to missing symmetry rather than difficulty shifts induced by the non-invertible rewrites.
[§3.2] §3.2 (Success Invariance): The definition of success invariance requires that equivalent statements are solved with the same probability, yet the paper does not provide a quantitative bound or empirical check showing that the rewriting categories produce statements whose intrinsic difficulty remains statistically matched; this weakens the attribution of performance differences solely to the lack of symmetry.

minor comments (2)

[Introduction] The distinction between state-based next-tactic provers and LLM-based provers is introduced late; an earlier explicit comparison in the introduction would improve readability.
[§3] Notation for the objects and morphisms of the rewriting categories would benefit from one or two concrete examples drawn from a standard formal library (e.g., Lean or Isabelle) to make the category-theoretic constructions more accessible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential impact of the rewriting-category framework and the test-time aggregation approach. The primary concern raised—that performance differences across rewritings may reflect changes in proof difficulty rather than violations of success invariance—is a substantive one. We address it directly below and commit to strengthening the empirical section accordingly.

read point-by-point responses

Referee: [§4] §4 (Empirical Evaluation): The central claim that large performance variation across rewriting categories demonstrates failure of success invariance rests on the assumption that the generated statements have comparable proof difficulty. No controls (e.g., proof-length distributions, tactic-applicability statistics, or search-space-size curves) are reported to confirm that difficulty is preserved on average; without them the observed gaps cannot be cleanly attributed to missing symmetry rather than difficulty shifts induced by the non-invertible rewrites.

Authors: We agree that the absence of explicit difficulty controls leaves open the possibility that non-invertible rewrites alter average proof length, branching factor, or search-space size. In the revised manuscript we will add (i) side-by-side histograms and summary statistics of proof lengths for original versus rewritten statements across the evaluated datasets, (ii) average numbers of applicable tactics per state, and (iii) a simple proxy for search-space size (e.g., number of nodes expanded under a fixed budget). These statistics will be reported both in aggregate and broken down by rewrite category so that readers can judge whether difficulty is preserved on average. Should any systematic shift appear, we will discuss its magnitude relative to the observed performance gaps. revision: yes
Referee: [§3.2] §3.2 (Success Invariance): The definition of success invariance requires that equivalent statements are solved with the same probability, yet the paper does not provide a quantitative bound or empirical check showing that the rewriting categories produce statements whose intrinsic difficulty remains statistically matched; this weakens the attribution of performance differences solely to the lack of symmetry.

Authors: We concur that a quantitative or statistical check on difficulty matching would make the attribution to symmetry violation more robust. In addition to the controls listed in response to §4, we will include a Kolmogorov–Smirnov test (or equivalent) comparing the empirical distributions of the difficulty proxies between original and rewritten statements, together with a brief discussion of effect sizes. This will be placed in §3.2 or a new subsection of the empirical evaluation so that the theoretical definition of success invariance is immediately followed by the supporting empirical verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new definitions and empirical observations rather than self-referential reductions.

full rationale

The paper defines rewriting categories as a new category-theoretic framework to formalize proof equivariance and success invariance, then reports empirical measurements showing that state-of-the-art LLM provers exhibit performance variation across rewrites while state-based provers satisfy equivariance by construction of their operation. The proposed test-time aggregation methods are justified by a sampling-limit argument that follows directly from the definitions of equivalence and invariance, without reducing any fitted parameter or central claim to prior self-citations or tautological inputs. No load-bearing step equates a prediction to its own fitting procedure or imports uniqueness via overlapping-author citations; the work is self-contained against external benchmarks of prover behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on standard category theory plus the domain assumption that proof tactics induce compositional non-invertible transformations. The central new entity is the rewriting category itself, introduced without external falsifiable evidence beyond the definitions.

axioms (1)

domain assumption Proof tactics induce compositional, generally non-invertible transformations on statements.
Invoked to motivate the definition of rewriting categories.

invented entities (1)

rewriting categories no independent evidence
purpose: Category-theoretic structure capturing compositional transformations induced by proof tactics.
New modeling tool introduced to formalize proof equivariance and success invariance.

pith-pipeline@v0.9.0 · 5782 in / 1314 out tokens · 39443 ms · 2026-05-22T08:11:59.204356+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance... and success invariance
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A prover pθ is R-equivariant in proof distribution if pθ(· | T′) = P(t)(pθ(· | T)) for every arrow t : T → T′ in R

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 5 internal anchors

[1]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Ineq-comp: Benchmarking human-intuitive compositional reasoning in automated theorem proving on inequalities

Haoyu Zhao, Yihan Geng, Shange Tang, Yong Lin, Bohan Lyu, Hongzhou Lin, Chi Jin, and Sanjeev Arora. Ineq-comp: Benchmarking human-intuitive compositional reasoning in automated theorem proving on inequalities. InAdvances in Neural Information Processing Systems, 2025

work page 2025
[3]

Evolprover: Advancing automated theorem proving by evolving formalized problems via symmetry and difficulty

Yuchen Tian, Ruiyuan Huang, WANG XUANWU, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, and Lun Du. Evolprover: Advancing automated theorem proving by evolving formalized problems via symmetry and difficulty. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[4]

GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models

Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[5]

Consistency training helps stop sycophancy and jailbreaks.arXiv preprint arXiv:2510.27062, 2025

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K Elson, and Rohin Shah. Consistency training helps stop sycophancy and jailbreaks.arXiv preprint arXiv:2510.27062, 2025

work page arXiv 2025
[6]

a is b” fail to learn “b is a

Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[7]

rewordbench: Benchmarking and improving the robustness of reward models with transformed inputs

Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, and Marjan Ghazvininejad. rewordbench: Benchmarking and improving the robustness of reward models with transformed inputs. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

work page 2025
[8]

Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking

Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago SFX Teixeira, Diyi Yang, Ke Wang, et al. Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

work page 2025
[9]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review arXiv 2009
[10]

Leandojo: Theorem proving with retrieval-augmented language models

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[11]

Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F

Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. InThe Thirteenth Internati...

work page 2025
[12]

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.arXiv preprint arXiv:2104.13478, 2021. 10

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, 2021

work page 2021
[14]

Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

work page arXiv 2024
[15]

Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

work page 2025
[16]

Goedel-prover: A frontier model for open-source automated theorem proving

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving. InSecond Conference on Language Modeling, 2025

work page 2025
[17]

Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. InThe Fourte...

work page 2026
[18]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Revisiting random walks for learning on graphs

Jinwoo Kim, Olga Zaghen, Ayhan Suleymanzade, Youngmin Ryou, and Seunghoon Hong. Revisiting random walks for learning on graphs. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[20]

Inverting data transformations via diffusion sampling.arXiv preprint arXiv:2602.08267, 2026

Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, and Siamak Ravanbakhsh. Inverting data transformations via diffusion sampling.arXiv preprint arXiv:2602.08267, 2026

work page arXiv 2026
[21]

Equivariance with learned canonicalization functions

Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ra- vanbakhsh. Equivariance with learned canonicalization functions. InFortieth International Conference on Machine Learning, 2023

work page 2023
[22]

Tilt your head: Activating the hidden spatial-invariance of classifiers

Johann Schmidt and Sebastian Stober. Tilt your head: Activating the hidden spatial-invariance of classifiers. InForty-first International Conference on Machine Learning, 2024

work page 2024
[23]

Lie algebra canonicalization: Equivariant neural operators under arbitrary lie groups

Zakhar Shumaylov, Peter Zaika, James Rowbottom, Ferdia Sherry, Melanie Weber, and Carola- Bibiane Schönlieb. Lie algebra canonicalization: Equivariant neural operators under arbitrary lie groups. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[24]

Yu, and Atul Prakash

Utkarsh Singhal, Ryan Feng, Stella X. Yu, and Atul Prakash. Test-time canonicalization by foundation models for robust perception. InForty-second International Conference on Machine Learning, 2025

work page 2025
[25]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

minif2f: a cross-system benchmark for formal olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, 2022. 11

work page 2022
[27]

Query rewriting in retrieval- augmented large language models

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval- augmented large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[28]

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[29]

Improving black-box robustness with in-context rewriting.Transactions on Machine Learning Research, 2024

Kyle O’Brien, Nathan Hoyen Ng, Isha Puri, Jorge Mendez-Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, and Thomas Hartvigsen. Improving black-box robustness with in-context rewriting.Transactions on Machine Learning Research, 2024

work page 2024
[30]

The Lean Mathematical Library

The mathlib Community. The Lean Mathematical Library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, 2020

work page 2020
[31]

Ayers, Dragomir Radev, and Jeremy Avigad

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023

work page 2023
[32]

Formalevolve: Neuro-symbolic evolutionary search for diverse and prover-effective autoformalization.arXiv preprint arXiv:2603.19828, 2026

Haijian Lu, Wei Wang, and Jing Liu. Formalevolve: Neuro-symbolic evolutionary search for diverse and prover-effective autoformalization.arXiv preprint arXiv:2603.19828, 2026

work page arXiv 2026
[33]

DRIFT: Decompose, retrieve, illustrate, then formalize theorems

Meiru Zhang, Philipp Borchert, Milan Gritta, and Gerasimos Lampouras. DRIFT: Decompose, retrieve, illustrate, then formalize theorems. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[34]

Proofflow: A dependency graph approach to faithful proof autoformalization

Rafael Medeiros Cabral, Tuan Manh Do, Yu Xuejun, Wai Ming Tai, Zijin Feng, and SHEN XIN. Proofflow: A dependency graph approach to faithful proof autoformalization. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[35]

Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach

Qi Liu, Xinhao Zheng, Xudong Lu, Qinxiang Cao, and Junchi Yan. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[36]

Graph representations for higher-order logic and theorem proving

Aditya Paliwal, Sarah Loos, Markus Rabe, Kshitij Bansal, and Christian Szegedy. Graph representations for higher-order logic and theorem proving. InProceedings of the AAAI Conference on Artificial Intelligence, 2020

work page 2020
[37]

Graph contrastive pre-training for effective theorem reasoning

Zhaoyu Li, Binghong Chen, and Xujie Si. Graph contrastive pre-training for effective theorem reasoning. InProceedings of the 38th International Conference on Machine Learning, 2021

work page 2021
[38]

Graph2tac: online representation learning of formal math concepts

Lasse Blaauwbroek, Miroslav Olšák, Jason Rute, Fidel Ivan Schaposnik Massolo, Jelle Piepen- brock, and Vasily Pestun. Graph2tac: online representation learning of formal math concepts. InProceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[39]

Algebraic positional encod- ings

Konstantinos Kogkalidis, Jean-Philippe Bernardy, and Vikas Garg. Algebraic positional encod- ings. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[40]

Learning structure- aware representations of dependent types

Konstantinos Kogkalidis, Orestis Melkonian, and Jean-Philippe Bernardy. Learning structure- aware representations of dependent types. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[41]

Equivariance through parameter- sharing

Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter- sharing. InProceedings of the 34th International Conference on Machine Learning, 2017

work page 2017
[42]

On the generalization of equivariance and convolution in neural networks to the action of compact groups

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. InProceedings of the 35th International Conference on Machine Learning, 2018

work page 2018
[43]

A group-theoretic framework for data augmentation.Journal of Machine Learning Research, 2020

Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation.Journal of Machine Learning Research, 2020

work page 2020
[44]

On the benefits of invariance in neural networks, 2020

Clare Lyle, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. On the benefits of invariance in neural networks, 2020. 12

work page 2020
[45]

Position: Categorical deep learning is an algebraic theory of all architectures

Bruno Gavranovi ´c, Paul Lessard, Andrew Joseph Dudzik, Tamara von Glehn, João Guil- herme Madeira Araújo, and Petar Veliˇckovi´c. Position: Categorical deep learning is an algebraic theory of all architectures. InForty-first International Conference on Machine Learning, 2024

work page 2024
[46]

Categorical equivariant deep learning: Category-equivariant neural networks and universal approximation theorems.arXiv preprint arXiv:2511.18417, 2025

Yoshihiro Maruyama. Categorical equivariant deep learning: Category-equivariant neural networks and universal approximation theorems.arXiv preprint arXiv:2511.18417, 2025

work page arXiv 2025
[47]

Conceptual challenges in modern ai, 2024

Léon Bottou. Conceptual challenges in modern ai, 2024. URL https://youtu.be/ MEeqneHqukg?si=p46WJqu9aPWwVPTR

work page 2024
[48]

Mathematical structures of language.Interscience tracts in pure and applied mathematics, 1968

Zellig Harris. Mathematical structures of language.Interscience tracts in pure and applied mathematics, 1968

work page 1968
[49]

Oxford University Press, 1991

Zellig Harris.A theory of language and information: a mathematical approach. Oxford University Press, 1991

work page 1991
[50]

Exploiting code symmetries for learning program semantics

Kexin Pei, Weichen Li, Qirui Jin, Shuyang Liu, Scott Geng, Lorenzo Cavallaro, Junfeng Yang, and Suman Jana. Exploiting code symmetries for learning program semantics. InForty-first International Conference on Machine Learning, 2024

work page 2024
[51]

Dynamics reveals structure: Challenging the linear propagation assumption

Hoyeon Chang, Bálint Mucsányi, and Seong Joon Oh. Dynamics reveals structure: Challenging the linear propagation assumption. InICLR 2026 Workshop on Unifying Concept Representation Learning, 2026

work page 2026
[52]

Counterfactual fairness

Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in Neural Information Processing Systems, 2017

work page 2017
[53]

Position: It’s time to optimize for self-consistency, 2026

Itamar Pres, Belinda Z Li, Laura Ruis, Zifan Carl Guo, Keya Hu, Mehul Damani, Isha Puri, Ekdeep Singh Lubana, and Jacob Andreas. Position: It’s time to optimize for self-consistency, 2026

work page 2026
[54]

Cambridge University Press, 2014

Tom Leinster.Basic category theory. Cambridge University Press, 2014

work page 2014
[55]

MIT press, 1993

Glynn Winskel.The formal semantics of programming languages: an introduction. MIT press, 1993

work page 1993
[56]

Elsevier, 2001

Herbert B Enderton.A mathematical introduction to logic. Elsevier, 2001

work page 2001
[57]

Cambridge university press Cambridge, 1989

Jean-Yves Girard, Paul Taylor, and Yves Lafont.Proofs and types. Cambridge university press Cambridge, 1989

work page 1989
[58]

Classes of recursively enumerable sets and their decision problems

Henry Gordon Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical society, 1953

work page 1953
[59]

A. M. Turing. On computable numbers, with an application to the entscheidungsproblem. Proceedings of the London Mathematical Society, 1937

work page 1937
[60]

Springer-Verlag, 1993

Ronald V Book and Friedrich Otto.String-rewriting systems. Springer-Verlag, 1993

work page 1993
[61]

Cambridge university press, 1998

Franz Baader and Tobias Nipkow.Term rewriting and all that. Cambridge university press, 1998

work page 1998
[62]

Equality saturation: a new ap- proach to optimization

Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. Equality saturation: a new ap- proach to optimization. InProceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, 2009

work page 2009
[63]

Equivalence by canonicalization for synthesis-backed refactoring.Proceedings of the ACM on Programming Languages, 2024

Justin Lubin, Jeremy Ferguson, Kevin Ye, Jacob Yim, and Sarah E Chasins. Equivalence by canonicalization for synthesis-backed refactoring.Proceedings of the ACM on Programming Languages, 2024

work page 2024
[64]

PhD thesis, Master’s thesis, 2024

Marcus Rossel.An Equality Saturation Tactic for Lean. PhD thesis, Master’s thesis, 2024

work page 2024
[65]

Introducing gpt-5.2

OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-26. 13

work page 2026
[66]

having proved an auxiliary result inside the main proof, no matter how, the model should continue in the same way

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. A Background on category theory Our formalization uses a smal...

work page 2023
[67]

Keep the same variables and hypotheses names exactly as in the input

work page
[68]

in increasing order

Keep the same theorem name stem and append suffixes`_v2`,`_v3`, ... in increasing order

work page
[69]

Do not include proofs ; every theorem must end exactly at`:= by`with nothing after (`sorry`, tactics , or extra lines )

work page
[70]

Each rewrite must remain mathematically equivalent to the original theorem

work page
[71]

Prefer algebraic / logical rewrites

Avoid trivial whitespace - only edits . Prefer algebraic / logical rewrites

work page
[72]

DO NOT substitute hypothesis values / terms into other hypotheses or the goal ( e . g . if`b = 30`, do not replace`b`with`30`)

work page
[73]

Diversity requirement : - Include several SIMPLE rewrites that only use symmetry / commutativity style transformations

Do not weaken / strengthen the theorem or add junk conjuncts / disjuncts ( forbidden :`\ and True`,`\ iff False`, extra unrelated assumptions / goals ) . Diversity requirement : - Include several SIMPLE rewrites that only use symmetry / commutativity style transformations . - Examples of SIMPLE rewrites : *`a = b`->`b = a` *`a < b`->`b > a` *`2 * x = y`->...

work page
[74]

Keep the same theorem name as in the input

work page
[75]

3)`variable_map`maps old variable names→new names ( e

Do not change the conclusion or hypothesis ** types ** except where a name appears as a binder ( you may rename bound variables consistently ) . 3)`variable_map`maps old variable names→new names ( e . g .`x`→`u`) . 4)`hypothesis_map`maps old hypothesis names→new names ( e . g .`h`→`h1 `or`h0`)

work page
[76]

6)`original`must be exactly the input theorem string ( after trimming )

If you rename nothing in a category , use an empty object`{}`for that map . 6)`original`must be exactly the input theorem string ( after trimming ) . 7)`renamed`must be the full theorem line ( s ) ending with`:= by`only - no`sorry`, tactics , or proof body after`by`. Output format : - Output ** only ** one JSON object , no markdown fences , no commentary ...

work page
[77]

Keep theorem name and binder structure aligned with the original statement unless the augmented state requires changes

work page
[78]

Preserve hypothesis order from the augmented state

work page
[79]

Preserve the augmented state's mathematics exactly ( do not simplify away changes )

work page
[80]

g .`\ sqrt2`->`Real

Replace shorthand notation with explicit Lean forms when appropriate ( e . g .`\ sqrt2`->`Real . sqrt 2`,`\pi`->`Real . pi`)

work page

Showing first 80 references.

[1] [1]

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint arXiv:2504.21801, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Ineq-comp: Benchmarking human-intuitive compositional reasoning in automated theorem proving on inequalities

Haoyu Zhao, Yihan Geng, Shange Tang, Yong Lin, Bohan Lyu, Hongzhou Lin, Chi Jin, and Sanjeev Arora. Ineq-comp: Benchmarking human-intuitive compositional reasoning in automated theorem proving on inequalities. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[3] [3]

Evolprover: Advancing automated theorem proving by evolving formalized problems via symmetry and difficulty

Yuchen Tian, Ruiyuan Huang, WANG XUANWU, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, and Lun Du. Evolprover: Advancing automated theorem proving by evolving formalized problems via symmetry and difficulty. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[4] [4]

GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models

Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[5] [5]

Consistency training helps stop sycophancy and jailbreaks.arXiv preprint arXiv:2510.27062, 2025

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K Elson, and Rohin Shah. Consistency training helps stop sycophancy and jailbreaks.arXiv preprint arXiv:2510.27062, 2025

work page arXiv 2025

[6] [6]

a is b” fail to learn “b is a

Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[7] [7]

rewordbench: Benchmarking and improving the robustness of reward models with transformed inputs

Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, and Marjan Ghazvininejad. rewordbench: Benchmarking and improving the robustness of reward models with transformed inputs. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

work page 2025

[8] [8]

Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking

Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago SFX Teixeira, Diyi Yang, Ke Wang, et al. Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

work page 2025

[9] [9]

Generative Language Modeling for Automated Theorem Proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020

work page internal anchor Pith review arXiv 2009

[10] [10]

Leandojo: Theorem proving with retrieval-augmented language models

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. InAdvances in Neural Information Processing Systems, 2023

work page 2023

[11] [11]

Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F

Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. InThe Thirteenth Internati...

work page 2025

[12] [12]

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.arXiv preprint arXiv:2104.13478, 2021. 10

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, 2021

work page 2021

[14] [14]

Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. Hun- yuanprover: A scalable data synthesis framework and guided tree search for automated theorem proving.arXiv preprint arXiv:2412.20735, 2024

work page arXiv 2024

[15] [15]

Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving

Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Ming Ding. Bfs-prover: Scalable best-first tree search for llm-based automatic theorem proving. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

work page 2025

[16] [16]

Goedel-prover: A frontier model for open-source automated theorem proving

Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia LI, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving. InSecond Conference on Language Modeling, 2025

work page 2025

[17] [17]

Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. InThe Fourte...

work page 2026

[18] [18]

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Paga...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Revisiting random walks for learning on graphs

Jinwoo Kim, Olga Zaghen, Ayhan Suleymanzade, Youngmin Ryou, and Seunghoon Hong. Revisiting random walks for learning on graphs. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[20] [20]

Inverting data transformations via diffusion sampling.arXiv preprint arXiv:2602.08267, 2026

Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, and Siamak Ravanbakhsh. Inverting data transformations via diffusion sampling.arXiv preprint arXiv:2602.08267, 2026

work page arXiv 2026

[21] [21]

Equivariance with learned canonicalization functions

Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ra- vanbakhsh. Equivariance with learned canonicalization functions. InFortieth International Conference on Machine Learning, 2023

work page 2023

[22] [22]

Tilt your head: Activating the hidden spatial-invariance of classifiers

Johann Schmidt and Sebastian Stober. Tilt your head: Activating the hidden spatial-invariance of classifiers. InForty-first International Conference on Machine Learning, 2024

work page 2024

[23] [23]

Lie algebra canonicalization: Equivariant neural operators under arbitrary lie groups

Zakhar Shumaylov, Peter Zaika, James Rowbottom, Ferdia Sherry, Melanie Weber, and Carola- Bibiane Schönlieb. Lie algebra canonicalization: Equivariant neural operators under arbitrary lie groups. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[24] [24]

Yu, and Atul Prakash

Utkarsh Singhal, Ryan Feng, Stella X. Yu, and Atul Prakash. Test-time canonicalization by foundation models for robust perception. InForty-second International Conference on Machine Learning, 2025

work page 2025

[25] [25]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

minif2f: a cross-system benchmark for formal olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. InThe Tenth International Conference on Learning Representations, 2022. 11

work page 2022

[27] [27]

Query rewriting in retrieval- augmented large language models

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval- augmented large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023

[28] [28]

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[29] [29]

Improving black-box robustness with in-context rewriting.Transactions on Machine Learning Research, 2024

Kyle O’Brien, Nathan Hoyen Ng, Isha Puri, Jorge Mendez-Mendez, Hamid Palangi, Yoon Kim, Marzyeh Ghassemi, and Thomas Hartvigsen. Improving black-box robustness with in-context rewriting.Transactions on Machine Learning Research, 2024

work page 2024

[30] [30]

The Lean Mathematical Library

The mathlib Community. The Lean Mathematical Library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, 2020

work page 2020

[31] [31]

Ayers, Dragomir Radev, and Jeremy Avigad

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023

work page 2023

[32] [32]

Formalevolve: Neuro-symbolic evolutionary search for diverse and prover-effective autoformalization.arXiv preprint arXiv:2603.19828, 2026

Haijian Lu, Wei Wang, and Jing Liu. Formalevolve: Neuro-symbolic evolutionary search for diverse and prover-effective autoformalization.arXiv preprint arXiv:2603.19828, 2026

work page arXiv 2026

[33] [33]

DRIFT: Decompose, retrieve, illustrate, then formalize theorems

Meiru Zhang, Philipp Borchert, Milan Gritta, and Gerasimos Lampouras. DRIFT: Decompose, retrieve, illustrate, then formalize theorems. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[34] [34]

Proofflow: A dependency graph approach to faithful proof autoformalization

Rafael Medeiros Cabral, Tuan Manh Do, Yu Xuejun, Wai Ming Tai, Zijin Feng, and SHEN XIN. Proofflow: A dependency graph approach to faithful proof autoformalization. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[35] [35]

Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach

Qi Liu, Xinhao Zheng, Xudong Lu, Qinxiang Cao, and Junchi Yan. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[36] [36]

Graph representations for higher-order logic and theorem proving

Aditya Paliwal, Sarah Loos, Markus Rabe, Kshitij Bansal, and Christian Szegedy. Graph representations for higher-order logic and theorem proving. InProceedings of the AAAI Conference on Artificial Intelligence, 2020

work page 2020

[37] [37]

Graph contrastive pre-training for effective theorem reasoning

Zhaoyu Li, Binghong Chen, and Xujie Si. Graph contrastive pre-training for effective theorem reasoning. InProceedings of the 38th International Conference on Machine Learning, 2021

work page 2021

[38] [38]

Graph2tac: online representation learning of formal math concepts

Lasse Blaauwbroek, Miroslav Olšák, Jason Rute, Fidel Ivan Schaposnik Massolo, Jelle Piepen- brock, and Vasily Pestun. Graph2tac: online representation learning of formal math concepts. InProceedings of the 41st International Conference on Machine Learning, 2024

work page 2024

[39] [39]

Algebraic positional encod- ings

Konstantinos Kogkalidis, Jean-Philippe Bernardy, and Vikas Garg. Algebraic positional encod- ings. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[40] [40]

Learning structure- aware representations of dependent types

Konstantinos Kogkalidis, Orestis Melkonian, and Jean-Philippe Bernardy. Learning structure- aware representations of dependent types. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[41] [41]

Equivariance through parameter- sharing

Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter- sharing. InProceedings of the 34th International Conference on Machine Learning, 2017

work page 2017

[42] [42]

On the generalization of equivariance and convolution in neural networks to the action of compact groups

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. InProceedings of the 35th International Conference on Machine Learning, 2018

work page 2018

[43] [43]

A group-theoretic framework for data augmentation.Journal of Machine Learning Research, 2020

Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation.Journal of Machine Learning Research, 2020

work page 2020

[44] [44]

On the benefits of invariance in neural networks, 2020

Clare Lyle, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. On the benefits of invariance in neural networks, 2020. 12

work page 2020

[45] [45]

Position: Categorical deep learning is an algebraic theory of all architectures

Bruno Gavranovi ´c, Paul Lessard, Andrew Joseph Dudzik, Tamara von Glehn, João Guil- herme Madeira Araújo, and Petar Veliˇckovi´c. Position: Categorical deep learning is an algebraic theory of all architectures. InForty-first International Conference on Machine Learning, 2024

work page 2024

[46] [46]

Categorical equivariant deep learning: Category-equivariant neural networks and universal approximation theorems.arXiv preprint arXiv:2511.18417, 2025

Yoshihiro Maruyama. Categorical equivariant deep learning: Category-equivariant neural networks and universal approximation theorems.arXiv preprint arXiv:2511.18417, 2025

work page arXiv 2025

[47] [47]

Conceptual challenges in modern ai, 2024

Léon Bottou. Conceptual challenges in modern ai, 2024. URL https://youtu.be/ MEeqneHqukg?si=p46WJqu9aPWwVPTR

work page 2024

[48] [48]

Mathematical structures of language.Interscience tracts in pure and applied mathematics, 1968

Zellig Harris. Mathematical structures of language.Interscience tracts in pure and applied mathematics, 1968

work page 1968

[49] [49]

Oxford University Press, 1991

Zellig Harris.A theory of language and information: a mathematical approach. Oxford University Press, 1991

work page 1991

[50] [50]

Exploiting code symmetries for learning program semantics

Kexin Pei, Weichen Li, Qirui Jin, Shuyang Liu, Scott Geng, Lorenzo Cavallaro, Junfeng Yang, and Suman Jana. Exploiting code symmetries for learning program semantics. InForty-first International Conference on Machine Learning, 2024

work page 2024

[51] [51]

Dynamics reveals structure: Challenging the linear propagation assumption

Hoyeon Chang, Bálint Mucsányi, and Seong Joon Oh. Dynamics reveals structure: Challenging the linear propagation assumption. InICLR 2026 Workshop on Unifying Concept Representation Learning, 2026

work page 2026

[52] [52]

Counterfactual fairness

Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in Neural Information Processing Systems, 2017

work page 2017

[53] [53]

Position: It’s time to optimize for self-consistency, 2026

Itamar Pres, Belinda Z Li, Laura Ruis, Zifan Carl Guo, Keya Hu, Mehul Damani, Isha Puri, Ekdeep Singh Lubana, and Jacob Andreas. Position: It’s time to optimize for self-consistency, 2026

work page 2026

[54] [54]

Cambridge University Press, 2014

Tom Leinster.Basic category theory. Cambridge University Press, 2014

work page 2014

[55] [55]

MIT press, 1993

Glynn Winskel.The formal semantics of programming languages: an introduction. MIT press, 1993

work page 1993

[56] [56]

Elsevier, 2001

Herbert B Enderton.A mathematical introduction to logic. Elsevier, 2001

work page 2001

[57] [57]

Cambridge university press Cambridge, 1989

Jean-Yves Girard, Paul Taylor, and Yves Lafont.Proofs and types. Cambridge university press Cambridge, 1989

work page 1989

[58] [58]

Classes of recursively enumerable sets and their decision problems

Henry Gordon Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical society, 1953

work page 1953

[59] [59]

A. M. Turing. On computable numbers, with an application to the entscheidungsproblem. Proceedings of the London Mathematical Society, 1937

work page 1937

[60] [60]

Springer-Verlag, 1993

Ronald V Book and Friedrich Otto.String-rewriting systems. Springer-Verlag, 1993

work page 1993

[61] [61]

Cambridge university press, 1998

Franz Baader and Tobias Nipkow.Term rewriting and all that. Cambridge university press, 1998

work page 1998

[62] [62]

Equality saturation: a new ap- proach to optimization

Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. Equality saturation: a new ap- proach to optimization. InProceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, 2009

work page 2009

[63] [63]

Equivalence by canonicalization for synthesis-backed refactoring.Proceedings of the ACM on Programming Languages, 2024

Justin Lubin, Jeremy Ferguson, Kevin Ye, Jacob Yim, and Sarah E Chasins. Equivalence by canonicalization for synthesis-backed refactoring.Proceedings of the ACM on Programming Languages, 2024

work page 2024

[64] [64]

PhD thesis, Master’s thesis, 2024

Marcus Rossel.An Equality Saturation Tactic for Lean. PhD thesis, Master’s thesis, 2024

work page 2024

[65] [65]

Introducing gpt-5.2

OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-26. 13

work page 2026

[66] [66]

having proved an auxiliary result inside the main proof, no matter how, the model should continue in the same way

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. A Background on category theory Our formalization uses a smal...

work page 2023

[67] [67]

Keep the same variables and hypotheses names exactly as in the input

work page

[68] [68]

in increasing order

Keep the same theorem name stem and append suffixes`_v2`,`_v3`, ... in increasing order

work page

[69] [69]

Do not include proofs ; every theorem must end exactly at`:= by`with nothing after (`sorry`, tactics , or extra lines )

work page

[70] [70]

Each rewrite must remain mathematically equivalent to the original theorem

work page

[71] [71]

Prefer algebraic / logical rewrites

Avoid trivial whitespace - only edits . Prefer algebraic / logical rewrites

work page

[72] [72]

DO NOT substitute hypothesis values / terms into other hypotheses or the goal ( e . g . if`b = 30`, do not replace`b`with`30`)

work page

[73] [73]

Diversity requirement : - Include several SIMPLE rewrites that only use symmetry / commutativity style transformations

Do not weaken / strengthen the theorem or add junk conjuncts / disjuncts ( forbidden :`\ and True`,`\ iff False`, extra unrelated assumptions / goals ) . Diversity requirement : - Include several SIMPLE rewrites that only use symmetry / commutativity style transformations . - Examples of SIMPLE rewrites : *`a = b`->`b = a` *`a < b`->`b > a` *`2 * x = y`->...

work page

[74] [74]

Keep the same theorem name as in the input

work page

[75] [75]

3)`variable_map`maps old variable names→new names ( e

Do not change the conclusion or hypothesis ** types ** except where a name appears as a binder ( you may rename bound variables consistently ) . 3)`variable_map`maps old variable names→new names ( e . g .`x`→`u`) . 4)`hypothesis_map`maps old hypothesis names→new names ( e . g .`h`→`h1 `or`h0`)

work page

[76] [76]

6)`original`must be exactly the input theorem string ( after trimming )

If you rename nothing in a category , use an empty object`{}`for that map . 6)`original`must be exactly the input theorem string ( after trimming ) . 7)`renamed`must be the full theorem line ( s ) ending with`:= by`only - no`sorry`, tactics , or proof body after`by`. Output format : - Output ** only ** one JSON object , no markdown fences , no commentary ...

work page

[77] [77]

Keep theorem name and binder structure aligned with the original statement unless the augmented state requires changes

work page

[78] [78]

Preserve hypothesis order from the augmented state

work page

[79] [79]

Preserve the augmented state's mathematics exactly ( do not simplify away changes )

work page

[80] [80]

g .`\ sqrt2`->`Real

Replace shorthand notation with explicit Lean forms when appropriate ( e . g .`\ sqrt2`->`Real . sqrt 2`,`\pi`->`Real . pi`)

work page