pith. machine review for the scientific record.

arxiv: 2605.11008 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

· Lean Theorem

When and How to Canonize: A Generalization Perspective

Authors on Pith no claims yet

Pith reviewed 2026-05-13 00:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords canonization · generalization bounds · covering numbers · invariant models · point cloud processing · permutation groups · Hilbert curve · group averaging

The pith

Canonized models achieve generalization bounds that range from matching the best invariant architectures to matching non-invariant baselines, depending on the regularity of the canonization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a framework that bounds the covering numbers of function classes obtained by canonization, group averaging, and fully invariant design. These bounds place canonized models in a hierarchy where sufficiently regular canonizations match the tightest bounds of invariant and averaged models, while irregular ones match the looser bounds of ordinary non-invariant networks. The analysis is applied to point-cloud processing under permutations, where it proves that lexicographic sorting produces exponential growth in covering numbers with dimension whereas Hilbert-curve canonization produces only polynomial growth. This supplies the first rigorous explanation for why Hilbert-curve serialization works well in practice. The result matters because canonization is often easier to implement than full invariance, so knowing when it preserves good generalization guides method selection for symmetric data.

Core claim

We introduce a theoretical framework to analyze the generalization error of canonization, group averaging, and invariant architectures by bounding their covering numbers. We establish that the error bounds of canonized models are at best equal to those of structurally invariant and group-averaged models, and at worst equal to those of non-invariant baselines, with the position in this hierarchy determined by the regularity of the canonization map. For permutation groups, we prove that lexicographical sorting leads to exponential growth in covering numbers with dimension, while Hilbert curve canonization ensures polynomial growth.

What carries the argument

Covering-number bounds on the hypothesis classes induced by each symmetry-handling method, controlled by a regularity condition on the canonization map.
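Schematically, the claimed hierarchy can be written in terms of covering numbers. This is our compressed rendering of the abstract's claim, not a verbatim statement from the paper; here F_inv, F_avg, and F_c denote the invariant, group-averaged, and canonized classes:

```latex
% Our schematic rendering of the hierarchy (not the paper's exact statement).
% \mathcal{N}(\mathcal{F},\varepsilon) denotes the \varepsilon-covering number.
\mathcal{N}(\mathcal{F}_{\mathrm{inv}},\varepsilon)
  \;\approx\; \mathcal{N}(\mathcal{F}_{\mathrm{avg}},\varepsilon)
  \;\le\; \mathcal{N}(\mathcal{F}_c,\varepsilon)
  \;\le\; \mathcal{N}(\mathcal{F},\varepsilon),
\qquad
\mathcal{F}_c = \{\, f \circ c \;:\; f \in \mathcal{F} \,\}.
```

A sufficiently regular canonization c pushes N(F_c, ε) toward the left end of the chain; an irregular one leaves it at the right end, no better than the raw non-invariant class.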

If this is right

  • Optimal canonizations exist that attain the same generalization bounds as fully invariant and group-averaged models.
  • Irregular canonizations can achieve no better generalization than non-invariant baselines.
  • For permutation-invariant point-cloud tasks, Hilbert-curve canonization guarantees polynomial covering-number growth while lexicographic sorting guarantees exponential growth.
  • Regularity of a canonization map determines whether it attains the upper or lower end of the generalization hierarchy.
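The third bullet contrasts two concrete canonizations. Below is a minimal sketch of both for 2-D point clouds (our illustrative code, not the paper's implementation; the function names and the grid resolution `n=1024` are our own assumptions). Each map sorts the points by a key, so each is permutation-invariant by construction; by the paper's account, they differ only in the regularity of the induced ordering.

```python
import numpy as np

def hilbert_index(n, x, y):
    """Classic xy2d mapping: integer grid coords (x, y) in [0, n) x [0, n)
    to the 1-D index along the Hilbert curve (n must be a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def canonize_lex(cloud):
    """Lexicographic canonization: sort points by first, then second coordinate."""
    return cloud[np.lexsort((cloud[:, 1], cloud[:, 0]))]

def canonize_hilbert(cloud, n=1024):
    """Hilbert-curve canonization: sort points in [0, 1]^2 by their index
    on an n x n Hilbert grid (n is an illustrative resolution choice)."""
    grid = np.clip((cloud * n).astype(int), 0, n - 1)
    keys = [hilbert_index(n, int(x), int(y)) for x, y in grid]
    return cloud[np.argsort(keys, kind="stable")]

rng = np.random.default_rng(0)
cloud = rng.random((8, 2))
perm = rng.permutation(8)
# Both maps are canonizations: permuting the input points (the group
# action) leaves the output sequence unchanged.
assert np.allclose(canonize_lex(cloud), canonize_lex(cloud[perm]))
assert np.allclose(canonize_hilbert(cloud), canonize_hilbert(cloud[perm]))
```

The same construction extends to d > 2 by replacing `hilbert_index` with a d-dimensional Hilbert index; the paper's covering-number results concern that higher-dimensional regime.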

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The regularity condition could serve as a design criterion when creating canonization procedures for symmetries other than permutations.
  • The same covering-number comparison might be used to evaluate canonization against equivariant layers for rotation or scale symmetries in images or graphs.
  • If a chosen canonization is only partially regular, a hybrid architecture that adds limited invariance layers after canonization may recover some of the lost generalization performance.

Load-bearing premise

The covering-number bounds derived for the canonized function class translate directly into generalization guarantees, and the regularity condition on the canonization map can be verified or controlled in practice.
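The second half of this premise can at least be probed numerically. The sketch below (our illustrative probe, not a procedure from the paper) estimates the distortion of a canonization by comparing the distance between canonized clouds against the quotient metric, computed by brute force over permutations for tiny clouds; the ratio is always at least 1, and large values flag sort flips, i.e. points of irregularity:

```python
import itertools
import numpy as np

def canonize_lex(cloud):
    """Lexicographic canonization: sort rows by first, then second coordinate."""
    return cloud[np.lexsort((cloud[:, 1], cloud[:, 0]))]

def quotient_dist(A, B):
    """Quotient metric d_G([A],[B]) = min over permutations pi of ||A - B[pi]||_F.
    Brute force, so only viable for very small point clouds."""
    return min(np.linalg.norm(A - B[list(p)])
               for p in itertools.permutations(range(len(B))))

def empirical_distortion(canonize, rng, n_pts=3, trials=200, delta=1e-3):
    """Worst observed ratio ||c(A) - c(B)|| / d_G([A],[B]) over random
    nearby pairs: a crude numerical probe of a canonization's regularity."""
    worst = 0.0
    for _ in range(trials):
        A = rng.random((n_pts, 2))
        B = A + delta * rng.standard_normal((n_pts, 2))
        worst = max(worst, np.linalg.norm(canonize(A) - canonize(B))
                    / quotient_dist(A, B))
    return worst

rng = np.random.default_rng(1)
ratio = empirical_distortion(canonize_lex, rng)
# d_G is a minimum over matchings, so the ratio never drops below 1
# (up to floating-point rounding); spikes above 1 indicate irregularity.
assert ratio >= 1.0 - 1e-9
print(f"worst lex-sort distortion over nearby pairs: {ratio:.2f}")
```

The trial count, perturbation scale `delta`, and cloud size are arbitrary demo choices; a genuine verification would need the paper's precise regularity condition.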

What would settle it

A direct computation or experiment on a concrete canonization that produces a covering number lying outside the predicted range relative to the invariant and non-invariant baselines.

Figures

Figures reproduced from arXiv: 2605.11008 by Benjamin Friedman, Nadav Dym, Snir Hordan, Yonatan Sverdlov.

Figure 1
Figure 1. Visualization of three canonizations for the action of G = {−1, 1} on ℝ: the natural canonization c(x) = |x|, which is shown to be optimal (Example 3.3), and two discontinuous canonizations, which on an appropriate domain are poor (Example 3.5). view at source ↗
Figure 2
Figure 2. The Hilbert curves H_m for d = 2 and m = 1, 2, 3 induce an ordering on a grid in [0, 1]^d with 2^{md} points; on the right, the discontinuous lexicographic-sort ordering of the same grid. The Hilbert curve is a space-filling curve, mapping the unit interval [0, 1] continuously onto the hypercube [0, 1]^d; it is defined as a limit of curves H_m : [0, 1] → [0, 1]^d. view at source ↗
Figure 3
Figure 3. Rotation Symmetry Accuracy. The curves compare four approaches: Random Frame (blue), Frame Averaging (red), Skewness (orange), and the Pure PCA baseline (green). Frame Averaging achieves the strongest overall performance, while Random Frame and Skewness reach comparable final results, both significantly outperforming the Pure PCA baseline. view at source ↗
read the original abstract

While invariant architectures are standard for processing symmetric data, there is growing interest in achieving invariance by applying group averaging or canonization to non-invariant backbones. However, the theoretical generalization properties of these alternative strategies remain poorly understood. We introduce a theoretical framework to analyze the generalization error of these methods by bounding their covering numbers. We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group-averaged models, and at worst equal to the bounds of non-invariant baselines. Furthermore, we show that there exist optimal canonizations which attain the optimal error bounds, and poor canonizations which attain the non-invariant error bounds, and that this depends on the regularity of the canonization. Finally, applying this framework to permutation groups in point cloud processing, we rigorously prove that the covering number of lexicographical sorting grows exponentially with point cloud dimension, whereas Hilbert curve canonization guarantees polynomial growth. This provides the first formal theoretical justification for the empirical success of Hilbert curve serialization in state-of-the-art point cloud architectures. We conclude with experiments that support our theoretical claims. Code is available at https://github.com/yonatansverdlov/Canonization

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a covering-number-based framework to compare generalization bounds for canonized models (F_c = {f ∘ c | f ∈ F}), structurally invariant models, group-averaged models, and non-invariant baselines. It claims a rigorous hierarchy in which the covering number (and thus generalization bound) of a canonized class lies between the invariant and non-invariant extremes, with the position controlled by a regularity property of the canonization map c. The authors prove existence of optimal canonizations attaining the best bounds and poor ones attaining the worst, and apply the framework to permutation groups on point clouds, showing that lexicographic sorting yields exponential covering-number growth in dimension while Hilbert-curve canonization yields polynomial growth. Experiments and open-source code are provided to support the claims.

Significance. If the hierarchy and the transfer from canonization regularity to composed covering numbers hold, the work supplies the first formal justification for preferring certain canonization procedures (e.g., Hilbert curves) over others in symmetric-data settings and gives practitioners a concrete criterion for selecting canonization maps. The explicit comparison with group averaging and built-in invariance, together with the reproducible code, strengthens the contribution.

major comments (2)
  1. [§3] §3 (regularity definition and Theorem on covering-number interpolation): the claimed bound N(F_c, ε) ≤ N(F, ε) (the “at worst equal” direction) appears to rest on the regularity condition alone. It is not shown that regularity implies a uniform modulus of continuity sufficient to contract the ε-net size of the composed class; if regularity only guarantees continuity or a dimension-dependent Lipschitz constant, N(F_c, ε) can exceed N(F, ε), breaking the hierarchy.
  2. [§5] §5 (point-cloud application, lex-sort and Hilbert-curve theorems): the exponential/polynomial covering-number statements are proved for the canonization map c itself. No explicit composition lemma relating the modulus of c to the entropy of the neural-network class F is provided, so the transfer to generalization bounds for the full model class F_c is not immediate.
minor comments (2)
  1. [Notation and §3] The notation for covering numbers and the precise statement of the regularity condition should be cross-referenced to standard texts (e.g., Shalev-Shwartz & Ben-David) for clarity.
  2. [Experiments] Figure captions and experimental details could more explicitly link the plotted quantities back to the covering-number bounds derived in the theory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting these potential gaps in the presentation of our covering-number hierarchy. We address each major comment below and will revise the manuscript accordingly to strengthen the rigor of the arguments.

read point-by-point responses
  1. Referee: [§3] §3 (regularity definition and Theorem on covering-number interpolation): the claimed bound N(F_c, ε) ≤ N(F, ε) (the “at worst equal” direction) appears to rest on the regularity condition alone. It is not shown that regularity implies a uniform modulus of continuity sufficient to contract the ε-net size of the composed class; if regularity only guarantees continuity or a dimension-dependent Lipschitz constant, N(F_c, ε) can exceed N(F, ε), breaking the hierarchy.

    Authors: We agree that the current write-up of the interpolation theorem in §3 would benefit from an explicit supporting lemma. The regularity condition we introduce is defined via a uniform modulus of continuity (independent of dimension in the relevant regimes) that directly controls the distortion of ε-nets under composition with c. Nevertheless, we will add a short lemma that derives the contraction N(F_c, ε) ≤ N(F, ε) from this modulus using standard covering-number composition inequalities, thereby making the “at worst equal” direction fully rigorous. revision: yes

  2. Referee: [§5] §5 (point-cloud application, lex-sort and Hilbert-curve theorems): the exponential/polynomial covering-number statements are proved for the canonization map c itself. No explicit composition lemma relating the modulus of c to the entropy of the neural-network class F is provided, so the transfer to generalization bounds for the full model class F_c is not immediate.

    Authors: We acknowledge that an explicit composition step is missing from the current text. In the revised version we will insert a dedicated lemma (placed before the point-cloud theorems) that composes the entropy of c with the entropy of a standard neural-network class F (assumed to have bounded Lipschitz constant, as is common for generalization analyses). This lemma will show that polynomial growth of the covering number of c implies polynomial growth for F_c, while exponential growth for c yields exponential growth for F_c, thereby transferring the Hilbert-curve and lexicographic results directly to the generalization bounds of the composed model class. revision: yes
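To make the promised composition step concrete, one standard route runs through the metric entropy of the canonized domain. The chain below is our reconstruction under explicit Lipschitz assumptions, not the authors' lemma; the constants C, B and the exact form vary by source:

```latex
% Assumes each f in F is L_F-Lipschitz and bounded by B, and c is
% L_c-Lipschitz w.r.t. the quotient metric rho_G on X/G.
% Our reconstruction; not a statement from the paper.
\log \mathcal{N}\!\left(\mathcal{F}\circ c,\;\varepsilon,\;\|\cdot\|_{\infty}\right)
  \;\le\; \log \mathcal{N}\!\left(\mathcal{F},\;\varepsilon,\;\|\cdot\|_{\infty,\,c(\mathcal{X})}\right)
  \;\lesssim\; \mathcal{N}\!\left(c(\mathcal{X}),\tfrac{\varepsilon}{2L_F}\right)
       \,\log\!\big(\tfrac{C B}{\varepsilon}\big),
\qquad
\mathcal{N}\!\left(c(\mathcal{X}),\delta\right)
  \;\le\; \mathcal{N}\!\left(\mathcal{X}/G,\tfrac{\delta}{L_c}\right).
```

Under such a chain, polynomial metric entropy of the canonized domain (as claimed for Hilbert-curve canonization) yields polynomial log-covering numbers for the composed class, while an image with entropy exponential in d (as claimed for lexicographic sorting) forfeits the gain.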

Circularity Check

0 steps flagged

No circularity: standard covering-number bounds applied to composed canonized classes

full rationale

The paper derives its generalization hierarchy by applying classical covering-number arguments from statistical learning theory to the canonized hypothesis class F_c = {f ∘ c | f ∈ F}. The claimed ordering of error bounds follows directly from how a regularity condition on the canonization map c modulates the entropy of F_c relative to the structurally invariant and non-invariant baselines; this relation is established via explicit lemmas and growth-rate calculations (exponential for lexicographic sorting, polynomial for Hilbert curves) rather than by redefining the target quantities in terms of the bounds themselves. No parameters are fitted to the generalization claims, no self-citation chain supplies the central uniqueness or interpolation result, and the covering-number bounds are not tautological with the input function classes. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical learning theory for covering numbers together with the domain assumption that the data distribution is invariant under the group action; no free parameters or invented entities are introduced.

axioms (2)
  • standard math Covering numbers control generalization error in the standard way for supervised learning.
    The entire framework is built by bounding covering numbers of the canonized hypothesis class.
  • domain assumption The input distribution is invariant under the symmetry group (permutations for point clouds).
    Required for the notion of canonization to be meaningful and for the comparison with invariant models.
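The first axiom, written out in one common textbook form (our paraphrase for a bounded loss, not a statement from the paper; constants and the precise shape vary by source):

```latex
% One common covering-number uniform-convergence bound for a [0,1]-valued
% loss; our paraphrase of the textbook statement.
\sup_{f\in\mathcal{F}}\bigl|R(f)-\widehat{R}_n(f)\bigr|
  \;\le\; \inf_{\varepsilon>0}\left\{\, 2\varepsilon
  \;+\; \sqrt{\frac{\log \mathcal{N}(\mathcal{F},\varepsilon)+\log(2/\delta)}{2n}}\,\right\}
\quad\text{with probability at least } 1-\delta.
```

Because the bound is monotone in log N(F, ε), the covering-number hierarchy transfers directly to a hierarchy of generalization guarantees.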

pith-pipeline@v0.9.0 · 5517 in / 1424 out tokens · 62756 ms · 2026-05-13T00:49:43.105241+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group-averaged models, and at worst equal to the bounds of non-invariant baselines... by bounding their covering numbers.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    Randomized SMILES strings improve the quality of molecular generative models

    Josep Arús-Pous, Simon V. Johansson, Oleksii Prykhodko, Esben J. Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen, and Ola Engkvist. Randomized SMILES strings improve the quality of molecular generative models. Journal of Cheminformatics, 11(1):71, 2019

  2. [2]

    An explicit frame construction for normalizing 3D point clouds

    Justin Baker, Shih-Hsin Wang, Tommaso De Fernex, and Bao Wang. An explicit frame construction for normalizing 3D point clouds. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings...

  3. [3]

    SMILES enumeration as data augmentation for neural network modeling of molecules

    Esben Jannik Bjerrum. SMILES enumeration as data augmentation for neural network modeling of molecules. ArXiv, abs/1703.07076, 2017

  4. [4]

    One-dimensional empirical measures, order statistics, and Kantorovich transport distances

    Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances, volume 261. American Mathematical Society, 2019

  5. [5]

    Does equivariance matter at scale?, 2025

    Johann Brehmer, Sönke Behrends, Pim de Haan, and Taco Cohen. Does equivariance matter at scale?, 2025

  6. [6]

    Efficient point cloud analysis using Hilbert curve

    Wanli Chen, Xinge Zhu, Guojin Chen, and Bei Yu. Efficient point cloud analysis using Hilbert curve. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 730–747, Cham, 2022. Springer Nature Switzerland

  7. [7]

    Equivariant frames and the impossibility of continuous canonicalization

    Nadav Dym, Hannah Lawrence, and Jonathan W Siegel. Equivariant frames and the impossibility of continuous canonicalization. In International Conference on Machine Learning, pages 12228–12267. PMLR, 2024

  8. [8]

    Provably strict generalisation benefit for invariance in kernel methods

    Bryn Elesedy. Provably strict generalisation benefit for invariance in kernel methods. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021

  9. [9]

    Weisfeiler-leman at the margin: when more expressivity matters

    Billy J Franks, Christopher Morris, Ameya Velingker, and Floris Geerts. Weisfeiler-leman at the margin: when more expressivity matters. In Proceedings of the 41st International Conference on Machine Learning, pages 13885–13926, 2024

  10. [10]

    Canonnet: Canonical ordering and curvature learning for point cloud analysis, 2025

    Benjy Friedmann and Michael Werman. Canonnet: Canonical ordering and curvature learning for point cloud analysis, 2025

  11. [11]

    Extensible grids: uniform sampling on a space filling curve

    Zhijian He and Art B Owen. Extensible grids: uniform sampling on a space filling curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4):917–931, 2016

  12. [12]

    Spectral graph neural networks are incomplete on graphs with a simple spectrum

    Snir Hordan, Maya Bechler-Speicher, Gur Lifshitz, and Nadav Dym. Spectral graph neural networks are incomplete on graphs with a simple spectrum. arXiv preprint arXiv:2506.05530, 2025

  13. [13]

    Random search neural networks for efficient and expressive graph learning

    Michael Ito, Danai Koutra, and Jenna Wiens. Random search neural networks for efficient and expressive graph learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  14. [14]

    Equivariance with learned canonicalization functions

    Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ravanbakhsh. Equivariance with learned canonicalization functions. In International Conference on Machine Learning, pages 15546–15566. PMLR, 2023

  15. [15]

    An empirical evaluation of deep architectures on problems with many factors of variation

    Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pages 473–480, 2007

  16. [16]

    Adaptive canonicalization with application to invariant anisotropic geometric networks

    Ya-Wei Eileen Lin and Ron Levie. Adaptive canonicalization with application to invariant anisotropic geometric networks. In The Fourteenth International Conference on Learning Representations, 2026

  17. [17]

    Equivariance via minimal frame averaging for more symmetries and efficiency

    Yuchao Lin, Jacob Helwig, Shurui Gui, and Shuiwang Ji. Equivariance via minimal frame averaging for more symmetries and efficiency. In Forty-first International Conference on Machine Learning, 2024

  18. [18]

    A canonicalization perspective on invariant and equivariant learning

    George Ma, Yifei Wang, Derek Lim, Stefanie Jegelka, and Yisen Wang. A canonicalization perspective on invariant and equivariant learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  19. [19]

    A canonicalization perspective on invariant and equivariant learning

    George Ma, Yifei Wang, Derek Lim, Stefanie Jegelka, and Yisen Wang. A canonicalization perspective on invariant and equivariant learning. Advances in Neural Information Processing Systems, 37:60936–60979, 2024

  20. [20]

    Laplacian canonization: A minimalist approach to sign and basis invariant spectral embedding

    George Ma, Yifei Wang, and Yisen Wang. Laplacian canonization: A minimalist approach to sign and basis invariant spectral embedding. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  21. [21]

    Generalization bounds for message passing networks on mixture of graphons

    Sohir Maskey, Gitta Kutyniok, and Ron Levie. Generalization bounds for message passing networks on mixture of graphons. SIAM Journal on Mathematics of Data Science, 7(2):802–825, 2025

  22. [22]

    Approximation-generalization trade-offs under (approximate) group equivariance

    Mircea Petrache and Shubhendu Trivedi. Approximation-generalization trade-offs under (approximate) group equivariance. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  23. [23]

    Frame averaging for invariant and equivariant network design

    Omri Puny, Matan Atzmon, Heli Ben-Hamu, Ishan Misra, Aditya Grover, Edward J Smith, and Yaron Lipman. Frame averaging for invariant and equivariant network design. arXiv preprint arXiv:2110.03336, 2021

  24. [24]

    Quantitative approximation rates for group equivariant learning

    Jonathan W. Siegel, Snir Hordan, Hannah Lawrence, Ali Syed, and Nadav Dym. Quantitative approximation rates for group equivariant learning, 2026

  25. [25]

    Enumerative combinatorics, volume 1, second edition

    Richard P Stanley. Enumerative combinatorics, volume 1, second edition. Cambridge studies in advanced mathematics, 2011

  26. [26]

    Generalization bounds for canonicalization: A comparative study with group averaging

    Behrooz Tahmasebi and Stefanie Jegelka. Generalization bounds for canonicalization: A comparative study with group averaging. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Regularity in canonicalized models: A theoretical perspective

    Behrooz Tahmasebi and Stefanie Jegelka. Regularity in canonicalized models: A theoretical perspective. In International Conference on Artificial Intelligence and Statistics, pages 4789–

  28. [28]

    Survey on generalization theory for graph neural networks

    Antonis Vasileiou, Stefanie Jegelka, Ron Levie, and Christopher Morris. Survey on generalization theory for graph neural networks. arXiv preprint arXiv:2503.15650, 2025

  29. [29]

    SMILES. 2. Algorithm for generation of unique SMILES notation

    David Weininger, Arthur Weininger, and Joseph L. Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci., 29:97–101, 1989

  30. [30]

    Point transformer v3: Simpler, faster, stronger

    Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler, faster, stronger. In CVPR, 2024

  31. [31]

    3D ShapeNets: A deep representation for volumetric shapes

    Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015

  32. [32]

    Robustness and generalization

    Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012

  33. [33]

    Deep sets

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017

  34. [34]

    Rethinking diffusion models with symmetries through canonicalization with applications to molecular graph generation, 2026

    Cai Zhou, Zijie Chen, Zian Li, Jike Wang, Kaiyi Jiang, Pan Li, Rose Yu, Muhan Zhang, Stephen Bates, and Tommi Jaakkola. Rethinking diffusion models with symmetries through canonicalization with applications to molecular graph generation, 2026
