pith. sign in

arxiv: 2502.12717 · v2 · submitted 2025-02-18 · 💻 cs.LG · math.CO· math.RT

Learning the symmetric group: large from small

Pith reviewed 2026-05-23 02:45 UTC · model grok-4.3

classification 💻 cs.LG math.COmath.RT
keywords symmetric grouptransformergeneralizationpermutationstranspositionsmachine learninggroup theoryscalable training
0
0 comments X

The pith

A transformer trained on S10 transpositions predicts permutations in S25 at near 100 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a neural network can learn the structure of the symmetric group by training exclusively on smaller instances and then applying the learned patterns to much larger ones. A transformer is trained to map words of general transpositions to the resulting permutation in S10 and succeeds on the same task in S25. The approach uses identity augmentation to normalize variable word lengths and partitioned windows when restricting to adjacent transpositions, which also allows generalization from S10 to S16. If the method works, it reduces the need to generate expensive training data directly from the full problem size. A sympathetic reader cares because it offers a route to scaling machine learning on algebraic structures whose explicit data sets grow factorially.

Core claim

A transformer neural-network trained on predicting permutations from words formed by general transpositions in the symmetric group S10 can generalize to the symmetric group S25 with near 100% accuracy. The same model trained on adjacent transpositions in S10 also generalizes to S16. Identity augmentation manages variable word lengths, and partitioned windows handle adjacent-transposition training.

What carries the argument

Transformer neural network equipped with identity augmentation for variable-length transposition words and partitioned windows for adjacent-transposition cases.

If this is right

  • Training data need only be generated for the smaller symmetric group even when the target task is much larger.
  • The same scaling pattern applies when restricting generators to adjacent transpositions, at least up to S16.
  • Variable-length input handling via identity augmentation is sufficient to transfer the learned mapping across group orders.
  • The method avoids explicit construction of the full Cayley table or multiplication rules for the larger group.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that local transposition relations may be enough to induce global group structure in a neural representation.
  • The same training regime could be tested on other finitely generated groups whose small and large instances differ mainly in order.
  • If the pattern holds, data-generation cost for algebraic machine-learning tasks can be bounded by the cost of the smallest sufficient example rather than the target size.

Load-bearing premise

The statistical patterns learned from transposition words in S10 already contain enough information about the full group multiplication table to succeed on S25 without any larger-group examples.

What would settle it

Train the transformer on S10 general-transposition words and measure its accuracy on a large held-out collection of S25 words; accuracy substantially below 100 percent would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2502.12717 by Alexandr Garbali, Jan de Gier, Max Petschack.

Figure 1
Figure 1. Figure 1: Transformer architecture used in this project [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Error and loss for training and validation using general transpo [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Single token error and loss for training and validation using adja [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Embedding self-similarity heatmaps for the case of general trans [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Embedding self-similarity heatmaps for the case of adjacent trans [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

Machine learning explorations can make significant inroads into solving difficult problems in pure mathematics. One advantage of this approach is that mathematical datasets do not suffer from noise, but a challenge is the amount of data required to train these models and that this data can be computationally expensive to generate. Key challenges further comprise difficulty in a posteriori interpretation of statistical models and the implementation of deep and abstract mathematical problems. We propose a method for scalable tasks, by which models trained on simpler versions of a task can then generalize to the full task. Specifically, we demonstrate that a transformer neural-network trained on predicting permutations from words formed by general transpositions in the symmetric group $S_{10}$ can generalize to the symmetric group $S_{25}$ with near 100\% accuracy. We also show that $S_{10}$ generalizes to $S_{16}$ with similar performance if we only use adjacent transpositions. We employ identity augmentation as a key tool to manage variable word lengths, and partitioned windows for training on adjacent transpositions. Finally we compare variations of the method used and discuss potential challenges with extending the method to other tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that a transformer neural network trained to predict permutations from words generated by general transpositions in the symmetric group S_10 generalizes to S_25 with near-100% accuracy; a similar result holds for adjacent transpositions from S_10 to S_16. The approach relies on identity augmentation to handle variable word lengths and partitioned windows for adjacent-transposition training.

Significance. If the reported generalization is shown to be robust, the result would demonstrate a practical route to scaling machine-learning methods on algebraic structures without generating full-size datasets, which is a concrete strength for computational group theory.

major comments (2)
  1. [Abstract] Abstract: the central generalization claim from S_10 to S_25 is load-bearing on the handling of transposition labels 11–25, which never appear in training. The manuscript must specify the tokenization and embedding strategy (per-token learned embeddings versus a fixed function of the integer value) because standard transformer embeddings would leave these tokens untrained.
  2. [Abstract] The empirical support for the accuracy figures cannot be assessed because the text supplies no information on training-set size, model depth/width, validation split, or any control that would rule out leakage between the S_10 training distribution and the S_25 test distribution.
minor comments (1)
  1. The description of identity augmentation and partitioned windows would benefit from a short pseudocode or explicit example showing how variable-length inputs are processed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback on our manuscript. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central generalization claim from S_10 to S_25 is load-bearing on the handling of transposition labels 11–25, which never appear in training. The manuscript must specify the tokenization and embedding strategy (per-token learned embeddings versus a fixed function of the integer value) because standard transformer embeddings would leave these tokens untrained.

    Authors: We agree that the embedding strategy for unseen transposition labels is central to the generalization result and must be stated explicitly. The current abstract does not provide this detail. We will revise the abstract (and add a short methods paragraph) to specify the tokenization and embedding approach used, including whether embeddings are per-token learned or computed via a fixed function of the integer label value. revision: yes

  2. Referee: [Abstract] The empirical support for the accuracy figures cannot be assessed because the text supplies no information on training-set size, model depth/width, validation split, or any control that would rule out leakage between the S_10 training distribution and the S_25 test distribution.

    Authors: We agree that the abstract (and main text) lacks the experimental hyperparameters and controls needed to evaluate the reported accuracies. We will revise the manuscript to report training-set size, model depth and width, validation split, and any measures taken to prevent leakage between the S_10 training distribution and the S_25 test distribution. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ML result with no derivation chain

full rationale

The paper reports an empirical outcome of training a transformer on transposition words in S10 and measuring accuracy on S25 test data. No equations, derivations, or first-principles claims are present; the generalization performance is obtained by direct experiment rather than by any analytical reduction. The skeptic concern about token embeddings for unseen labels 11-25 is a question of experimental validity, not a circularity in any claimed derivation. No self-citations, fitted inputs renamed as predictions, or ansatzes appear in the load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard assumptions of neural-network expressivity for sequence-to-sequence tasks on algebraic data.

axioms (1)
  • domain assumption Transformer architectures can represent the mapping from transposition words to permutations when trained on sufficiently representative data from smaller groups.
    Implicit premise required for the generalization claim to be plausible.

pith-pipeline@v0.9.0 · 5730 in / 1281 out tokens · 34606 ms · 2026-05-23T02:45:12.482864+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

  1. [1]

    Bafna and P

    V. Bafna and P. Pevzner,Sorting by Transpositions, SIAM J. Discr. Math.,11, 224–240, (1998)

  2. [2]

    Bayer and P

    D. Bayer and P. Diaconis,Trailing the Dovetail Shuffle to Its Lair, Annals of Applied Probability,2, 294–313, (1992). MR1161056

  3. [3]

    Bulteau, G

    L. Bulteau, G. Fertin and I. Rusu,Sorting by transpositions is difficult, SIAM J. Discr. Math.,26, 1148–1180, (2012)

  4. [4]

    Charton,Linear algebra with transformers, TMLR October 2022, (2022),arXiv:2112.01898

    F. Charton,Linear algebra with transformers, TMLR October 2022, (2022),arXiv:2112.01898

  5. [5]

    symmetric-group-resub

    F. Charton,What is my math transformer doing? Three results on inter- pretability and generalization, 2nd Workshop on Mathematical Reason- ing and AI at NeurIPS’22,arXiv:2211.00170. i i “symmetric-group-resub” — 2025/9/19 — 0:43 — page 13 — #13 i i i i i i Learning the symmetric group: large from small 13

  6. [6]

    Charton, J.S

    F. Charton, J.S. Ellenberg, A.Z. Wagner and G. Williamson,Pat- ternBoost: Constructions in Mathematics with a Little Help from AI, arXiv:2411.00566

  7. [7]

    Chughtai, L

    B. Chughtai, L. Chan and N. Nanda,A Toy model of univer- sality: Reverse engineering how networks learn group operations, ICML23: Proc. 40th Int. Conf. on ML, article 248, 6243–6267, (2023), arXiv:2302.03025

  8. [8]

    Elhage, N

    N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Gan- guli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCan- dlish, and C. Olah,A Mathematical Framework for Transformer Cir- cuits, (2021)

  9. [9]

    Garsia,The saga of reduced factorizations of elements of the symmetric group, Publications du Laboratoire de Combinatoire et d’Informatique Math´ ematique,29, (2002)

    A. Garsia,The saga of reduced factorizations of elements of the symmetric group, Publications du Laboratoire de Combinatoire et d’Informatique Math´ ematique,29, (2002)

  10. [10]

    Gross, R

    J. Gross, R. Agrawal, T. Kwa, E. Ong, C. H. Yip, A. Gibson, S. Noubir, and L. Chan,Compact Proofs of Model Performance via Mechanistic Interpretability, ICML 2024 Workshop on Mechanistic Interpretability, (2024), arxiv.2406.11779

  11. [11]

    Gukov, J

    S. Gukov, J. Halverson, F. Ruehle and P. Su lkowski,Learning to unknot, Mach. Learn.: Sci. Technol,2, 025035, (2021),arXiv:2010.16263

  12. [12]

    Lacoin,Mixing Time and Cutoff for the Adjacent Transposition Shuf- fle and the Simple Exclusion, The Annals of Probability,44(2), 1426– 1487, (2016)

    H. Lacoin,Mixing Time and Cutoff for the Adjacent Transposition Shuf- fle and the Simple Exclusion, The Annals of Probability,44(2), 1426– 1487, (2016)

  13. [13]

    Progress measures for grokking via mechanistic interpretability

    N. Nanda, L. Chan, T. Lieberum, J. Smith and J. Steinhardt,Progress measures for grokking via mechanistic interpretability, ICLR 2023 Spot- light, 342, (2023),arXiv:2301.05217

  14. [14]

    Investigating the limitations of transformers with simple arithmetic tasks

    R. Nogueira, Z. Jiang, and J. Lin,Investigating the limitations of trans- formers with simple arithmetic tasks, (2021),arXiv:2102.13019

  15. [15]

    C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, Zoom In: An Introduction to Circuits, Distill,5(3), e00024.001, (2020) https://distill.pub/2020/circuits/zoom-in/

  16. [16]

    symmetric-group-resub

    C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, i i “symmetric-group-resub” — 2025/9/19 — 0:43 — page 14 — #14 i i i i i i 14 Max Petschack, Alexandr Garbali and Jan de Gier Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, ...

  17. [17]

    Petschack,Permutations

    M. Petschack,Permutations. https://github.com/Midataur/permutations. Accessed: 2025-02-13

  18. [18]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, (2022)arXiv:2201.02177

  19. [19]

    Quirke and F

    P. Quirke and F. Barez,Understanding Addition in Transformers, ICLR 2024

  20. [20]

    Stander, Q

    D. Stander, Q. Yu, H. Fan and S. Biderman,Grokking group multipli- cation with cosets,arXiv:2312.06581

  21. [21]

    2020 Conf

    Y.-A Wang and Y.-N Chen,What do position embeddings learn? An empirical study of pre-trained language model positional encoding, Proc. 2020 Conf. Emp. Meth. Nat. Lang. Proc., (2020), 6840–6849. https://arxiv.org/abs/2010.04903

  22. [22]

    W. Wu, L. Jaburi, J. Drori, and J. Gross,Unifying and Verifying Mech- anistic Interpretations: A Case Study with Group Operations, (2024) arXiv:2410.07476

  23. [23]

    S. D. Zhang, C. Tigges, S. Biderman, M. Raginsky, and T. Ringer, Can Transformers learn to solve problems recursively?, (2023). arXiv:2305.14699

  24. [24]

    symmetric-group-resub

    Z. Zhong, Z. Liu, M. Tegmark, and J. Andreas,The Clock and the Pizza: Two stories in mechanistic explanation of neural networks, (2023). arXiv:2306.17844. Appendix A. Appendix: Technical information A.1. Model hyperparameters All runs were trained using AdamW and cross entropy loss. The model is a standard transformer architecture with masked multi-headed...