Learning the symmetric group: large from small
Pith reviewed 2026-05-23 02:45 UTC · model grok-4.3
The pith
A transformer trained on S10 transpositions predicts permutations in S25 at near 100 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A transformer neural-network trained on predicting permutations from words formed by general transpositions in the symmetric group S10 can generalize to the symmetric group S25 with near 100% accuracy. The same model trained on adjacent transpositions in S10 also generalizes to S16. Identity augmentation manages variable word lengths, and partitioned windows handle adjacent-transposition training.
What carries the argument
Transformer neural network equipped with identity augmentation for variable-length transposition words and partitioned windows for adjacent-transposition cases.
If this is right
- Training data need only be generated for the smaller symmetric group even when the target task is much larger.
- The same scaling pattern applies when restricting generators to adjacent transpositions, at least up to S16.
- Variable-length input handling via identity augmentation is sufficient to transfer the learned mapping across group orders.
- The method avoids explicit construction of the full Cayley table or multiplication rules for the larger group.
Where Pith is reading between the lines
- The result suggests that local transposition relations may be enough to induce global group structure in a neural representation.
- The same training regime could be tested on other finitely generated groups whose small and large instances differ mainly in order.
- If the pattern holds, data-generation cost for algebraic machine-learning tasks can be bounded by the cost of the smallest sufficient example rather than the target size.
Load-bearing premise
The statistical patterns learned from transposition words in S10 already contain enough information about the full group multiplication table to succeed on S25 without any larger-group examples.
What would settle it
Train the transformer on S10 general-transposition words and measure its accuracy on a large held-out collection of S25 words; accuracy substantially below 100 percent would falsify the generalization claim.
Figures
read the original abstract
Machine learning explorations can make significant inroads into solving difficult problems in pure mathematics. One advantage of this approach is that mathematical datasets do not suffer from noise, but a challenge is the amount of data required to train these models and that this data can be computationally expensive to generate. Key challenges further comprise difficulty in a posteriori interpretation of statistical models and the implementation of deep and abstract mathematical problems. We propose a method for scalable tasks, by which models trained on simpler versions of a task can then generalize to the full task. Specifically, we demonstrate that a transformer neural-network trained on predicting permutations from words formed by general transpositions in the symmetric group $S_{10}$ can generalize to the symmetric group $S_{25}$ with near 100\% accuracy. We also show that $S_{10}$ generalizes to $S_{16}$ with similar performance if we only use adjacent transpositions. We employ identity augmentation as a key tool to manage variable word lengths, and partitioned windows for training on adjacent transpositions. Finally we compare variations of the method used and discuss potential challenges with extending the method to other tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a transformer neural network trained to predict permutations from words generated by general transpositions in the symmetric group S_10 generalizes to S_25 with near-100% accuracy; a similar result holds for adjacent transpositions from S_10 to S_16. The approach relies on identity augmentation to handle variable word lengths and partitioned windows for adjacent-transposition training.
Significance. If the reported generalization is shown to be robust, the result would demonstrate a practical route to scaling machine-learning methods on algebraic structures without generating full-size datasets, which is a concrete strength for computational group theory.
major comments (2)
- [Abstract] Abstract: the central generalization claim from S_10 to S_25 is load-bearing on the handling of transposition labels 11–25, which never appear in training. The manuscript must specify the tokenization and embedding strategy (per-token learned embeddings versus a fixed function of the integer value) because standard transformer embeddings would leave these tokens untrained.
- [Abstract] The empirical support for the accuracy figures cannot be assessed because the text supplies no information on training-set size, model depth/width, validation split, or any control that would rule out leakage between the S_10 training distribution and the S_25 test distribution.
minor comments (1)
- The description of identity augmentation and partitioned windows would benefit from a short pseudocode or explicit example showing how variable-length inputs are processed.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive feedback on our manuscript. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central generalization claim from S_10 to S_25 is load-bearing on the handling of transposition labels 11–25, which never appear in training. The manuscript must specify the tokenization and embedding strategy (per-token learned embeddings versus a fixed function of the integer value) because standard transformer embeddings would leave these tokens untrained.
Authors: We agree that the embedding strategy for unseen transposition labels is central to the generalization result and must be stated explicitly. The current abstract does not provide this detail. We will revise the abstract (and add a short methods paragraph) to specify the tokenization and embedding approach used, including whether embeddings are per-token learned or computed via a fixed function of the integer label value. revision: yes
-
Referee: [Abstract] The empirical support for the accuracy figures cannot be assessed because the text supplies no information on training-set size, model depth/width, validation split, or any control that would rule out leakage between the S_10 training distribution and the S_25 test distribution.
Authors: We agree that the abstract (and main text) lacks the experimental hyperparameters and controls needed to evaluate the reported accuracies. We will revise the manuscript to report training-set size, model depth and width, validation split, and any measures taken to prevent leakage between the S_10 training distribution and the S_25 test distribution. revision: yes
Circularity Check
No circularity: purely empirical ML result with no derivation chain
full rationale
The paper reports an empirical outcome of training a transformer on transposition words in S10 and measuring accuracy on S25 test data. No equations, derivations, or first-principles claims are present; the generalization performance is obtained by direct experiment rather than by any analytical reduction. The skeptic concern about token embeddings for unseen labels 11-25 is a question of experimental validity, not a circularity in any claimed derivation. No self-citations, fitted inputs renamed as predictions, or ansatzes appear in the load-bearing steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Transformer architectures can represent the mapping from transposition words to permutations when trained on sufficiently representative data from smaller groups.
Reference graph
Works this paper leans on
-
[1]
V. Bafna and P. Pevzner,Sorting by Transpositions, SIAM J. Discr. Math.,11, 224–240, (1998)
work page 1998
-
[2]
D. Bayer and P. Diaconis,Trailing the Dovetail Shuffle to Its Lair, Annals of Applied Probability,2, 294–313, (1992). MR1161056
work page 1992
-
[3]
L. Bulteau, G. Fertin and I. Rusu,Sorting by transpositions is difficult, SIAM J. Discr. Math.,26, 1148–1180, (2012)
work page 2012
-
[4]
Charton,Linear algebra with transformers, TMLR October 2022, (2022),arXiv:2112.01898
F. Charton,Linear algebra with transformers, TMLR October 2022, (2022),arXiv:2112.01898
-
[5]
F. Charton,What is my math transformer doing? Three results on inter- pretability and generalization, 2nd Workshop on Mathematical Reason- ing and AI at NeurIPS’22,arXiv:2211.00170. i i “symmetric-group-resub” — 2025/9/19 — 0:43 — page 13 — #13 i i i i i i Learning the symmetric group: large from small 13
-
[6]
F. Charton, J.S. Ellenberg, A.Z. Wagner and G. Williamson,Pat- ternBoost: Constructions in Mathematics with a Little Help from AI, arXiv:2411.00566
-
[7]
B. Chughtai, L. Chan and N. Nanda,A Toy model of univer- sality: Reverse engineering how networks learn group operations, ICML23: Proc. 40th Int. Conf. on ML, article 248, 6243–6267, (2023), arXiv:2302.03025
-
[8]
N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Gan- guli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCan- dlish, and C. Olah,A Mathematical Framework for Transformer Cir- cuits, (2021)
work page 2021
-
[9]
A. Garsia,The saga of reduced factorizations of elements of the symmetric group, Publications du Laboratoire de Combinatoire et d’Informatique Math´ ematique,29, (2002)
work page 2002
- [10]
- [11]
-
[12]
H. Lacoin,Mixing Time and Cutoff for the Adjacent Transposition Shuf- fle and the Simple Exclusion, The Annals of Probability,44(2), 1426– 1487, (2016)
work page 2016
-
[13]
Progress measures for grokking via mechanistic interpretability
N. Nanda, L. Chan, T. Lieberum, J. Smith and J. Steinhardt,Progress measures for grokking via mechanistic interpretability, ICLR 2023 Spot- light, 342, (2023),arXiv:2301.05217
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Investigating the limitations of transformers with simple arithmetic tasks
R. Nogueira, Z. Jiang, and J. Lin,Investigating the limitations of trans- formers with simple arithmetic tasks, (2021),arXiv:2102.13019
-
[15]
C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, Zoom In: An Introduction to Circuits, Distill,5(3), e00024.001, (2020) https://distill.pub/2020/circuits/zoom-in/
work page 2020
-
[16]
C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, i i “symmetric-group-resub” — 2025/9/19 — 0:43 — page 14 — #14 i i i i i i 14 Max Petschack, Alexandr Garbali and Jan de Gier Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, ...
work page 2025
-
[17]
M. Petschack,Permutations. https://github.com/Midataur/permutations. Accessed: 2025-02-13
work page 2025
-
[18]
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, (2022)arXiv:2201.02177
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
P. Quirke and F. Barez,Understanding Addition in Transformers, ICLR 2024
work page 2024
-
[20]
D. Stander, Q. Yu, H. Fan and S. Biderman,Grokking group multipli- cation with cosets,arXiv:2312.06581
- [21]
- [22]
- [23]
-
[24]
Z. Zhong, Z. Liu, M. Tegmark, and J. Andreas,The Clock and the Pizza: Two stories in mechanistic explanation of neural networks, (2023). arXiv:2306.17844. Appendix A. Appendix: Technical information A.1. Model hyperparameters All runs were trained using AdamW and cross entropy loss. The model is a standard transformer architecture with masked multi-headed...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.