Neural Networks Provably Learn Spectral Representations for Group Composition

Fengzhuo Zhang; Jianliang He; Leda Wang; Siyu Chen; Zhuoran Yang

arXiv: 2606.02993 · v1 · pith:QBDFITGWnew · submitted 2026-06-02 · cs.LG · math.OC · math.RT · math.ST · stat.ML · stat.TH

Pith reviewed 2026-06-28 11:44 UTC · model grok-4.3

keywords: neural networks · group representations · Fourier analysis · feature learning · representation theory · gradient flow · spectral methods · group composition

The pith

Lifting gradient flow to the Fourier domain makes each neuron in a two-layer network converge to one irreducible group representation on composition tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how a two-layer neural network learns to compute the product of two elements from a finite group. By moving the training dynamics into the Fourier domain the process becomes a Riemannian gradient ascent on an energy that depends on the group's representations. Under random starts this ascent makes each neuron settle almost surely on a single irreducible representation while the coefficients linking layers line up in a rank-one rotational pattern. The same mechanism produces a low-rank compression of the matrix representations and, when the group is Abelian, yields uniform coverage of the nontrivial representations together with uniform phases that approximate the group operation by majority vote.
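As a concrete picture of the task, the sketch below builds the composition table for G = Z3 ⊕ Z5 (the group that appears in the paper's figures) together with every input pair and its target. The one-hot encoding and the names `compose`, `cayley_table`, and `one_hot_dataset` are assumptions made for illustration; this page does not state how the paper encodes its inputs.

```python
import itertools
import numpy as np

# Assumed example group: Z3 (+) Z5, elements are pairs added componentwise.
MODULI = (3, 5)
ELEMENTS = list(itertools.product(*(range(m) for m in MODULI)))
INDEX = {g: i for i, g in enumerate(ELEMENTS)}


def compose(g, h):
    """Group operation on Z3 (+) Z5: componentwise modular addition."""
    return tuple((x + y) % m for x, y, m in zip(g, h, MODULI))


def cayley_table():
    """Return table[i, j] = index of ELEMENTS[i] composed with ELEMENTS[j]."""
    n = len(ELEMENTS)
    table = np.empty((n, n), dtype=np.int64)
    for i, g in enumerate(ELEMENTS):
        for j, h in enumerate(ELEMENTS):
            table[i, j] = INDEX[compose(g, h)]
    return table


def one_hot_dataset(table):
    """All |G|^2 input pairs as one-hot vectors (an assumed encoding) plus targets."""
    n = table.shape[0]
    i, j = np.divmod(np.arange(n * n), n)  # pair number i * n + j
    eye = np.eye(n)
    return eye[i], eye[j], table[i, j]


TABLE = cayley_table()
X1, X2, Y = one_hot_dataset(TABLE)
```

Each row of `TABLE` is a permutation of the 15 element indices, which is the Latin-square property of any group's composition table.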

Core claim

Lifting the projected gradient flow to the Fourier domain shows that training is governed by Riemannian gradient ascent on a representation-theoretic energy functional. Under random initialization this flow drives each neuron to converge almost surely toward a single irreducible representation, while the cross-layer Fourier coefficients achieve a rotational rank-one alignment. The same account explains feature learning and produces a low-rank compression phenomenon for matrix-valued group representations. For Abelian groups random initialization promotes uniform diversification across nontrivial representations and induces Haar-uniform phases that jointly approximate the indicator via a majority-vote mechanism.

What carries the argument

The Fourier-domain lifting of the projected gradient flow, which converts the original dynamics into Riemannian gradient ascent on a representation-theoretic energy functional.

If this is right

Each neuron converges almost surely to a single irreducible representation of the group.
Cross-layer Fourier coefficients achieve rotational rank-one alignment.
A low-rank compression occurs for the matrix-valued group representations.
For Abelian groups the process produces uniform diversification across nontrivial representations together with Haar-uniform phases.
Both phase alignment and representation competition converge at exponential rates and the group indicator is recovered by majority vote.
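The majority-vote recovery in the last item can be illustrated in the simplest Abelian case, the cyclic group Z_n. The sketch below is a simplification and not the paper's construction: it uses each nontrivial frequency exactly once with zero phase, whereas the paper describes random Haar-uniform phases spread across neurons. The function name `vote_scores` is hypothetical. By orthogonality of characters, the summed cosines equal n - 1 at the correct answer and -1 at every other candidate.

```python
import numpy as np


def vote_scores(x, y, n):
    """Score each candidate z in Z_n by summing cos(2*pi*k*(x + y - z)/n) over k = 1..n-1."""
    k = np.arange(1, n)[:, None]  # nontrivial frequencies, shape (n-1, 1)
    z = np.arange(n)[None, :]     # candidate outputs, shape (1, n)
    return np.cos(2 * np.pi * k * (x + y - z) / n).sum(axis=0)


n = 7
scores = vote_scores(3, 5, n)        # 3 + 5 = 8, which is 1 mod 7
prediction = int(np.argmax(scores))  # index of the winning candidate
```

Every frequency casts a bounded vote for each candidate, and only the true sum receives constructive interference from all of them at once.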

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same Fourier-lifting technique could be applied to other algebraic structures to predict which features networks will discover.
Networks trained on data with hidden group symmetry may exhibit the same neuron-to-irrep alignment, offering a diagnostic for internal representations.
The low-rank compression suggests that group-equivariant layers could be parameterized more efficiently by retaining only the dominant Fourier modes.
Numerical checks on small groups would directly test whether the predicted rank-one alignment appears in practice.
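One possible form for such a numerical check is a spectral-concentration score on each Fourier block. The sketch below is an assumption: the paper's own alignment metric is not given on this page, and `rank_one_score` is a hypothetical helper that reports the share of squared Frobenius norm carried by the top singular value.

```python
import numpy as np


def rank_one_score(block):
    """Fraction of squared Frobenius norm in the top singular value (1.0 means exactly rank one)."""
    singular_values = np.linalg.svd(np.asarray(block), compute_uv=False)
    total = float(np.sum(singular_values ** 2))
    if total == 0.0:
        return 0.0  # convention for an all-zero block
    return float(singular_values[0] ** 2) / total


rank_one_example = rank_one_score(np.outer([1.0, 2.0], [3.0, 4.0]))  # an outer product
full_rank_example = rank_one_score(np.eye(2))                        # two equal singular values
```

A score near 1 on a learned block would be consistent with the predicted alignment; a score near 1/d on a d-by-d block would indicate no compression.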

Load-bearing premise

Transforming the projected gradient flow into the Fourier domain captures the essential training dynamics without adding unaccounted approximations or constraints.

What would settle it

Train the network on the symmetric group S3, extract the Fourier coefficients of the hidden-layer neurons, and check whether they fail to concentrate on single irreps or whether the cross-layer alignment deviates from rank one.
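A minimal sketch of the measurement step in this test, assuming a neuron's weights are available as a real vector indexed by the six elements of S3 (an assumption about the parameterization, which this page does not specify). It computes the group Fourier transform against the three irreps of S3 and reports the share of energy in each, normalized by the Plancherel identity so the shares sum to 1. The names `irrep_energy`, `IRREPS`, and `BASIS` are hypothetical.

```python
import itertools
import numpy as np

PERMS = list(itertools.permutations(range(3)))  # the six elements of S3


def perm_matrix(p):
    """3x3 matrix sending basis vector e_j to e_{p[j]}."""
    m = np.zeros((3, 3))
    for j, image in enumerate(p):
        m[image, j] = 1.0
    return m


def sign(p):
    """Parity of the permutation, computed from its inversion count."""
    inversions = sum(p[i] > p[j] for i in range(3) for j in range(i + 1, 3))
    return -1.0 if inversions % 2 else 1.0


# Orthonormal basis of the plane orthogonal to (1, 1, 1). Restricting the
# permutation matrices to this plane gives the 2-dimensional standard irrep.
BASIS = np.array([[1, -1, 0], [1, 1, -2]], dtype=float).T
BASIS /= np.linalg.norm(BASIS, axis=0)

IRREPS = {
    "trivial": [np.ones((1, 1)) for _ in PERMS],
    "sign": [np.full((1, 1), sign(p)) for p in PERMS],
    "standard": [BASIS.T @ perm_matrix(p) @ BASIS for p in PERMS],
}


def irrep_energy(f):
    """Share of the energy of f (indexed like PERMS) carried by each irrep of S3."""
    f = np.asarray(f, dtype=float)
    total = len(PERMS) * np.sum(f ** 2)
    if total == 0.0:
        raise ValueError("f must be nonzero")
    energy = {}
    for name, mats in IRREPS.items():
        f_hat = sum(value * rho for value, rho in zip(f, mats))  # Fourier coefficient
        dim = mats[0].shape[0]
        energy[name] = float(dim * np.sum(f_hat ** 2) / total)
    return energy
```

For example, `irrep_energy([rho[0, 0] for rho in IRREPS["standard"]])` puts all of its energy on the standard irrep, which is the pattern the paper predicts for a converged neuron. A trained neuron whose energy stays spread over several irreps would count against the claim.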

Figures

Figures reproduced from arXiv: 2606.02993 by Fengzhuo Zhang, Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang.

**Figure 1.** Empirical verification of Observations 1 and 2. (a) DFT heatmaps of the learned parameter $\xi_m$ for the top 15 neurons on $G = \mathbb{Z}_3 \oplus \mathbb{Z}_5$. Each row corresponds to a neuron and each column to a frequency $k$. $\theta^1_m$ and $\theta^2_m$ exhibit identical sparsity (see …

**Figure 2.** Empirical verification: panel (a) plots the phases $\{\phi^\tau_m\}$ on the unit circle and their joint distribution, illustrating uniform distribution and mutual independence. Panel (b) shows a histogram of the surviving frequencies $\{\check{k}_m\}$ across neurons, confirming the uniform occupancy over all conjugate pairs. We prove Observation 3 rigorously as part of Theorem 5.1 in §5. (a) Distribution of phases {ϕ …

**Figure 3.** Visual introduction to group structure and spectral representations. In each panel, the Cayley graph (left) illustrates the group's algebraic structure, where nodes represent unique group elements and edges denote the action of specific generators. The spectral basis heatmaps (right) visualize the irreducible representations. While $\mathbb{Z}_{12}$ is characterized by twelve 1D irreps, $A_4$ exhibits a more complex spectr…

**Figure 4.** Illustration of the geometric concepts used in the center-stable manifold theorem for the saddle-avoidance argument. (a) The Riemannian gradient flow evolves intrinsically on the manifold $\mathcal{M}$. (b) Near a strict saddle $p$, the tangent space decomposes as $T_p\mathcal{M} = E^{sc}_p \oplus E^{u}_p$, where $E^{sc}_p$ contains the non-expanding directions and $E^{u}_p$ contains the expanding directions. The center-stable manifold theorem yiel…

**Figure 5.** Empirical verification of the spectral pattern (i) in Theorem 4.3 for Stage I. The heatmaps display the learned parameters for the top 20 neurons after applying the group DFT. Each row corresponds to one neuron. Along the horizontal axis, the coefficients are grouped by irreducible representations of the Frobenius group: the 1-D representations $\rho_{\mathrm{triv}}$, $\rho_1$ and $\rho_1^{\vee}$ each contribute a single column, while the…

**Figure 6.** Empirical verifications of the perfect accuracy condition in (µ-PA) and the spectral patterns (ii) and (iii) in Theorem 4.3. (a) Accuracy curves across training, showing that the classifier reaches accuracy 1 and then remains there. (b) Evolution of the rotational alignment metric distal for the active Fourier blocks. The trajectories approach 1 and their variance reduces to 0, which means these matrices b…

**Figure 7.** Empirical verification of the loss decrease and scale growth predicted by Theorem 4.5. (a) During Stage I, the loss remains nearly constant due to the small, frozen scaling factor $a$, before undergoing a rapid drop toward 0 in Stage II. (b)–(c) Evolution of the tied and untied scaling factors, both exhibiting logarithmic growth. The tied case corresponds to the theoretical setup under (µ-PA). For a fair com…

**Figure 8.** Training dynamics of $\mathbb{Z}_3 \oplus \mathbb{Z}_5$ under the initializations in Theorem 5.3. (a) Phase alignment: the alignment level $\Re(\varphi_m)$ converges to 1 at different speeds depending on the initial phase. (b) Representation competition: the magnitude of the winning irrep grows while all competitors decay. • Discussion of (i): Phase Alignment. The first part isolates phase dynamics by assuming the representation competition is r…

**Figure 9.** Heatmap of the learned parameters for the top 20 neurons on the generalized modular addition task over $G = \mathbb{Z}_3 \oplus \mathbb{Z}_5$, after applying the Discrete Fourier Transform. Each row corresponds to one neuron, and the three columns of panels correspond to θc1 m, θc2 m, and ξcm, respectively. The upper row plots the real parts and the lower row plots the imaginary parts of the Fourier coefficients. Along the horizontal …

**Figure 10.** Heatmap of the learned parameters for the top 20 neurons on the generalized modular addition task over $G = \mathbb{Z}_2 \oplus \mathbb{Z}_3 \oplus \mathbb{Z}_5$, after applying the Discrete Fourier Transform. Each row corresponds to one neuron, and the three columns of panels correspond to θc1 m, θc2 m, ξcm, respectively. Along the horizontal axis, each column is indexed by a frequency tuple k, and conjugate frequencies are arranged symmetricall…

Original abstract

Understanding how internal structure emerges during neural network training is central to the study of deep learning. We investigate this phenomenon through the group composition task, where a two-layer neural network is trained to predict $g_1 \star g_2$ for elements of a finite group $G$. By lifting the projected gradient flow to the Fourier domain, we demonstrate that the training dynamics are governed by a Riemannian gradient ascent on a representation-theoretic energy functional. We prove that, under random initialization, this flow drives each neuron to converge almost surely toward a single irreducible representation, while the cross-layer Fourier coefficients achieve a rotational rank-one alignment. This framework provides a representation-theoretic account of feature learning and characterizes a novel low-rank compression phenomenon for matrix-valued group representations. Moreover, for Abelian groups, we provide a complete population-level description: random initialization promotes uniform diversification across nontrivial representations and induces Haar-uniform phases, jointly approximating the indicator via a majority-vote mechanism. We further prove that both phase alignment and representation competition emerge with exponential convergence rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper claims almost-sure convergence of neurons to single irreps and rank-one Fourier alignment under a lifted gradient flow for group composition, but the lifting step itself is the part that needs checking.

The main thing to know is that the authors lift the projected gradient flow on a two-layer network trained for group composition to the Fourier domain and show it becomes Riemannian gradient ascent on a representation-theoretic energy. From there they derive almost-sure convergence of each neuron to a single irreducible representation plus rotational rank-one alignment of the cross-layer coefficients. For Abelian groups they add a full population-level picture with uniform diversification, Haar-uniform phases, and exponential rates via a majority-vote mechanism.

The work applies standard finite-group representation theory and Riemannian optimization tools to this concrete task in a way that produces a clean mechanistic story for how structured features emerge. The low-rank compression observation for matrix-valued representations is a direct consequence of the alignment result and feels like a genuine addition rather than a restatement of prior empirical patterns.

The soft spot is exactly the lifting claim. The abstract says the lifted dynamics are governed by a Riemannian gradient ascent on the energy and mentions no residual terms, yet the projection onto the network parameter manifold need not commute with the Fourier transform, especially when the irreps are matrix-valued for non-Abelian groups. Without the full derivations and error analysis in view, it is not possible to confirm that the projection introduces no unaccounted approximations. That assumption carries the strong convergence statements.

The paper is aimed at theorists who want a representation-theoretic account of feature learning on algebraic tasks. Readers already working on symmetry-aware models or group-equivariant networks would get the most out of the framework.

I would send it to peer review so the proofs can be examined directly.

Referee Report

1 major / 0 minor

Summary. The paper studies two-layer neural networks trained on the finite-group composition task (predict g1 ⋆ g2). By lifting projected gradient flow to the Fourier domain, it claims the dynamics reduce exactly to Riemannian gradient ascent on a representation-theoretic energy; under random initialization this yields almost-sure convergence of each neuron to a single irreducible representation, rotational rank-one alignment of cross-layer Fourier coefficients, a low-rank compression phenomenon, and—for Abelian groups—a complete population-level characterization with uniform diversification, Haar-uniform phases, majority-vote approximation of the indicator, and exponential convergence rates.

Significance. If the lifting is exact and the convergence statements hold, the work supplies a representation-theoretic account of feature learning and a novel compression result for matrix-valued group representations. The explicit exponential-rate claims and the Abelian-group population description would be notable contributions to the theory of structured feature emergence.

major comments (1)

The central claim that the projected gradient flow, once lifted to the Fourier domain, becomes exactly a Riemannian gradient ascent on the representation-theoretic energy functional (without residual terms arising from the projection) is load-bearing for every convergence and alignment result. For non-Abelian groups the irreps are matrix-valued; the projection onto the network parameter manifold need not commute with the Fourier transform, so it is unclear whether the lifted dynamics remain exactly the claimed Riemannian flow or acquire additional constraints or approximation errors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the centrality of the exact lifting argument. We address the single major comment below.

Referee: The central claim that the projected gradient flow, once lifted to the Fourier domain, becomes exactly a Riemannian gradient ascent on the representation-theoretic energy functional (without residual terms arising from the projection) is load-bearing for every convergence and alignment result. For non-Abelian groups the irreps are matrix-valued; the projection onto the network parameter manifold need not commute with the Fourier transform, so it is unclear whether the lifted dynamics remain exactly the claimed Riemannian flow or acquire additional constraints or approximation errors.

Authors: We agree that exactness of the lift is essential. In the manuscript (Section 3 and Appendix B), the projected gradient flow is written in coordinates that are already the Fourier coefficients of the weight matrices. Because the discrete Fourier transform on a finite group is a unitary change of basis (with respect to the standard Euclidean inner product on the parameter space), it is an isometry; the orthogonal projection onto the Stiefel manifold of each layer therefore commutes with the transform and produces no residual terms. For non-Abelian groups the same argument applies entrywise to the matrix-valued Fourier coefficients: each irrep block evolves independently under its own Riemannian metric induced by the Frobenius inner product, and the projection remains block-diagonal in the Fourier basis. We will add an explicit lemma (new Lemma 3.2) and a short remark after Equation (7) in the revision to make this commutation explicit and to address the matrix-valued case directly.

Revision: yes
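The isometry step in this argument can be checked numerically in the Abelian case. The sketch below assumes G = Z3 ⊕ Z5 with parameters stored as real 3×5 arrays, and it takes the unit sphere as a stand-in constraint set (the simplest Stiefel manifold), which is an assumption and not the paper's setup. It confirms that the orthonormal DFT preserves inner products and that, for this sphere, tangent-space projection gives the same result before or after the transform. It says nothing about the matrix-valued non-Abelian case the referee raises. The helper names `dft` and `tangent_project` are hypothetical.

```python
import numpy as np


def dft(a):
    """Unitary DFT on Z3 (+) Z5, realized as an orthonormally scaled 2-D FFT."""
    return np.fft.fft2(a, norm="ortho")


def tangent_project(w, g):
    """Remove from g its component along the unit vector w (sphere tangent projection)."""
    return g - np.vdot(w, g) * w  # np.vdot conjugates its first argument


rng = np.random.default_rng(0)
w = rng.standard_normal((3, 5))
w /= np.linalg.norm(w)           # assumed constraint: unit-norm parameter
g = rng.standard_normal((3, 5))  # stand-in for a Euclidean gradient

# Isometry: the inner product is unchanged by the transform.
inner_direct = np.sum(w * g)
inner_fourier = np.vdot(dft(w), dft(g))

# Commutation: project then transform versus transform then project.
project_then_transform = dft(tangent_project(w, g))
transform_then_project = tangent_project(dft(w), dft(g))
```

Here `inner_direct` and `inner_fourier` agree up to floating-point error, as do the two projected arrays. Both follow from linearity and unitarity alone.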

Circularity Check

0 steps flagged

No circularity: derivation applies representation theory to gradient flow without reduction to inputs

The paper's chain begins with the group composition task and projected gradient flow on a two-layer network, then lifts the dynamics to the Fourier domain over the finite group G to obtain a Riemannian gradient ascent on a representation-theoretic energy. From random initialization it derives almost-sure convergence of neurons to single irreps and rank-one alignment of cross-layer coefficients. These steps invoke standard finite-group representation theory and Riemannian optimization; no equation equates a claimed prediction to a fitted parameter by construction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The analysis remains self-contained against external benchmarks of representation theory and optimization, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard Fourier analysis for finite groups and the random-initialization assumption; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)

standard math: Fourier analysis on finite groups lifts the gradient flow to a Riemannian structure on representation space.
Invoked to obtain the energy functional and the dynamics of neuron specialization.
domain assumption: Network weights are initialized randomly.
Required for the almost-sure convergence statement.

pith-pipeline@v0.9.1-grok · 5731 in / 1463 out tokens · 31734 ms · 2026-06-28T11:44:59.004483+00:00 · methodology

[1] [1]

2013 , publisher=

Global stability of dynamical systems , author=. 2013 , publisher=

2013

[2] [2]

Advances in Neural Information Processing Systems , volume =

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation , author =. Advances in Neural Information Processing Systems , volume =

[3] [3]

Advances in Neural Information Processing Systems , volume =

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit , author =. Advances in Neural Information Processing Systems , volume =. 2024 , doi =

2024

[4] [4]

International Conference on Learning Representations , year =

Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model , author =. International Conference on Learning Representations , year =

[5] [5]

Foundations of Computational Mathematics , year =

Learning Time-Scales in Two-Layers Neural Networks , author =. Foundations of Computational Mathematics , year =

[6] [6]

Proceedings of Thirty Fifth Conference on Learning Theory , series =

Neural Networks can Learn Representations with Gradient Descent , author =. Proceedings of Thirty Fifth Conference on Learning Theory , series =. 2022 , publisher =

2022

[7] [7]

2024 , eprint =

Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions , author =. 2024 , eprint =

2024

[8] [8]

Advances in Neural Information Processing Systems , year =

Emergence and scaling laws in SGD learning of shallow neural networks , author =. Advances in Neural Information Processing Systems , year =

[9] [9]

Advances in Neural Information Processing Systems , volume =

Can SGD Learn Recurrent Neural Networks with Provable Generalization? , author =. Advances in Neural Information Processing Systems , volume =

[10] [10]

International Conference on Learning Representations , year =

A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features , author =. International Conference on Learning Representations , year =

[11] Provable Guarantees for Neural Networks via Gradient Feature Learning. Advances in Neural Information Processing Systems.

[12] Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. International Conference on Learning Representations.

[13] Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. International Conference on Learning Representations.

[14] Noether's Learning Dynamics: Role of Symmetry Breaking in Neural Networks. Advances in Neural Information Processing Systems.

[15] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177.

[16] Progress Measures for Grokking via Mechanistic Interpretability. International Conference on Learning Representations.

[17] Towards Understanding Grokking: An Effective Theory of Representation Learning. Advances in Neural Information Processing Systems.

[18] Why Do You Grok? A Theoretical Analysis on Grokking Modular Addition. Proceedings of the 41st International Conference on Machine Learning.

[19] Grokking at the Edge of Numerical Stability. International Conference on Learning Representations.

[20] Emergence in Non-neural Models: Grokking Modular Arithmetic via Average Gradient Outer Product. Proceedings of the 42nd International Conference on Machine Learning.

[21] A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations. Proceedings of the 40th International Conference on Machine Learning.

[22] Grokking Group Multiplication with Cosets. Proceedings of the 41st International Conference on Machine Learning.

[23] Towards a Unified and Verified Understanding of Group-Operation Networks. International Conference on Learning Representations.

[24] Group Equivariant Convolutional Networks. Proceedings of the 33rd International Conference on Machine Learning.

[25] On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups. Proceedings of the 35th International Conference on Machine Learning.

[26] A Practical Method for Constructing Equivariant Multilayer Perceptrons for Arbitrary Matrix Groups. Proceedings of the 38th International Conference on Machine Learning.

[27] A General Framework for Equivariant Neural Networks on Reductive Lie Groups. Advances in Neural Information Processing Systems.

[28] Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv preprint arXiv:2104.13478.

[29] Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks. Proceedings of Thirty Seventh Conference on Learning Theory.

[30] Emergent Equivariance in Deep Ensembles. Proceedings of the 41st International Conference on Machine Learning.

[31] MatrixNet: Learning over Symmetry Groups using Learned Group Representations. Advances in Neural Information Processing Systems.

[32] Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability. Journal of Machine Learning Research.

[33] Mechanistic Interpretability for AI Safety -- A Review. Transactions on Machine Learning Research.

[34] Differential equations and dynamical systems. 2013.

[35] Riemannian geometry and geometric analysis. 2005.

[36] Fourier analysis on finite groups and applications. 1999.

[37] Linear representations of finite groups. 1977.

[38] SGD finds then tunes features in two-layer neural networks with near-optimal sample complexity: A case study in the XOR problem. arXiv preprint arXiv:2309.15111.

[39] Hidden progress in deep learning: SGD learns parities near the computational limit. Advances in Neural Information Processing Systems.

[40] Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.

[41] First-order methods almost always avoid strict saddle points. Mathematical Programming, 2019.

[42] Layer normalization. arXiv preprint arXiv:1607.06450.

[43] The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. Conference on Learning Theory, 2022.

[44] A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 2018.

[45] When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems.

[46] The implicit bias of gradient descent on separable data. Journal of Machine Learning Research.

[47] Sequential Group Composition: A Window into the Mechanics of Deep Learning. arXiv preprint arXiv:2602.03655.

[48] Alternating gradient flows: A theory of feature learning in two-layer neural networks. arXiv preprint arXiv:2506.06489.

[49] Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[50] Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838.

[51] Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327.

[52] On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking. arXiv preprint arXiv:2602.16849.

[53] There Will Be a Scientific Theory of Deep Learning. arXiv preprint arXiv:2604.21691.

[54] Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking. arXiv preprint arXiv:2509.21519.

[55] Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization. arXiv preprint arXiv:2511.07378.

[56] Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets. arXiv preprint arXiv:2410.01779.

[57] Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems.

[58] Intrinsic dimensionality explains the effectiveness of language model fine-tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

[59] Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization. arXiv preprint arXiv:2605.05683.