Spectral Edge Dynamics Reveal Functional Modes of Learning
Pith reviewed 2026-05-10 19:04 UTC · model grok-4.3 · Recognition: 2 theorem links
The pith
Dominant training update directions form a spectral edge that induces low-dimensional functional modes over inputs, adapted to each task's algebraic symmetry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training dynamics during grokking concentrate along a small number of dominant update directions -- the spectral edge -- which reliably distinguishes grokking from non-grokking regimes. Standard mechanistic interpretability tools fail to capture these directions because their structure is not localized in parameter or feature space. Instead, each direction induces a structured function over the input domain, revealing low-dimensional functional modes invisible to representation-level analysis. For modular addition, all leading directions collapse to a single Fourier mode; for multiplication, the same collapse appears only in the discrete-log basis; for subtraction, the edge spans a small multi-mode family.
What carries the argument
The spectral edge: the leading eigenvectors of the parameter-update covariance, each of which induces a low-dimensional structured function over the entire input domain.
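The paper does not spell out its estimator, but the construction it describes is standard. A minimal numpy sketch, assuming the edge is taken from an SVD of the centered matrix of flattened per-step updates; the function name spectral_edge, the choice of k, and the toy data are illustrative, not from the paper:

```python
import numpy as np

def spectral_edge(updates, k=4):
    """Top-k eigenvectors of the empirical covariance of parameter updates.

    updates: (T, P) array, row t = flattened update theta_{t+1} - theta_t.
    Returns (eigvals, eigvecs) with eigvecs of shape (k, P).
    Works via SVD of the centered update matrix, so the full P x P
    covariance is never formed.
    """
    U = updates - updates.mean(axis=0, keepdims=True)
    # Rows of Vt are eigenvectors of U.T @ U / (T - 1).
    _, s, Vt = np.linalg.svd(U, full_matrices=False)
    eigvals = s ** 2 / (len(U) - 1)
    return eigvals[:k], Vt[:k]

# Toy demo: 200 synthetic "updates" concentrated on one hidden direction.
rng = np.random.default_rng(0)
direction = rng.standard_normal(512)
direction /= np.linalg.norm(direction)
steps = np.outer(rng.standard_normal(200) * 5.0, direction)
steps += 0.1 * rng.standard_normal((200, 512))   # isotropic noise floor
vals, vecs = spectral_edge(steps, k=3)
print(vals[0] / vals[1])            # sharp edge: first eigenvalue dominates
print(abs(vecs[0] @ direction))     # top eigenvector recovers the direction
```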
Load-bearing premise
The dominant update directions are not localized in parameter or feature space but instead induce structured functions over the input domain that standard mechanistic interpretability tools fail to capture.
What would settle it
The claim would be falsified by extracting the top eigenvectors of the update covariance from a modular-addition grokking run and showing either that they do not each correspond to a single Fourier mode over the inputs, or that activation probing and sparse autoencoders do recover them as localized features.
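A hedged sketch of the Fourier-collapse half of this test, assuming the induced function has been tabulated as a p x p grid over input pairs (a, b); the statistic and the name fourier_concentration are illustrative:

```python
import numpy as np

def fourier_concentration(g):
    """Fraction of non-DC spectral power in the single largest 2-D DFT mode.

    g: (p, p) array, g[a, b] = value of the induced function on input (a, b).
    A value near 1 (or 0.5 for a real-valued function, where power splits
    between conjugate modes) means the function collapses to one Fourier
    mode, as the paper claims for modular addition; a diffuse spectrum
    with no dominant mode would falsify that reading.
    """
    G = np.fft.fft2(g)
    power = np.abs(G) ** 2
    power[0, 0] = 0.0                # ignore the constant (DC) component
    return power.max() / power.sum()

# Sanity check on a function that IS a single Fourier mode over Z_97 x Z_97.
p, k = 97, 5
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
g = np.cos(2 * np.pi * k * (a + b) / p)   # single mode at frequency k in a+b
print(fourier_concentration(g))           # ~0.5: power split across +k and -k
```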
Original abstract
Training dynamics during grokking concentrate along a small number of dominant update directions -- the spectral edge -- which reliably distinguishes grokking from non-grokking regimes. We show that standard mechanistic interpretability tools (head attribution, activation probing, sparse autoencoders) fail to capture these directions: their structure is not localized in parameter or feature space. Instead, each direction induces a structured function over the input domain, revealing low-dimensional functional modes invisible to representation-level analysis. For modular addition, all leading directions collapse to a single Fourier mode. For multiplication, the same collapse appears only in the discrete-log basis, yielding a 5.9x improvement in concentration. For subtraction, the edge spans a small multi-mode family. For $x^2+y^2$, no single harmonic basis suffices, but cross-terms of additive and multiplicative features provide a 4x variance boost, consistent with the decomposition $(a+b)^2 - 2ab$. Multitask training amplifies this compositional structure, with the $x^2+y^2$ spectral edge inheriting the addition circuit's characteristic frequency (2.3x concentration increase). These results suggest that training discovers low-dimensional functional modes over the input domain, whose structure depends on the algebraic symmetry of the task. These results suggest that spectral edge dynamics identify low-dimensional functional subspaces governing learning, whose representation depends on the algebraic structure of the task. Simple harmonic structure emerges only when the task admits a symmetry-adapted basis; more complex tasks require richer functional descriptions.
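Two standard algebraic identities sit behind the multiplication and $x^2+y^2$ results quoted above; spelling them out clarifies why the discrete-log basis and the cross-terms are the natural coordinates (this is textbook background, not material from the paper):

```latex
% For a prime modulus p with primitive root g, the discrete log ind(x),
% defined by x = g^{ind(x)} mod p, turns modular multiplication into
% modular addition of indices:
\[
  x \cdot y \equiv g^{\,\operatorname{ind}(x) + \operatorname{ind}(y)} \pmod{p},
\]
% so Fourier modes in the index coordinate play the role for multiplication
% that ordinary Fourier modes play for addition. For the x^2 + y^2 task,
% the decomposition cited in the abstract is
\[
  x^2 + y^2 = (x + y)^2 - 2xy,
\]
% a cross-term combination of an additive feature (x + y) and a
% multiplicative feature (xy), consistent with the reported 4x variance
% boost from additive-multiplicative cross-terms.
```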
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that during grokking, neural network training dynamics concentrate along a small number of dominant update directions termed the 'spectral edge.' These directions reliably distinguish grokking from non-grokking regimes across modular addition, multiplication, subtraction, and x²+y² tasks (plus a multitask setting). The directions induce structured functions over the input domain aligned with algebraic symmetries (e.g., Fourier modes for addition, discrete-log basis for multiplication with 5.9x concentration, cross-terms for x²+y² with 4x boost), but are invisible to standard mechanistic interpretability tools because their structure is non-localized in parameter or feature space. The results suggest training discovers low-dimensional functional modes whose representation depends on task symmetry.
Significance. If the central observations hold, the work offers a dynamics-focused complement to representation-level interpretability, highlighting how update directions can reveal symmetry-adapted functional subspaces that govern generalization in grokking. The reported alignments (Fourier collapse, discrete-log concentration, compositional cross-terms, multitask amplification) provide concrete, task-specific evidence that could inform theories of how networks exploit algebraic structure. The emphasis on non-localized structure is a useful caution against over-reliance on localized probes.
Major comments (2)
- [Abstract] Abstract and main results: The claim that spectral edge directions 'reveal functional modes of learning' and 'govern' the grokking transition rests on correlational alignments (projections of update vectors onto candidate bases yielding variance explained). No interventional evidence is presented, such as constraining optimization to the orthogonal complement of the spectral edge or ablating updates along these directions to test necessity/sufficiency for the phase transition. Without such tests, the directions could be high-variance consequences rather than causal drivers. A minimal sketch of one such ablation is given after this list.
- [Abstract] Abstract: No quantitative details are supplied on spectral edge extraction (e.g., eigenvalue threshold, number of leading directions retained, or how 'dominant' is defined), error bars across runs, data exclusion criteria, or statistical controls for alternative explanations such as random high-variance directions or post-grokking stabilization.
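One concrete form the requested intervention could take, as a hedged PyTorch sketch: project each gradient step onto the orthogonal complement of the edge subspace and check whether grokking still occurs. The names ablated_step and project_out, and the use of a fixed precomputed edge_basis, are illustrative choices, not the paper's method:

```python
import torch

def project_out(grad_vec, edge_basis):
    """Remove the component of a flattened gradient that lies in the
    spectral-edge subspace. edge_basis: (k, P) with orthonormal rows."""
    coeffs = edge_basis @ grad_vec          # (k,) projection coefficients
    return grad_vec - edge_basis.T @ coeffs

def ablated_step(model, loss, edge_basis, lr=1e-3):
    """One SGD step constrained to the orthogonal complement of the edge.

    If the spectral edge is causally necessary for grokking, training with
    this step should fail to reach the generalizing phase; if the model
    grokks anyway, the edge is a byproduct of learning, not a driver.
    """
    model.zero_grad()
    loss.backward()
    flat = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
    flat = project_out(flat, edge_basis)
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p -= lr * flat[i:i + n].reshape(p.shape)
            i += n
```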
Minor comments (2)
- [Abstract] Abstract: The final two sentences are near-duplicates with minor rephrasing; consolidate into a single concluding statement.
- [Abstract] Abstract: The 5.9x and 4x concentration factors are reported without specifying the baseline (e.g., random basis, full space, or alternative harmonic expansions), making the magnitude hard to interpret.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that the manuscript would benefit from more precise language on the nature of our evidence and from expanded methodological details. We respond to each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Abstract] Abstract and main results: The claim that spectral edge directions 'reveal functional modes of learning' and 'govern' the grokking transition rests on correlational alignments (projections of update vectors onto candidate bases yielding variance explained). No interventional evidence is presented, such as constraining optimization to the orthogonal complement of the spectral edge or ablating updates along these directions to test necessity/sufficiency for the phase transition. Without such tests, the directions could be high-variance consequences rather than causal drivers.
  Authors: We agree that the evidence is correlational, consisting of observed alignments between spectral edge directions and task-specific bases (Fourier for addition, discrete-log for multiplication, cross-terms for x²+y²) together with their ability to distinguish grokking from non-grokking regimes. The current work does not contain interventional experiments that would establish necessity or sufficiency. In the revised manuscript we will replace the phrasing 'govern the grokking transition' and 'governing learning' with more accurate language emphasizing that the directions 'reveal' low-dimensional functional modes via their structured projections. We will also add a short limitations paragraph noting the correlational character of the results and outlining possible future interventional tests.
  Revision: partial.
- Referee: [Abstract] Abstract: No quantitative details are supplied on spectral edge extraction (e.g., eigenvalue threshold, number of leading directions retained, or how 'dominant' is defined), error bars across runs, data exclusion criteria, or statistical controls for alternative explanations such as random high-variance directions or post-grokking stabilization.
  Authors: We will revise the Methods section to supply the requested quantitative details: the precise eigenvalue threshold and criterion for retaining leading directions, the definition of 'dominant' used throughout the study, standard deviations across independent runs for all reported concentration factors, explicit data exclusion criteria, and additional controls that compare the observed alignments against random directions drawn from the same update covariance as well as against post-grokking update directions.
  Revision: yes.
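A minimal sketch of the promised random-direction control, assuming the null directions are sampled from the same empirical update covariance; induced_fn stands in for whatever concentration score is applied to the real edge directions and is deliberately left abstract, since the paper does not specify it:

```python
import numpy as np

def baseline_concentration(updates, induced_fn, n_random=100, seed=0):
    """Null distribution for the concentration statistic.

    Draws random directions with the same covariance as the updates
    (via random Gaussian mixtures of the centered update vectors) and
    scores each with induced_fn, which maps a parameter-space direction
    to its scalar concentration score.
    """
    rng = np.random.default_rng(seed)
    U = updates - updates.mean(axis=0, keepdims=True)
    scores = []
    for _ in range(n_random):
        w = rng.standard_normal(len(U))
        d = w @ U                    # sample from N(0, cov) up to scale
        d /= np.linalg.norm(d)
        scores.append(induced_fn(d))
    return np.array(scores)

# Usage: p-value-style comparison against the top edge direction's score.
# null = baseline_concentration(updates, induced_fn)
# print((null >= edge_score).mean())
```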
Circularity Check
No circularity: the spectral edge is an observed empirical property, not a self-referential construct.
Full rationale
The paper defines the spectral edge directly from the leading eigenvectors of the training update covariance matrix and then reports measured alignments (e.g., collapse to Fourier modes for modular addition) via projection onto candidate bases. This is a straightforward observational pipeline with no equations that equate the claimed functional modes to the definition of the edge itself, no fitted parameters renamed as predictions, and no load-bearing self-citations. The central claim remains an empirical finding about concentration of variance rather than a tautology.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Training dynamics during grokking concentrate along a small number of dominant update directions—the spectral edge—which reliably distinguishes grokking from non-grokking regimes... each direction induces a structured function over the input domain"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "For modular addition, all leading directions collapse to a single Fourier mode... For multiplication, the same collapse appears only in the discrete-log basis"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] T. Bricken et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
- [2] B. Chughtai, L. Chan, and N. Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In ICML, 2023.
- [3] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models. In ICLR, 2024.
- [4] N. Elhage et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- [5] N. Elhage et al. Toy models of superposition. Transformer Circuits Thread, 2022.
- [6] C. Olah, N. Elhage, T. Henighan, and others. A toy model of interference weights. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/interference-weights/index.html
- [7] C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. In ICLR, 2018.
- [8] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023.
- [9] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022.
- [10] D. Stander, Q. Yu, H. Fan, and E. Vonosek. Grokking group multiplication with cosets. arXiv:2312.06581, 2024.
- [11] Y. Xu. Low-dimensional and transversely curved optimization dynamics in grokking. arXiv:2602.16746, 2026.
- [12] Y. Xu. The geometry of multi-task grokking: Transverse instability, superposition, and weight decay phase structure. arXiv:2602.18523, 2026.
- [13] Y. Xu. The spectral edge thesis: A mathematical framework for intra-signal phase transitions in neural network training. arXiv:2603.28964, 2026.
- [14] Z. Zhong, Z. Liu, M. Tegmark, and J. Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks. In NeurIPS, 2024.