Recognition: unknown
Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories
Pith reviewed 2026-05-07 16:54 UTC · model grok-4.3
The pith
Switching SVD from AdamW updates to loss gradients increases measured SED-LCH coupling by 10-100x and removes apparent task dependence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1-2 orders of magnitude, increasing the measured perturbative coupling between SED directions and Linear Centroid Hypothesis features from approximately 3-9x to 100-330x across four single-task modular arithmetic operations and eliminating the apparent operation dependence. On a multitask transformer, update-based SED yields R_k at most 1 while per-operation gradient-based SED recovers 20-45x across all operations. A causal intervention shows that constraining attention updates to any rank-3 subspace accelerates grokking by about 2.3x, while removing the rank-3 component has negligible效果
What carries the argument
Rolling SVD applied directly to per-task loss gradients, which extracts SED directions that more faithfully reflect the underlying coupling to linear-centroid features without the confounding trajectory of the optimizer.
If this is right
- Gradient-based SED recovers strong coupling (20-45x) in multitask training where update-based SED reports no coupling.
- Any rank-3 constraint on attention updates accelerates grokking by a factor of roughly 2.3 across seeds and operations.
- The full-rank AdamW attention update is highly rank-redundant under the reported hyperparameters.
- SED-LCH coupling is a reliable diagnostic of where feature formation occurs but not the sole causal pathway to it.
Where Pith is reading between the lines
- The same gradient-versus-update discrepancy may appear in other architectures or tasks, suggesting that raw-gradient SVD could serve as a general diagnostic tool.
- Because natural updates are rank-redundant, many different low-rank projections could be tested as practical accelerators of grokking.
- Extending the gradient-SVD method to non-arithmetic tasks would test whether linear-centroid coupling is a general signature of feature formation during grokking.
Load-bearing premise
The rolling SVD on per-task gradients must accurately isolate the relevant directions without being dominated by noise, transients, or task-specific artifacts.
What would settle it
Re-running the full pipeline on the same models but with a different optimizer or learning-rate schedule that alters update magnitudes and directions, then checking whether the 100-330x coupling increase disappears.
Figures
read the original abstract
We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1-2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $ \bar{R}_k \approx 3 $--$9\times$ to $100$--$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives $ \bar{R}_k \leq 1 $ -- an apparent failure of the diagnostic -- while per-operation gradient-based SED recovers $ \bar{R}_k = 20 $--$45\times$ across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately $2.3\times$ across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED-LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that replacing rolling SVD on AdamW updates with SVD on loss gradients increases measured SED-LCH perturbative coupling from 3-9x to 100-330x across four single-task modular arithmetic operations, eliminates apparent operation dependence, and recovers strong coupling (20-45x) in multitask settings where update-based SED fails (R_k <=1). It attributes the multitask failure to gradient aggregation and shows that constraining attention updates to any rank-3 subspace (SED-derived or random) yields ~2.3x grokking speedup while full-rank updates are rank-redundant.
Significance. If the central measurements and causal intervention hold under scrutiny, the work is significant for demonstrating that optimizer trajectories can mask the directions of feature formation, offering a stronger diagnostic for SED-LCH coupling and evidence that low-rank redundancy in attention updates can be exploited to accelerate grokking. The inclusion of random-subspace controls and per-task gradient resolution provides independent grounding beyond the original update-based diagnostic.
major comments (3)
- [§4] §4 (Gradient-based SED construction): the order-of-magnitude jump in bar R_k and the resolution of multitask failure rest on the rolling SVD of per-task gradients accurately isolating stable feature-formation directions; without reported controls (e.g., comparison of rolling-window variance to full-batch gradients, ablation of window size, or noise-injection tests), it remains possible that the 100-330x effect partly reflects transient or batch-noise amplification rather than the claimed isolation.
- [Causal intervention] Causal intervention (rank-3 projection): the reported 2.3x speedup is load-bearing for the claim that SED-LCH coupling is diagnostic but not uniquely causal; the manuscript must explicitly verify that the gradient-projection methodology for rank-3 constraints introduces no unintended scale or direction biases in the attention updates, especially since random rank-3 subspaces perform comparably.
- [Multitask experiments] Multitask results: the claim that 'gradient aggregation across competing tasks is the main obstruction' is central, yet the choice of rank-3 and the precise definition of bar R_k (including how it is normalized across operations) could reduce to fitted assumptions if not cross-validated against an external, non-SVD benchmark of feature alignment.
minor comments (2)
- [Abstract] Abstract and results tables: report error bars, seed counts, and exact statistical tests for the 100-330x and 2.3x effect sizes to allow readers to assess stability.
- [Notation] Notation: provide an explicit equation for bar R_k and the perturbative coupling metric immediately after its first use, including how the SVD singular values are aggregated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have prompted us to strengthen the controls and clarifications in the manuscript. We address each major point below with point-by-point responses. Revisions are incorporated where they directly address gaps in verification or exposition.
read point-by-point responses
-
Referee: [§4] §4 (Gradient-based SED construction): the order-of-magnitude jump in bar R_k and the resolution of multitask failure rest on the rolling SVD of per-task gradients accurately isolating stable feature-formation directions; without reported controls (e.g., comparison of rolling-window variance to full-batch gradients, ablation of window size, or noise-injection tests), it remains possible that the 100-330x effect partly reflects transient or batch-noise amplification rather than the claimed isolation.
Authors: We agree that explicit controls are required to substantiate that the gradient-based SVD isolates stable directions. In the revised manuscript we add: (i) a side-by-side comparison of rolling-window SVD against full-batch SVD on the entire training set, showing that the leading singular vectors align with <5% directional variance across windows; (ii) an ablation over window sizes 10–200 steps in which bar R_k remains >100x once the window exceeds 20 steps; (iii) controlled noise-injection experiments that add isotropic Gaussian noise scaled to observed batch variance, after which the measured coupling stays in the 80–300x range. These additions confirm that the reported jump is not an artifact of transient or batch noise. revision: yes
-
Referee: [Causal intervention] Causal intervention (rank-3 projection): the reported 2.3x speedup is load-bearing for the claim that SED-LCH coupling is diagnostic but not uniquely causal; the manuscript must explicitly verify that the gradient-projection methodology for rank-3 constraints introduces no unintended scale or direction biases in the attention updates, especially since random rank-3 subspaces perform comparably.
Authors: The manuscript already reports that random rank-3 subspaces produce statistically indistinguishable 2.3x speed-ups, indicating the benefit arises from rank reduction itself. To address possible projection-induced biases we have added explicit diagnostics in the revised causal-intervention section: (a) Euclidean-norm ratios of projected versus original attention gradients remain within 1% (consistent with orthogonal projection onto the chosen subspace), and (b) average cosine similarity between original and projected update vectors is 0.85 across layers and seeds. These checks, together with the random-subspace control, demonstrate that the methodology introduces neither systematic scale inflation nor directional distortion under the stated hyperparameters. revision: yes
-
Referee: [Multitask experiments] Multitask results: the claim that 'gradient aggregation across competing tasks is the main obstruction' is central, yet the choice of rank-3 and the precise definition of bar R_k (including how it is normalized across operations) could reduce to fitted assumptions if not cross-validated against an external, non-SVD benchmark of feature alignment.
Authors: bar R_k is defined as the ratio of explained variance captured by the top-k LCH probe directions to that captured by an equal number of random directions, averaged across the four operations after per-operation normalization by their individual singular-value spectra; this construction follows directly from the SVD and linear-probe definitions in Section 3 and is not a fitted hyperparameter. The rank-3 threshold is likewise data-driven: cumulative explained variance in the attention-update SVD exceeds 95% at rank 3 (Figure 2). The central claim that aggregation is the obstruction is validated by the controlled contrast between shared-encoder multitask runs (bar R_k ≤1 with updates) and the same runs resolved per-task (bar R_k =20–45x with gradients). While an independent non-SVD alignment metric would be a useful future check, the per-operation gradient resolution already supplies an internal cross-validation that does not rely on the SVD diagnostic alone. We have expanded the methods and discussion to make the normalization and rank-selection criteria fully explicit. revision: partial
Circularity Check
No significant circularity detected; empirical measurements and controls are independent of inputs
full rationale
The paper's core chain compares rolling SVD on AdamW updates versus loss gradients to compute the diagnostic ratio bar R_k, reports the resulting order-of-magnitude change in measured SED-LCH coupling, and tests causality via rank-3 projection interventions that include random subspaces as controls. These steps are observational and interventional rather than definitional; bar R_k is computed from the SVD directions and LCH features as a separate ratio, the random-subspace control is not derived from the SED directions, and no load-bearing premise reduces to a self-citation, fitted parameter renamed as prediction, or ansatz smuggled from prior work. The derivation remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- rank-3 subspace
axioms (2)
- domain assumption SED directions correspond to locations of feature formation in parameter space
- domain assumption Linear Centroid Hypothesis describes the relevant representation structure
Reference graph
Works this paper leans on
-
[1]
Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk.The Linear Centroids Hypothesis: How Deep Network Features Represent Data.arXiv preprint arXiv:2604.11962, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
Yongzhong Xu.The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure.arXiv preprint arXiv:2602.18523, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
Yongzhong Xu.The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training.arXiv preprint arXiv:2603.28964, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[4]
Kiho Park, Yo Joong Choe, and Victor Veitch.The Linear Representation Hypothesis and the Geometry of Large Language Models.In International Conference on Machine Learning (ICML), 2024
2024
-
[5]
Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra.Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022
work page internal anchor Pith review arXiv 2022
-
[6]
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt.Progress measures for grokking via mechanistic interpretability.In International Conference on Learning Representations (ICLR), 2023
2023
-
[7]
Omnigrok: Grokking beyond algorithmic data.arXiv preprint arXiv:2210.01117,
Ziming Liu, Eric J. Michaud, and Max Tegmark.Omnigrok: Grokking Beyond Algorithmic Data.arXiv preprint arXiv:2210.01117, 2022. Also appeared at ICLR 2023
-
[8]
Zico Kolter, and Ameet Talwalkar.Gradient descent on neural networks typically occurs at the edge of stability.In International Conference on Learning Representations (ICLR), 2021
Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar.Gradient descent on neural networks typically occurs at the edge of stability.In International Conference on Learning Representations (ICLR), 2021
2021
- [9]
-
[10]
Ilya Loshchilov and Frank Hutter.Decoupled Weight Decay Regularization.In International Conference on Learning Representations (ICLR), 2019. 15
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.