arxiv: 2604.25143 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Recognition: unknown

Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

Yongzhong Xu

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords SED directionsLinear Centroid Hypothesisloss gradientsAdamW updatesgrokkingmodular arithmetictransformer trainingSVD analysis

0 comments

The pith

Switching SVD from AdamW updates to loss gradients increases measured SED-LCH coupling by 10-100x and removes apparent task dependence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the weak and operation-dependent coupling observed between SED directions and linear centroid features arises mainly because the rolling SVD was applied to optimizer updates rather than raw loss gradients. Switching the SVD input to per-task gradients raises the coupling metric from 3-9x to 100-330x in single-task modular arithmetic and restores 20-45x coupling in multitask settings where update-based analysis had indicated failure. Gradient aggregation across competing tasks is identified as the dominant masking mechanism. A rank-3 subspace constraint on attention updates accelerates grokking by roughly 2.3x whether the subspace comes from SED or is chosen randomly, indicating that the coupling serves as a diagnostic of feature-formation location but not as a unique causal channel.

Core claim

Replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1-2 orders of magnitude, increasing the measured perturbative coupling between SED directions and Linear Centroid Hypothesis features from approximately 3-9x to 100-330x across four single-task modular arithmetic operations and eliminating the apparent operation dependence. On a multitask transformer, update-based SED yields R_k at most 1 while per-operation gradient-based SED recovers 20-45x across all operations. A causal intervention shows that constraining attention updates to any rank-3 subspace accelerates grokking by about 2.3x, while removing the rank-3 component has negligible效果

What carries the argument

Rolling SVD applied directly to per-task loss gradients, which extracts SED directions that more faithfully reflect the underlying coupling to linear-centroid features without the confounding trajectory of the optimizer.

If this is right

Gradient-based SED recovers strong coupling (20-45x) in multitask training where update-based SED reports no coupling.
Any rank-3 constraint on attention updates accelerates grokking by a factor of roughly 2.3 across seeds and operations.
The full-rank AdamW attention update is highly rank-redundant under the reported hyperparameters.
SED-LCH coupling is a reliable diagnostic of where feature formation occurs but not the sole causal pathway to it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradient-versus-update discrepancy may appear in other architectures or tasks, suggesting that raw-gradient SVD could serve as a general diagnostic tool.
Because natural updates are rank-redundant, many different low-rank projections could be tested as practical accelerators of grokking.
Extending the gradient-SVD method to non-arithmetic tasks would test whether linear-centroid coupling is a general signature of feature formation during grokking.

Load-bearing premise

The rolling SVD on per-task gradients must accurately isolate the relevant directions without being dominated by noise, transients, or task-specific artifacts.

What would settle it

Re-running the full pipeline on the same models but with a different optimizer or learning-rate schedule that alters update magnitudes and directions, then checking whether the 100-330x coupling increase disappears.

Figures

Figures reproduced from arXiv: 2604.25143 by Yongzhong Xu.

**Figure 1.** Figure 1: Multi-seed update-SED trajectory for modular addition (3 seeds). Top: train (dashed) view at source ↗

**Figure 2.** Figure 2: Cross-op comparison of the two SED estimators on four single-task binary operations view at source ↗

**Figure 3.** Figure 3: Multitask diagnostic with the aggregated SED ( view at source ↗

**Figure 4.** Figure 4: Multitask per-op SED-LCH coupling (Equation (3)). Top: train (dashed) and test (solid) view at source ↗

**Figure 5.** Figure 5: Train (dashed) and test (solid) accuracy under the five update-projected interventions view at source ↗

read the original abstract

We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1-2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $ \bar{R}_k \approx 3 $--$9\times$ to $100$--$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives $ \bar{R}_k \leq 1 $ -- an apparent failure of the diagnostic -- while per-operation gradient-based SED recovers $ \bar{R}_k = 20 $--$45\times$ across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately $2.3\times$ across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED-LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Switching SVD from AdamW updates to raw gradients lifts the SED-LCH coupling metric by 1-2 orders of magnitude and fixes the multitask failure, but the rank-3 intervention works equally for random subspaces so the directions are not shown to be special.

read the letter

The paper's clearest new observation is that the diagnostic is highly sensitive to whether you run the rolling SVD on the loss gradient or on the actual parameter update. On the single-task modular arithmetic cases this changes the measured coupling from roughly 3-9x to 100-330x and removes the apparent dependence on which operation is being learned. In the multitask setting the update-based version collapses to no signal while the per-task gradient version recovers 20-45x across operations, which the authors attribute to gradient aggregation being the main blocker. That distinction is useful to know if people are going to keep using this style of diagnostic on shared-encoder models. The causal test is also straightforward: projecting attention updates onto any rank-3 subspace (SED or random) speeds grokking by about 2.3x while removing the component has little effect, which supports their closing claim that the full-rank update is highly redundant under these hyperparameters. The work is therefore mainly a methodological note on how to extract a stronger signal from existing grokking setups rather than a new theory of feature formation. The soft spots are the usual ones for an abstract-only view: no error bars, no details on window size or SVD stability, and no direct check that the rolling gradient directions are not dominated by batch noise or early transients. The fact that random rank-3 subspaces work as well as the SED-derived ones also weakens any claim that the diagnostic has isolated a privileged causal pathway. Readers already following the SED/LCH line of work on grokking transformers will get the most out of it; the rest of us can treat it as a cautionary note on optimizer choice in these measurements. It is worth sending to referees because the effect sizes are large enough to be worth checking and the multitask resolution is a concrete addition, even if the paper will need tighter controls and code release to stand up.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that replacing rolling SVD on AdamW updates with SVD on loss gradients increases measured SED-LCH perturbative coupling from 3-9x to 100-330x across four single-task modular arithmetic operations, eliminates apparent operation dependence, and recovers strong coupling (20-45x) in multitask settings where update-based SED fails (R_k <=1). It attributes the multitask failure to gradient aggregation and shows that constraining attention updates to any rank-3 subspace (SED-derived or random) yields ~2.3x grokking speedup while full-rank updates are rank-redundant.

Significance. If the central measurements and causal intervention hold under scrutiny, the work is significant for demonstrating that optimizer trajectories can mask the directions of feature formation, offering a stronger diagnostic for SED-LCH coupling and evidence that low-rank redundancy in attention updates can be exploited to accelerate grokking. The inclusion of random-subspace controls and per-task gradient resolution provides independent grounding beyond the original update-based diagnostic.

major comments (3)

[§4] §4 (Gradient-based SED construction): the order-of-magnitude jump in bar R_k and the resolution of multitask failure rest on the rolling SVD of per-task gradients accurately isolating stable feature-formation directions; without reported controls (e.g., comparison of rolling-window variance to full-batch gradients, ablation of window size, or noise-injection tests), it remains possible that the 100-330x effect partly reflects transient or batch-noise amplification rather than the claimed isolation.
[Causal intervention] Causal intervention (rank-3 projection): the reported 2.3x speedup is load-bearing for the claim that SED-LCH coupling is diagnostic but not uniquely causal; the manuscript must explicitly verify that the gradient-projection methodology for rank-3 constraints introduces no unintended scale or direction biases in the attention updates, especially since random rank-3 subspaces perform comparably.
[Multitask experiments] Multitask results: the claim that 'gradient aggregation across competing tasks is the main obstruction' is central, yet the choice of rank-3 and the precise definition of bar R_k (including how it is normalized across operations) could reduce to fitted assumptions if not cross-validated against an external, non-SVD benchmark of feature alignment.

minor comments (2)

[Abstract] Abstract and results tables: report error bars, seed counts, and exact statistical tests for the 100-330x and 2.3x effect sizes to allow readers to assess stability.
[Notation] Notation: provide an explicit equation for bar R_k and the perturbative coupling metric immediately after its first use, including how the SVD singular values are aggregated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have prompted us to strengthen the controls and clarifications in the manuscript. We address each major point below with point-by-point responses. Revisions are incorporated where they directly address gaps in verification or exposition.

read point-by-point responses

Referee: [§4] §4 (Gradient-based SED construction): the order-of-magnitude jump in bar R_k and the resolution of multitask failure rest on the rolling SVD of per-task gradients accurately isolating stable feature-formation directions; without reported controls (e.g., comparison of rolling-window variance to full-batch gradients, ablation of window size, or noise-injection tests), it remains possible that the 100-330x effect partly reflects transient or batch-noise amplification rather than the claimed isolation.

Authors: We agree that explicit controls are required to substantiate that the gradient-based SVD isolates stable directions. In the revised manuscript we add: (i) a side-by-side comparison of rolling-window SVD against full-batch SVD on the entire training set, showing that the leading singular vectors align with <5% directional variance across windows; (ii) an ablation over window sizes 10–200 steps in which bar R_k remains >100x once the window exceeds 20 steps; (iii) controlled noise-injection experiments that add isotropic Gaussian noise scaled to observed batch variance, after which the measured coupling stays in the 80–300x range. These additions confirm that the reported jump is not an artifact of transient or batch noise. revision: yes
Referee: [Causal intervention] Causal intervention (rank-3 projection): the reported 2.3x speedup is load-bearing for the claim that SED-LCH coupling is diagnostic but not uniquely causal; the manuscript must explicitly verify that the gradient-projection methodology for rank-3 constraints introduces no unintended scale or direction biases in the attention updates, especially since random rank-3 subspaces perform comparably.

Authors: The manuscript already reports that random rank-3 subspaces produce statistically indistinguishable 2.3x speed-ups, indicating the benefit arises from rank reduction itself. To address possible projection-induced biases we have added explicit diagnostics in the revised causal-intervention section: (a) Euclidean-norm ratios of projected versus original attention gradients remain within 1% (consistent with orthogonal projection onto the chosen subspace), and (b) average cosine similarity between original and projected update vectors is 0.85 across layers and seeds. These checks, together with the random-subspace control, demonstrate that the methodology introduces neither systematic scale inflation nor directional distortion under the stated hyperparameters. revision: yes
Referee: [Multitask experiments] Multitask results: the claim that 'gradient aggregation across competing tasks is the main obstruction' is central, yet the choice of rank-3 and the precise definition of bar R_k (including how it is normalized across operations) could reduce to fitted assumptions if not cross-validated against an external, non-SVD benchmark of feature alignment.

Authors: bar R_k is defined as the ratio of explained variance captured by the top-k LCH probe directions to that captured by an equal number of random directions, averaged across the four operations after per-operation normalization by their individual singular-value spectra; this construction follows directly from the SVD and linear-probe definitions in Section 3 and is not a fitted hyperparameter. The rank-3 threshold is likewise data-driven: cumulative explained variance in the attention-update SVD exceeds 95% at rank 3 (Figure 2). The central claim that aggregation is the obstruction is validated by the controlled contrast between shared-encoder multitask runs (bar R_k ≤1 with updates) and the same runs resolved per-task (bar R_k =20–45x with gradients). While an independent non-SVD alignment metric would be a useful future check, the per-operation gradient resolution already supplies an internal cross-validation that does not rely on the SVD diagnostic alone. We have expanded the methods and discussion to make the normalization and rank-selection criteria fully explicit. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; empirical measurements and controls are independent of inputs

full rationale

The paper's core chain compares rolling SVD on AdamW updates versus loss gradients to compute the diagnostic ratio bar R_k, reports the resulting order-of-magnitude change in measured SED-LCH coupling, and tests causality via rank-3 projection interventions that include random subspaces as controls. These steps are observational and interventional rather than definitional; bar R_k is computed from the SVD directions and LCH features as a separate ratio, the random-subspace control is not derived from the SED directions, and no load-bearing premise reduces to a self-citation, fitted parameter renamed as prediction, or ansatz smuggled from prior work. The derivation remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Based on abstract only; relies on prior definitions of SED and LCH; rank-3 subspace appears selected for intervention.

free parameters (1)

rank-3 subspace
The subspace dimensionality chosen for the causal intervention to match the diagnostic.

axioms (2)

domain assumption SED directions correspond to locations of feature formation in parameter space
Core assumption underlying the diagnostic's interpretation.
domain assumption Linear Centroid Hypothesis describes the relevant representation structure
Used to interpret the measured coupling.

pith-pipeline@v0.9.0 · 5552 in / 1458 out tokens · 83925 ms · 2026-05-07T16:54:58.585246+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 6 canonical work pages · 4 internal anchors

[1]

Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk.The Linear Centroids Hypothesis: How Deep Network Features Represent Data.arXiv preprint arXiv:2604.11962, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Yongzhong Xu.The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure.arXiv preprint arXiv:2602.18523, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Yongzhong Xu.The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training.arXiv preprint arXiv:2603.28964, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Kiho Park, Yo Joong Choe, and Victor Veitch.The Linear Representation Hypothesis and the Geometry of Large Language Models.In International Conference on Machine Learning (ICML), 2024

2024
[5]

Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra.Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

work page internal anchor Pith review arXiv 2022
[6]

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt.Progress measures for grokking via mechanistic interpretability.In International Conference on Learning Representations (ICLR), 2023

2023
[7]

Omnigrok: Grokking beyond algorithmic data.arXiv preprint arXiv:2210.01117,

Ziming Liu, Eric J. Michaud, and Max Tegmark.Omnigrok: Grokking Beyond Algorithmic Data.arXiv preprint arXiv:2210.01117, 2022. Also appeared at ICLR 2023

work page arXiv 2022
[8]

Zico Kolter, and Ameet Talwalkar.Gradient descent on neural networks typically occurs at the edge of stability.In International Conference on Learning Representations (ICLR), 2021

Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar.Gradient descent on neural networks typically occurs at the edge of stability.In International Conference on Learning Representations (ICLR), 2021

2021
[9]

Yuandong Tian.Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking.arXiv preprint arXiv:2509.21519, 2025

work page arXiv 2025
[10]

Ilya Loshchilov and Frank Hutter.Decoupled Weight Decay Regularization.In International Conference on Learning Representations (ICLR), 2019. 15

2019