
arxiv: 2510.04930 · v3 · pith:S5TQOGLCnew · submitted 2025-10-06 · 💻 cs.LG

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

Pith reviewed 2026-05-21 21:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords grokking · gradient descent · natural gradient · generalization · optimization · modular addition · sparse parity · principal directions

The pith

Equalizing update speeds across gradient directions removes grokking plateaus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that grokking, the long delay before test performance suddenly improves, arises from stochastic gradient descent advancing at different speeds along different principal directions in the gradient space. It introduces egalitarian gradient descent, which normalizes the gradients to enforce identical speeds along every such direction. This change, presented as a targeted variant of natural gradient descent, shortens or eliminates the plateau on standard examples. Readers care because the fix requires only a simple adjustment to the optimizer and leaves model capacity unchanged. The result is shown on modular addition and sparse parity problems where plateaus have previously been studied in detail.

Core claim

Grokking is induced by asymmetric speeds of stochastic gradient descent along different principal directions of the gradients. Normalizing the gradients so that dynamics along all principal directions evolve at exactly the same speed produces egalitarian gradient descent, a carefully modified form of natural gradient descent. This method makes the model grok much faster and, in some cases, removes the stagnation phase entirely. The approach is demonstrated to eliminate plateaus on classical arithmetic tasks such as modular addition and sparse parity learning.
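The mechanism in the claim above can be illustrated with a small numerical sketch. This is not the paper's model: it is a hypothetical two-direction quadratic in which the curvatures `1.0` and `0.01`, the step size, and the tolerance are all assumptions chosen for illustration. The function name `steps_to_tolerance` is likewise invented here. With plain gradient descent the low-curvature direction needs far more steps than the high-curvature one; dividing each gradient coordinate by its curvature makes both directions finish together.

```python
import numpy as np

def steps_to_tolerance(curvatures, lr, tol=1e-3, equalize=False, max_steps=1_000_000):
    """Return, per direction, the first step at which |error| drops below tol.

    Toy diagonal quadratic: loss = 0.5 * sum(curvatures * err**2), so the
    gradient along direction i is curvatures[i] * err[i].
    """
    lam = np.asarray(curvatures, dtype=float)
    err = np.ones_like(lam)                      # start one unit away in every direction
    first_hit = np.full(lam.shape, -1, dtype=int)
    for step in range(1, max_steps + 1):
        grad = lam * err
        if equalize:
            grad = grad / lam                    # same contraction rate in every direction
        err = err - lr * grad
        newly_done = (np.abs(err) < tol) & (first_hit < 0)
        first_hit[newly_done] = step
        if (first_hit > 0).all():
            break
    return first_hit

vanilla = steps_to_tolerance([1.0, 0.01], lr=0.5)
equalized = steps_to_tolerance([1.0, 0.01], lr=0.5, equalize=True)
print("vanilla GD steps per direction:", vanilla)
print("equalized steps per direction: ", equalized)
```

In this toy the gap between the two directions under plain gradient descent scales roughly like the ratio of the curvatures, which is the kind of asymmetry the paper associates with long plateaus.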

What carries the argument

Egalitarian gradient descent (EGD), which normalizes gradients so the learning dynamics along every principal direction advance at identical speed.
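One plausible reading of this normalization, offered as a sketch rather than the authors' implementation, is to take the singular value decomposition of a layer's gradient matrix and replace every nonzero singular value by 1, so that each singular direction receives the same step length. The function name `egalitarian_direction`, the tolerance `rel_tol`, the learning rate, and the matrix shapes below are assumptions made for illustration.

```python
import numpy as np

def egalitarian_direction(grad, rel_tol=1e-12):
    """Replace the nonzero singular values of `grad` by 1 (assumed reading of EGD)."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    if s.size == 0 or s[0] == 0.0:
        return np.zeros_like(grad)               # zero gradient: nothing to equalize
    keep = s > rel_tol * s[0]                    # drop numerically null directions
    return U[:, keep] @ Vt[keep, :]

rng = np.random.default_rng(0)
# An ill-conditioned stand-in for a hidden-layer gradient (columns scaled very unevenly).
G = rng.normal(size=(6, 4)) * np.array([1.0, 1e-1, 1e-2, 1e-3])
D = egalitarian_direction(G)

W = rng.normal(size=(6, 4))                      # hypothetical weight matrix
lr = 0.1                                         # hypothetical learning rate
W_next = W - lr * D                              # one equalized update step

print("condition number of G:", np.linalg.cond(G))
print("singular values of D: ", np.linalg.svd(D, compute_uv=False))
```

The update direction keeps the singular vectors of the raw gradient and discards only the relative magnitudes, which is why it can be dropped into an existing training loop in place of the raw gradient.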

If this is right

  • Grokking plateaus can be removed or greatly shortened by enforcing uniform speed across principal gradient directions.
  • The same normalization works as a drop-in change to existing training loops on arithmetic tasks.
  • In some settings the stagnation phase disappears completely rather than merely shortening.
  • EGD provides a concrete link between optimization geometry and the timing of generalization jumps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests many observed grokking delays may be optimizer artifacts rather than fundamental limits of the data or architecture.
  • Similar speed equalization could be tested on other regimes where training accuracy rises well before test accuracy.
  • The method invites direct comparisons with other preconditioned optimizers to isolate which normalization choices matter most for plateau removal.

Load-bearing premise

Asymmetric speeds of gradient descent along principal directions are the primary cause of grokking plateaus, and equalizing those speeds removes the stagnation without creating new optimization problems.

What would settle it

Training the same model on modular addition with egalitarian gradient descent and still observing a long plateau before test accuracy rises would show the speed-equalization step does not address the cause.
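To run such a check, the plateau has to be measured. A minimal sketch, assuming per-epoch accuracy curves are logged as Python lists: define the grokking delay as the gap between the first epoch at which train accuracy and test accuracy each reach a threshold. The `0.99` threshold and both function names are hypothetical choices, not taken from the paper, and the curves below are synthetic.

```python
def first_epoch_at_least(curve, threshold):
    """Index of the first entry in `curve` that is >= threshold, or None."""
    for epoch, value in enumerate(curve):
        if value >= threshold:
            return epoch
    return None

def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Epochs between train and test accuracy first reaching `threshold`."""
    t_train = first_epoch_at_least(train_acc, threshold)
    t_test = first_epoch_at_least(test_acc, threshold)
    if t_train is None or t_test is None:
        return None                               # one of the curves never got there
    return t_test - t_train

# Synthetic curves for illustration only (not results from the paper).
train_curve = [0.2, 1.0, 1.0, 1.0, 1.0, 1.0]
slow_test_curve = [0.1, 0.1, 0.1, 0.1, 0.1, 1.0]
fast_test_curve = [0.1, 1.0, 1.0, 1.0, 1.0, 1.0]

print(grokking_delay(train_curve, slow_test_curve))  # 4
print(grokking_delay(train_curve, fast_test_curve))  # 0
```

A long delay that persists under the equalized optimizer would be the falsifying observation described above.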

Figures

Figures reproduced from arXiv: 2510.04930 by Ali Saheb Pasand, Elvis Dohmatob.

Figure 1: Results on Modular Addition for different values of the modulus p. Solid lines correspond to test accuracy and broken lines correspond to train accuracy. In all cases, our proposed EGD (egalitarian gradient descent) method groks after only a few epochs, while vanilla (stochastic) gradient descent stagnates for a long period before eventually grokking. We also include "Column Normalization", a simplificatio…
Figure 2: Results on Modular Multiplication for different values of the modulus p. Solid lines correspond to test accuracy and broken lines correspond to train accuracy. In all cases, our proposed EGD method groks after only a few epochs, while all the other methods stagnate for a long period before eventually grokking. Refer to Section 5 for details and to Appendix B for the hyperparameters used.
Figure 3: Results on Sparse Parity Problem. Solid lines correspond to test accuracy and broken lines correspond to train accuracy. All three plots show that our method (EGD) groks significantly faster than other methods. Refer to Section 5 for details on the experimental setup and to Appendix B for the hyperparameters used.
Figure 4: Ill-conditioned Gradient Spectra cause delayed generalization. We consider the problem of learning addition modulo 97 from data, with a two-layer ReLU neural network. From the start of optimization through to the end, the gradient matrix G for the hidden layer has a poor condition number. Here, we see that the largest singular value (corresponding to a fast direction) is much larger than the smallest (corre…
Figure 5: A Toy Setup which Induces Stagnation in Gradient Descent (GD). Training data points correspond to circles and test data points correspond to stars (middle region). The broken lines correspond to the large margin of the training data (their separation is 2s), while the solid line is the ground-truth decision boundary x^(1) = 0. The variance of the slow feature x^(2) scales like ε ≪ 1. GD would quickly fin…
Figure 6 (text captured from the paper body rather than a caption): where X ∈ R^(n×d) is the design matrix with rows x_1, …, x_n, and Y = (y_1, …, y_n) ∈ {±1}^n is the response vector. We choose this loss function because it leads to tractable analysis while retaining the same phenomenology we would get using the logistic loss function, for example. 3.1 Vanilla Gradient-Descent Dynamics: With step size η, the vanilla gradient descent (GD) on loss (1) gives the following…
Figure 6: Grokking on the Toy Problem. Solid lines correspond to experimental results, while broken lines correspond to our theory (Theorem 1). The initialization w(0) is such that ∥w(0)∥ = ζ. Thus, the scalar ζ > 0 controls the size of the initialization. Left. As predicted by Theorem 1 and Corollary 1, large initialization leads to delayed generalization in vanilla GD, i.e. long plateaus (of length k∗ ≍ 1/ε) of…
Abstract

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all the principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and the sparse parity problem, on which this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that grokking arises from asymmetric SGD speeds along principal singular directions of the gradients. It introduces Egalitarian Gradient Descent (EGD), a normalization of gradients that equalizes dynamics across these directions (presented as a modified natural gradient method). Theoretical analysis shows asymmetry can induce plateaus in a controlled setting, while experiments on modular addition and sparse parity demonstrate that EGD eliminates or substantially shortens the stagnation phase, sometimes removing it entirely.

Significance. If the causal attribution to principal-direction asymmetry holds and the empirical gains transfer, the work supplies both a mechanistic account of grokking and a lightweight optimizer that could shorten training on tasks known to exhibit long plateaus. The explicit link to natural gradient descent and the parameter-free character of the equalization step are strengths; however, significance depends on whether the observed acceleration is specifically due to equalizing true singular directions rather than a generic rescaling effect.

major comments (2)
  1. [Experiments (modular addition and sparse parity sections)] The central empirical claim (elimination of stagnation on modular addition and sparse parity) rests on the assumption that the observed speedup is caused by equalizing speeds along the actual principal directions of the gradient. No ablation is reported that compares EGD to a simple isotropic rescaling or to a random orthonormal basis normalization; without this control it is impossible to rule out that the benefit arises from generic preconditioning rather than the egalitarian mechanism on the true singular vectors.
  2. [Theoretical analysis] The theoretical section demonstrates that asymmetric speeds can induce grokking in a low-dimensional linear model, yet the transfer to the neural-network setting assumes the same causal pathway without direct measurement of how the principal directions of the gradient matrix align with the memorization-to-generalization transition in the trained models.
minor comments (2)
  1. [Method description] The implementation details for computing the principal directions (full SVD, randomized SVD, or low-rank approximation) and the frequency of recomputation are not stated; these choices affect both computational cost and whether the method remains practical for larger models.
  2. [Empirical results] Error bars or multiple random seeds are not mentioned in the reported curves; given the known sensitivity of grokking timing to initialization and data ordering, statistical reliability of the plateau-removal claim should be quantified.
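On the implementation question raised in the first minor comment, one standard linear-algebra identity is relevant: for a gradient matrix G with full row rank, setting all singular values to 1 gives the same matrix as multiplying G by the inverse square root of G Gᵀ. The sketch below checks that identity numerically. Whether the paper uses either route is not stated in the material above, so both function names and the matrix shape are assumptions.

```python
import numpy as np

def whiten_via_svd(G):
    """U @ Vt from the thin SVD of G (all singular values set to 1)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def whiten_via_gram(G):
    """(G G^T)^(-1/2) @ G via a symmetric eigendecomposition.

    Assumes G has full row rank so that G @ G.T is positive definite.
    """
    evals, evecs = np.linalg.eigh(G @ G.T)
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # scale column j by 1/sqrt(evals[j])
    return inv_sqrt @ G

rng = np.random.default_rng(1)
G = rng.normal(size=(4, 7))                         # wide matrix, full row rank almost surely
print(np.allclose(whiten_via_svd(G), whiten_via_gram(G)))
```

The Gram route only needs an eigendecomposition of the smaller square matrix, which is one reason the choice of method and the recomputation frequency bear on cost for larger layers.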

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate the revisions we will incorporate to strengthen the manuscript.

  1. Referee: [Experiments (modular addition and sparse parity sections)] The central empirical claim (elimination of stagnation on modular addition and sparse parity) rests on the assumption that the observed speedup is caused by equalizing speeds along the actual principal directions of the gradient. No ablation is reported that compares EGD to a simple isotropic rescaling or to a random orthonormal basis normalization; without this control it is impossible to rule out that the benefit arises from generic preconditioning rather than the egalitarian mechanism on the true singular vectors.

    Authors: We agree that the absence of these controls leaves open the possibility of a generic preconditioning effect. In the revised manuscript we will add ablation experiments on both the modular addition and sparse parity tasks that compare EGD against (i) isotropic rescaling of the gradient by its Frobenius norm and (ii) normalization with respect to a random orthonormal basis. These controls will quantify whether the observed acceleration is specific to equalization along the estimated principal singular directions. revision: yes

  2. Referee: [Theoretical analysis] The theoretical section demonstrates that asymmetric speeds can induce grokking in a low-dimensional linear model, yet the transfer to the neural-network setting assumes the same causal pathway without direct measurement of how the principal directions of the gradient matrix align with the memorization-to-generalization transition in the trained models.

    Authors: The linear model is intended as a minimal setting that isolates the effect of asymmetric singular-direction speeds. In the neural-network experiments we demonstrate that EGD, which explicitly equalizes those speeds, eliminates the plateau. To make the causal link more direct we will add, in the revision, plots that track the alignment between the top singular vectors of the gradient matrix and the features that emerge at the memorization-to-generalization transition, thereby providing empirical support for the assumed pathway. revision: yes
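The two controls promised in the first response can be sketched as follows. The interpretation of "normalization with respect to a random orthonormal basis" is an assumption made here: express the gradient in an orthonormal basis Q, rescale each coordinate row to unit norm, and map back. Under that reading, choosing Q as the left singular vectors reproduces the equalized direction, while a random Q serves as the control. All function names are hypothetical.

```python
import numpy as np

def isotropic_rescale(G, eps=1e-12):
    """Control (i): divide the whole gradient by its Frobenius norm."""
    return G / (np.linalg.norm(G) + eps)

def normalize_in_basis(G, Q, eps=1e-12):
    """Control (ii), assumed form: unit-normalize the rows of Q.T @ G, then map back."""
    coords = Q.T @ G
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    return Q @ (coords / (norms + eps))

def random_orthonormal_basis(m, rng):
    """Orthonormal basis from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    return Q

rng = np.random.default_rng(2)
G = rng.normal(size=(4, 6)) * np.array([[1.0], [0.3], [0.05], [0.01]])  # uneven rows
U, s, Vt = np.linalg.svd(G, full_matrices=False)

iso = isotropic_rescale(G)
rand = normalize_in_basis(G, random_orthonormal_basis(4, rng))
true_basis = normalize_in_basis(G, U)

print("cond(G):            ", np.linalg.cond(G))
print("cond(isotropic):    ", np.linalg.cond(iso))
print("cond(random basis): ", np.linalg.cond(rand))
print("cond(true basis):   ", np.linalg.cond(true_basis))
```

Isotropic rescaling leaves the condition number unchanged by construction, so any speedup it shows would point to a step-size effect rather than to equalization along singular directions.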

Circularity Check

0 steps flagged

EGD derivation is self-contained with independent empirical validation; no load-bearing reduction to inputs.


The paper first identifies asymmetric SGD speeds along gradient principal directions as a cause of grokking via separate empirical observations and a controlled theoretical setting. EGD is then explicitly constructed as a normalization (via SVD or equivalent) that forces equal speeds along those directions, framed as a modified natural gradient method. Faster grokking and plateau removal are demonstrated empirically on modular addition and sparse parity tasks using standard train/test splits. No equation or claim reduces the success metric to a quantity fitted from the same data, nor does any central step rely on a self-citation chain for its justification. The derivation remains independent of the target grokking outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the standard assumption that the gradient matrix admits a well-defined set of principal directions whose relative speeds control generalization timing. No new entities are introduced and no parameters appear to be fitted specifically to grokking metrics.

axioms (1)
  • Domain assumption: gradient descent dynamics can be decomposed into independent evolution along principal directions of the gradient covariance.
    Invoked when the authors state that grokking is induced by asymmetric speeds along these directions.
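
For intuition about this axiom, the decomposition holds exactly in the textbook least-squares case. The derivation below is a sketch under that assumption; the loss, the matrix $\Sigma$, and the minimizer $w^\star$ are introduced here for illustration and may differ from the paper's own loss and notation.

```latex
% Assumed setting: L(w) = \frac{1}{2n}\|Xw - Y\|^2, with \Sigma = \frac{1}{n} X^\top X
% = \sum_i \lambda_i v_i v_i^\top and w^\star any minimizer (so \nabla L(w) = \Sigma (w - w^\star)).
\begin{align*}
  w_{k+1} - w^\star &= (I - \eta \Sigma)\,(w_k - w^\star), \\
  \langle v_i,\, w_k - w^\star \rangle &= (1 - \eta \lambda_i)^k \,\langle v_i,\, w_0 - w^\star \rangle, \\
  k_i(\delta) &\approx \frac{\log(1/\delta)}{\eta \lambda_i}
    \quad \text{steps to shrink direction } i \text{ by a factor } \delta \ (\eta\lambda_i \ll 1).
\end{align*}
```

Each direction $v_i$ evolves independently at a rate set by $\lambda_i$, so a direction with $\lambda_i$ of order $\varepsilon$ needs on the order of $1/\varepsilon$ steps. For nonlinear networks the decomposition is only an approximation, which is why it is listed as an assumption.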

pith-pipeline@v0.9.0 · 5744 in / 1328 out tokens · 40251 ms · 2026-05-21T21:09:27.903927+00:00 · methodology


