Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
Pith reviewed 2026-05-21 21:09 UTC · model grok-4.3
The pith
Equalizing update speeds across gradient directions removes grokking plateaus.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Grokking is induced by asymmetric speeds of stochastic gradient descent along different principal directions of the gradients. Normalizing the gradients so that dynamics along all principal directions evolve at exactly the same speed produces egalitarian gradient descent, a carefully modified form of natural gradient descent. This method makes the model grok much faster and, in some cases, removes the stagnation phase entirely. The approach is demonstrated to eliminate plateaus on classical arithmetic tasks such as modular addition and sparse parity learning.
What carries the argument
Egalitarian gradient descent (EGD), which normalizes gradients so the learning dynamics along every principal direction advance at identical speed.
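The normalization can be written out with a singular value decomposition. The paper passage quoted further down this page defines the normalized gradient as G̃ := (GGᵀ)^{-1/2} G. The derivation below is a sketch that assumes G is an m×n matrix with full row rank (m ≤ n), so that GGᵀ is invertible; the thin-SVD symbols U, Σ, V are notation introduced here, not taken from the paper.

```latex
G = U \Sigma V^\top, \qquad
U \in \mathbb{R}^{m \times m},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_m),\ \sigma_i > 0,\quad
V \in \mathbb{R}^{n \times m},\quad
U^\top U = V^\top V = I_m

G G^\top = U \Sigma^2 U^\top
\quad\Longrightarrow\quad
(G G^\top)^{-1/2} = U \Sigma^{-1} U^\top

\tilde{G} = (G G^\top)^{-1/2} G
          = U \Sigma^{-1} U^\top \, U \Sigma V^\top
          = U V^\top
```

Under these assumptions every singular value of G̃ equals 1, which is the sense in which all principal directions advance at the same speed.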
If this is right
- Grokking plateaus can be removed or greatly shortened by enforcing uniform speed across principal gradient directions.
- The same normalization works as a drop-in change to existing training loops on arithmetic tasks.
- In some settings the stagnation phase disappears completely rather than merely shortening.
- EGD provides a concrete link between optimization geometry and the timing of generalization jumps.
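As a rough illustration of what a drop-in normalization step might look like, the NumPy sketch below replaces a gradient matrix G by U Vᵀ from its thin SVD, which matches (GGᵀ)^{-1/2} G when G has full row rank. The function names `egalitarian_gradient` and `egd_step`, the `rtol` cutoff for negligible singular values, and the use of a full SVD at every step are assumptions made for this example; the page does not state how the paper computes the principal directions.

```python
import numpy as np


def egalitarian_gradient(G, rtol=1e-10):
    """Set every non-negligible singular value of G to 1 (returns U @ Vt)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # Directions with (numerically) zero singular value are dropped
    # rather than amplified; this cutoff is an assumption of the sketch.
    tol = rtol * s.max() if s.size else 0.0
    keep = s > tol
    return U[:, keep] @ Vt[keep, :]


def egd_step(W, G, lr):
    """One plain descent step along the equalized gradient (no momentum)."""
    return W - lr * egalitarian_gradient(G)


# Build a gradient whose singular values span three orders of magnitude.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V0, _ = np.linalg.qr(rng.standard_normal((6, 4)))
G = U0 @ np.diag([10.0, 1.0, 0.1, 0.01]) @ V0.T

G_tilde = egalitarian_gradient(G)
print(np.linalg.svd(G, compute_uv=False))        # unequal singular values
print(np.linalg.svd(G_tilde, compute_uv=False))  # all close to 1
```

In a training loop, `egd_step` would be applied per weight matrix in place of the usual `W - lr * G` update; how this interacts with momentum, weight decay, or mini-batch noise is not covered by this sketch.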
Where Pith is reading between the lines
- The result suggests many observed grokking delays may be optimizer artifacts rather than fundamental limits of the data or architecture.
- Similar speed equalization could be tested on other regimes where training accuracy rises well before test accuracy.
- The method invites direct comparisons with other preconditioned optimizers to isolate which normalization choices matter most for plateau removal.
Load-bearing premise
Asymmetric speeds of gradient descent along principal directions are the primary cause of grokking plateaus, and equalizing those speeds removes the stagnation without creating new optimization problems.
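The premise can be illustrated on a toy problem. The sketch below runs plain gradient descent on a diagonal quadratic, 0.5 · Σ cᵢ eᵢ², where each direction shrinks by a factor (1 − lr · cᵢ) per step, and records when each direction first falls below a tolerance. The quadratic, the curvature values, and the function name `steps_to_tolerance` are assumptions chosen for illustration; this is not the paper's theoretical setting.

```python
import numpy as np


def steps_to_tolerance(curvatures, lr, tol=1e-3, max_steps=100_000):
    """Gradient descent on 0.5 * sum_i c_i * e_i**2, started from e_i = 1.

    Returns, per direction, the first step at which |e_i| < tol
    (or -1 if that never happens within max_steps).
    """
    c = np.asarray(curvatures, dtype=float)
    e = np.ones_like(c)
    first_hit = np.full(c.shape, -1, dtype=int)
    for step in range(1, max_steps + 1):
        e = e - lr * c * e  # gradient of the quadratic is c * e
        newly = (np.abs(e) < tol) & (first_hit < 0)
        first_hit[newly] = step
        if np.all(first_hit >= 0):
            break
    return first_hit


# One fast direction (curvature 1) and one slow direction (curvature 0.01).
hits = steps_to_tolerance([1.0, 0.01], lr=0.5)
print(hits)  # the slow direction needs over a hundred times more steps
```

If generalization depends on a slow direction like the second one, the gap between the two hitting times would show up as a plateau, which is the kind of asymmetry the premise refers to.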
What would settle it
Training the same model on modular addition with egalitarian gradient descent and still observing a long plateau before test accuracy rises would show the speed-equalization step does not address the cause.
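One way to make this test quantitative is to measure the plateau directly from logged accuracy curves. The helper below is a minimal sketch: the name `plateau_length`, the 0.99 accuracy threshold, and the assumption that train and test accuracy are logged at the same steps are all choices made for this example.

```python
def plateau_length(train_acc, test_acc, threshold=0.99):
    """Steps between train and test accuracy first reaching `threshold`.

    Returns None if either curve never reaches the threshold, which
    includes the case of a plateau that has not ended yet.
    """
    def first_reach(curve):
        for step, value in enumerate(curve):
            if value >= threshold:
                return step
        return None

    t_train = first_reach(train_acc)
    t_test = first_reach(test_acc)
    if t_train is None or t_test is None:
        return None
    # Clamp at zero in case test accuracy crosses the threshold first.
    return max(t_test - t_train, 0)


train = [0.5, 1.0, 1.0, 1.0, 1.0]
test = [0.1, 0.1, 0.1, 0.2, 1.0]
print(plateau_length(train, test))  # 3
```

A long plateau under egalitarian gradient descent would then appear as a large value here, or as None if test accuracy never rises within the training budget.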
Original abstract
Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all the principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and the sparse parity problem, on which this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that grokking arises from asymmetric SGD speeds along principal singular directions of the gradients. It introduces Egalitarian Gradient Descent (EGD), a normalization of gradients that equalizes dynamics across these directions (presented as a modified natural gradient method). Theoretical analysis shows asymmetry can induce plateaus in a controlled setting, while experiments on modular addition and sparse parity demonstrate that EGD eliminates or substantially shortens the stagnation phase, sometimes removing it entirely.
Significance. If the causal attribution to principal-direction asymmetry holds and the empirical gains transfer, the work supplies both a mechanistic account of grokking and a lightweight optimizer that could shorten training on tasks known to exhibit long plateaus. The explicit link to natural gradient descent and the parameter-free character of the equalization step are strengths; however, significance depends on whether the observed acceleration is specifically due to equalizing true singular directions rather than a generic rescaling effect.
major comments (2)
- [Experiments (modular addition and sparse parity sections)] The central empirical claim (elimination of stagnation on modular addition and sparse parity) rests on the assumption that the observed speedup is caused by equalizing speeds along the actual principal directions of the gradient. No ablation is reported that compares EGD to a simple isotropic rescaling or to a random orthonormal basis normalization; without this control it is impossible to rule out that the benefit arises from generic preconditioning rather than the egalitarian mechanism on the true singular vectors.
- [Theoretical analysis] The theoretical section demonstrates that asymmetric speeds can induce grokking in a low-dimensional linear model, yet the transfer to the neural-network setting assumes the same causal pathway without direct measurement of how the principal directions of the gradient matrix align with the memorization-to-generalization transition in the trained models.
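The two controls named in the first major comment could be sketched as follows. The isotropic rescaling divides the gradient by its Frobenius norm, which changes the overall step size but leaves the ratios between singular values untouched. The random-basis control is one possible reading of "random orthonormal basis normalization": it equalizes the gradient's components along a random orthonormal basis instead of along the singular vectors. Both function names and that interpretation are assumptions of this example, not the referee's or the authors' definitions.

```python
import numpy as np


def isotropic_rescale(G, eps=1e-12):
    """Divide G by its Frobenius norm; singular-value ratios are unchanged."""
    return G / (np.linalg.norm(G) + eps)


def random_basis_equalize(G, rng, eps=1e-12):
    """Equalize component norms along a random orthonormal left basis.

    Assumed interpretation: rotate G into a random basis Q, rescale each
    coordinate row to unit norm, and rotate back.
    """
    m = G.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    C = Q.T @ G  # coordinates of G along the random directions
    C = C / (np.linalg.norm(C, axis=1, keepdims=True) + eps)
    return Q @ C


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))
print(np.linalg.svd(isotropic_rescale(G), compute_uv=False))
print(np.linalg.svd(random_basis_equalize(G, rng), compute_uv=False))
```

Neither control forces the singular values of the result to be equal in general, so comparing them against the SVD-based normalization would separate the effect of equalizing true singular directions from that of generic rescaling.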
minor comments (2)
- [Method description] The implementation details for computing the principal directions (full SVD, randomized SVD, or low-rank approximation) and the frequency of recomputation are not stated; these choices affect both computational cost and whether the method remains practical for larger models.
- [Empirical results] Error bars or multiple random seeds are not mentioned in the reported curves; given the known sensitivity of grokking timing to initialization and data ordering, statistical reliability of the plateau-removal claim should be quantified.
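For the second minor comment, one simple way to quantify reliability is to record the step at which each seed groks and report the spread. The sketch below is an assumption-laden example: the function name `summarize_grok_times`, the convention that a seed which never groks is recorded as None, and the choice of mean and sample standard deviation are not taken from the paper.

```python
import statistics


def summarize_grok_times(grok_steps):
    """Summarize per-seed grokking steps; None marks a seed that never grokked."""
    finished = [s for s in grok_steps if s is not None]
    n = len(finished)
    return {
        "n_seeds": len(grok_steps),
        "n_grokked": n,
        "mean": statistics.fmean(finished) if n else None,
        # Sample standard deviation needs at least two finished seeds.
        "std": statistics.stdev(finished) if n > 1 else None,
    }


print(summarize_grok_times([100, 200, 300, None]))
```

Reporting `n_grokked` alongside the mean avoids silently dropping seeds that never left the plateau.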
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and indicate the revisions we will incorporate to strengthen the manuscript.
- Referee: [Experiments (modular addition and sparse parity sections)] The central empirical claim (elimination of stagnation on modular addition and sparse parity) rests on the assumption that the observed speedup is caused by equalizing speeds along the actual principal directions of the gradient. No ablation is reported that compares EGD to a simple isotropic rescaling or to a random orthonormal basis normalization; without this control it is impossible to rule out that the benefit arises from generic preconditioning rather than the egalitarian mechanism on the true singular vectors.
  Authors: We agree that the absence of these controls leaves open the possibility of a generic preconditioning effect. In the revised manuscript we will add ablation experiments on both the modular addition and sparse parity tasks that compare EGD against (i) isotropic rescaling of the gradient by its Frobenius norm and (ii) normalization with respect to a random orthonormal basis. These controls will quantify whether the observed acceleration is specific to equalization along the estimated principal singular directions.
  Revision: yes
- Referee: [Theoretical analysis] The theoretical section demonstrates that asymmetric speeds can induce grokking in a low-dimensional linear model, yet the transfer to the neural-network setting assumes the same causal pathway without direct measurement of how the principal directions of the gradient matrix align with the memorization-to-generalization transition in the trained models.
  Authors: The linear model is intended as a minimal setting that isolates the effect of asymmetric singular-direction speeds. In the neural-network experiments we demonstrate that EGD, which explicitly equalizes those speeds, eliminates the plateau. To make the causal link more direct we will add, in the revision, plots that track the alignment between the top singular vectors of the gradient matrix and the features that emerge at the memorization-to-generalization transition, thereby providing empirical support for the assumed pathway.
  Revision: yes
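The alignment tracking mentioned in the second response could be measured with principal angles between subspaces. The sketch below computes the mean squared cosine between the span of the top-k left singular vectors of a gradient matrix G and the column span of a feature matrix F. The function name `subspace_alignment`, the matrix F (a hypothetical set of feature directions living in the same space as the rows of G index), and the choice of this particular score are assumptions for illustration; the rebuttal does not specify how alignment would be measured.

```python
import numpy as np


def subspace_alignment(G, F, k):
    """Mean squared cosine of the principal angles between two subspaces.

    Subspace 1: span of the top-k left singular vectors of G.
    Subspace 2: column span of F (assumed to have full column rank).
    Returns a value in [0, 1]; 1 means one subspace contains the other.
    """
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    Uk = U[:, :k]
    Qf, _ = np.linalg.qr(F)  # orthonormal basis for the columns of F
    # Singular values of Uk^T Qf are the cosines of the principal angles.
    cosines = np.linalg.svd(Uk.T @ Qf, compute_uv=False)
    return float(np.sum(cosines ** 2) / min(k, Qf.shape[1]))


G = np.diag([5.0, 3.0, 0.1])
F_aligned = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
F_orthogonal = np.array([[0.0], [0.0], [1.0]])
print(subspace_alignment(G, F_aligned, k=2))
print(subspace_alignment(G, F_orthogonal, k=2))
```

Logging this score over training steps, next to test accuracy, would give the kind of plot the authors describe.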
Circularity Check
EGD derivation is self-contained with independent empirical validation; no load-bearing reduction to inputs.
The paper first identifies asymmetric SGD speeds along gradient principal directions as a cause of grokking via separate empirical observations and a controlled theoretical setting. EGD is then explicitly constructed as a normalization (via SVD or equivalent) that forces equal speeds along those directions, framed as a modified natural gradient method. Faster grokking and plateau removal are demonstrated empirically on modular addition and sparse parity tasks using standard train/test splits. No equation or claim reduces the success metric to a quantity fitted from the same data, nor does any central step rely on a self-citation chain for its justification. The derivation remains independent of the target grokking outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Gradient descent dynamics can be decomposed into independent evolution along principal directions of the gradient covariance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "G̃ := (GGᵀ)^{-1/2} G … makes all the singular values equal … dynamics along all the principal directions evolves at exactly the same speed"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.