Grokking as the transition from lazy to rich training dynamics. arXiv preprint arXiv:2310.06110

Tanishq Kumar, Blake Bordelon, Samuel J Gershman, Cengiz Pehlevan · arXiv 2310.06110

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

The grokking delay in encoder-decoder models on one-step Collatz prediction stems from the decoder's inability to use early-learned encoder representations of parity and residue structure; the numeral base acts as a strong inductive bias that can raise accuracy from failure to 99.8%.

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

cs.LG · 2026-02-19 · unverdicted · novelty 7.0

Multi-task grokking in Transformers produces staggered generalization, low-dimensional manifolds, weight-decay phase structure, holographic solutions, and transverse redundancy.

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

cs.LG · 2026-04-28 · unverdicted · novelty 4.0

Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.

citing papers explorer

Showing 4 of 4 citing papers.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking · cs.LG · 2026-02-18 · unverdicted · none · ref 3
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior · cs.LG · 2026-03-30 · unverdicted · none · ref 11
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure · cs.LG · 2026-02-19 · unverdicted · none · ref 4
Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking · cs.LG · 2026-04-28 · unverdicted · none · ref 1
