Grokking and Generalization Collapse: Insights from HTSR theory
We study the well-known grokking phenomenon in neural networks (NNs) using a 3-layer MLP trained on a 1k-sample subset of MNIST, with and without weight decay, and discover a novel third phase, \emph{anti-grokking}, that occurs very late in training and resembles, but is distinct from, the familiar \emph{pre-grokking} phase: test accuracy collapses while training accuracy stays perfect. This late-stage collapse is also distinct from the grokking phase and is not detected by other proposed grokking progress measures. Leveraging Heavy-Tailed Self-Regularization (HTSR) through the open-source WeightWatcher tool, we show that the HTSR layer quality metric $\alpha$ alone delineates all three phases, whereas the best competing metrics detect only the first two. \emph{Anti-grokking} is revealed by training for $10^7$ and is invariably heralded by $\alpha < 2$ and the appearance of \emph{Correlation Traps}: outlier singular values in the randomized layer weight matrices that make the layer weight matrix atypical and signal overfitting of the training set. Such traps are verified by visual inspection of the layer-wise empirical spectral densities and by Kolmogorov--Smirnov tests on randomized spectra. Comparative metrics, including activation sparsity, absolute weight entropy, circuit complexity, and $l^2$ weight norms, track pre-grokking and grokking but fail to distinguish grokking from anti-grokking. This discovery provides a way to measure overfitting and generalization collapse without direct access to the test data. These results strengthen the claim that the HTSR $\alpha$ provides a universal layer-convergence target at $\alpha \approx 2$ and underscore the value of the HTSR $\alpha$ metric as a measure of generalization.
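The two diagnostics named in the abstract, a power-law exponent $\alpha$ for a layer's spectrum and a check for outlier singular values after randomizing the weights, can be illustrated with a minimal NumPy sketch. This is not the WeightWatcher implementation. The function names, the fixed `xmin`, the element-wise shuffle, the Marchenko-Pastur edge comparison, and the tolerance `tol` are all assumptions made for illustration. In particular, the abstract does not say how the tail cutoff is chosen, so here the caller passes it in.

```python
import numpy as np


def esd(W):
    """Eigenvalues of X = W^T W / N for an N x M weight matrix with N >= M."""
    W = np.asarray(W, dtype=float)
    if W.shape[0] < W.shape[1]:
        W = W.T  # orient so that rows >= columns
    n_rows = W.shape[0]
    singular_values = np.linalg.svd(W, compute_uv=False)
    return singular_values**2 / n_rows


def hill_alpha(eigs, xmin):
    """Maximum-likelihood exponent for a continuous power-law tail.

    Assumes the density falls off as x**(-alpha) for x >= xmin.
    xmin is a caller-supplied cutoff (hypothetical choice, not tuned here).
    """
    eigs = np.asarray(eigs, dtype=float)
    tail = eigs[eigs >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))


def has_correlation_trap(W, rng, tol=1.1):
    """Flag an outlier eigenvalue in the element-wise shuffled matrix.

    Shuffling destroys correlations but keeps the entry values, so the
    shuffled spectrum should sit inside the Marchenko-Pastur bulk unless
    a few unusually large entries dominate. tol is an illustrative margin
    above the bulk edge, not a calibrated threshold.
    """
    W = np.asarray(W, dtype=float)
    if W.shape[0] < W.shape[1]:
        W = W.T
    n_rows, n_cols = W.shape
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    lam_max = esd(W_rand).max()
    # Upper Marchenko-Pastur edge for i.i.d. entries with variance W.var()
    mp_edge = W.var() * (1.0 + np.sqrt(n_cols / n_rows)) ** 2
    return bool(lam_max > tol * mp_edge)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(300, 100))   # stand-in for a layer weight matrix
    eigs = esd(W)
    print("alpha (xmin = median eigenvalue):", hill_alpha(eigs, np.median(eigs)))
    print("correlation trap:", has_correlation_trap(W, rng))
```

Under the abstract's reading, a fitted $\alpha$ below 2 together with a positive trap check would be the warning sign for anti-grokking. The thresholds in this sketch are placeholders and would need calibration against the paper's own procedure.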
Forward citations
Cited by 6 Pith papers
- Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
  Slingshot loss spikes result from floating-point precision limits that round correct-class gradients to zero, triggering Numerical Feature Inflation and breaking gradient zero-sum constraints.
- Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
  Slingshot loss spikes arise from floating-point precision limits that round correct-class gradients to zero, breaking zero-sum constraints and driving exponential parameter growth through numerical feature inflation.
- Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
  Slingshot loss spikes are produced by low-precision arithmetic that breaks the zero-sum gradient constraint and drives exponential growth via Numerical Feature Inflation.
- Patnaik-Pearson intrinsic dimension for internal representations of neural networks
  Introduces the Patnaik-Pearson intrinsic dimension estimator, relates it to HTSR/SETOL for Pareto spectral densities, and applies it to measure embedding dimension evolution in BERT-base and DeepSeek-R1-Distill-Qwen-1.
- Patnaik-Pearson intrinsic dimension for internal representations of neural networks
  Introduces the Patnaik-Pearson intrinsic dimension estimator, proves some of its properties, relates it to HTSR/SETOL for Pareto spectra, and applies it to track embedding dimension evolution in BERT-base and DeepSeek...
- Fast Generalization after Interpolation via Critically Damped Momentum Optimization
  GROKtimizer combines rapid interpolation with critically damped momentum for post-interpolation norm minimization, yielding quadratic speedup over gradient descent under a local quadratic model and better generalizati...