arxiv: 2511.01938 · v3 · pith:JR7N5D5Knew · submitted 2025-11-02 · 💻 cs.LG · cs.AI

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

classification 💻 cs.LG cs.AI
keywords learning dynamics generalization grokking weight decay delayed manifold

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
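
The constrained-optimization picture in the abstract can be illustrated with a toy case. The sketch below is not the paper's method or experiment: it assumes an overparameterized linear least-squares model, for which the zero-loss set is an affine subspace and the minimum-norm interpolant is available in closed form via the pseudoinverse. All sizes, hyperparameters, and names (`loss_tol`, `memorization_step`, and so on) are illustrative assumptions. Plain gradient descent with a small weight-decay term first fits the training data, then slowly drifts along the (near) zero-loss set toward the minimum-norm solution.

```python
import numpy as np

# Toy setup; every size and hyperparameter here is an illustrative assumption.
rng = np.random.default_rng(0)
n_samples, n_params = 5, 20            # fewer samples than parameters
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)
w = 3.0 * rng.normal(size=n_params)    # large init, far from the min-norm fit

lr, weight_decay, steps = 0.01, 0.01, 100_000
loss_tol = 1e-3                        # threshold used to call the data "memorized"


def train_loss(weights):
    """Squared-error training loss of the linear model."""
    residual = X @ weights - y
    return 0.5 * float(residual @ residual)


memorization_step = None
norm_at_memorization = None
for step in range(steps):
    grad = X.T @ (X @ w - y)                   # gradient of the training loss
    w = w - lr * (grad + weight_decay * w)     # gradient step plus weight decay
    # Record the first step at which the training loss is below the tolerance.
    if memorization_step is None and train_loss(w) < loss_tol:
        memorization_step = step
        norm_at_memorization = float(np.linalg.norm(w))

# Minimum-norm point on the zero-loss set {w : X w = y} for this linear model.
w_min_norm = np.linalg.pinv(X) @ y

final_loss = train_loss(w)
final_norm = float(np.linalg.norm(w))
gap_to_min_norm = float(np.linalg.norm(w - w_min_norm))

print("memorized at step:", memorization_step)
print("weight norm at memorization:", norm_at_memorization)
print("weight norm at end:", final_norm)
print("distance to min-norm interpolant:", gap_to_min_norm)
```

In this linear case the component of `w` in the null space of `X` shrinks by a factor of `1 - lr * weight_decay` per step, which is why the norm keeps falling long after the loss is near zero. With a nonzero weight-decay coefficient the iterate settles slightly off the exact zero-loss set (at the ridge solution), which is consistent with the abstract's statement that the result holds in the limit of infinitesimally small learning rates and weight decay coefficients.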

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

    cs.LG 2026-05 unverdicted novelty 7.0

    A first-passage time model produces the law T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star) that predicts grokking delays with 17.7% MAPE on held-out AdamW runs after calibrating two parameters on one cell.

  2. Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

    cs.LG 2026-05 conditional novelty 6.0

    Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.
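
The delay law quoted in item 1 can be evaluated directly. The sketch below is a hedged reading of that one-line summary, not code from the cited paper: it assumes the parenthesized prefactor means 1 / (2 · kappa_LL · eta · lambda), and that eta and lambda are the learning rate and weight-decay coefficient, which the summary does not define. The function name, argument names, and example values are hypothetical.

```python
import math


def predicted_grokking_delay(kappa, lr, weight_decay, v_mem, v_star):
    """Evaluate T_grok - T_mem = log(v_mem / v_star) / (2 * kappa * lr * weight_decay).

    Assumes the prefactor in the quoted law is 1 / (2 * kappa * lr * weight_decay);
    all argument names are hypothetical labels for the symbols in the summary.
    """
    if min(kappa, lr, weight_decay, v_mem, v_star) <= 0:
        raise ValueError("all arguments must be positive")
    return math.log(v_mem / v_star) / (2.0 * kappa * lr * weight_decay)


# Illustrative values only; none of these numbers come from the cited paper.
delay = predicted_grokking_delay(kappa=1.0, lr=1e-3, weight_decay=1.0,
                                 v_mem=100.0, v_star=1.0)
delay_half_wd = predicted_grokking_delay(kappa=1.0, lr=1e-3, weight_decay=0.5,
                                         v_mem=100.0, v_star=1.0)
print(delay, delay_half_wd)
```

Under this reading the predicted delay scales inversely with the product of learning rate and weight decay, so halving the weight-decay coefficient doubles the delay.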