Omnigrok: Grokking Beyond Algorithmic Data

Eric J. Michaud; Max Tegmark; Ziming Liu

arxiv: 2210.01117 · v2 · pith:IWO6IY3Wnew · submitted 2022-10-03 · cs.LG · cs.AI · physics.data-an · stat.ME · stat.ML

classification cs.LG · cs.AI · physics.data-an · stat.ME · stat.ML

keywords grokking · algorithmic · data · datasets · training · able · dependence · losses

Abstract

Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the "LU mechanism" because training and test losses (against model weight norm) typically resemble "L" and "U", respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc. Guided by the intuitive picture, we are able to induce grokking on tasks involving images, language and molecules. In the reverse direction, we are able to eliminate grokking for algorithmic datasets. We attribute the dramatic nature of grokking for algorithmic datasets to representation learning.

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
cs.LG 2026-02 unverdicted novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining
cs.LG 2026-06 unverdicted novelty 7.0

During pretraining, language models exhibit natural ungrokking where learned rules are forgotten based on their support frequency in the corpus, with asymmetric editability of rule survival.

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy
cs.LG 2026-06 conditional novelty 7.0

Grokking delay under cross-entropy is mediated primarily by logit scale and resulting softmax saturation, with weight norm acting only as an upstream handle that adds 1-2% beyond the scale.

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
cs.LG 2026-03 unverdicted novelty 7.0

The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
cs.LG 2026-02 unverdicted novelty 7.0

Multi-task grokking in Transformers produces staggered generalization, low-dimensional manifolds, weight-decay phase structure, holographic solutions, and transverse redundancy.

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
cs.LG 2025-10 unverdicted novelty 7.0

EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
cs.CV 2026-06 unverdicted novelty 5.0

SingGuard presents a policy-adaptive multimodal LLM guardrail family with hybrid reasoning regimes and a new benchmark of 56,340 examples, claiming SOTA F1 across 35 datasets and improved policy adherence under runtim...

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
cs.CV 2026-06 unverdicted novelty 5.0

SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks
cs.LG 2026-06 unverdicted novelty 5.0

Grokking in linear DNNs is explained as hysteresis in L2 phase transitions where SGD noise enables escape from low-accuracy metastable phases with Arrhenius scaling; the same mechanism is suggested for nonlinear networks.

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
cs.LG 2026-05 unverdicted novelty 5.0

Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories
cs.LG 2026-04 unverdicted novelty 5.0

Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.

On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime
cs.LG 2026-01 unverdicted novelty 5.0

Preconditioned gradient descent mitigates spectral bias and reduces grokking delays by enabling uniform parameter space exploration in the NTK regime, confirming grokking as a transition to the rich regime.

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking
cs.LG 2026-04 unverdicted novelty 4.0

Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.