Normal alignment is the rank-one Jacobian structure that lets classifiers minimize loss and maximize local robustness in sparse regimes; the paper proves its optimality and uses it to create GrokAlign and RFAMs.
Grokfast: Accelerated grokking by amplifying slow gradients
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 5verdicts
UNVERDICTED 5representative citing papers
ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.
EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.
Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
citing papers explorer
-
The Geometric Structure of Models Learning Sparse Data
Normal alignment is the rank-one Jacobian structure that lets classifiers minimize loss and maximize local robustness in sparse regimes; the paper proves its optimality and uses it to create GrokAlign and RFAMs.
-
ILDR: Geometric Early Detection of Grokking
ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.
-
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.
-
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.
-
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.