Grokfast: Accelerated grokking by amplifying slow gradients

2024 · arXiv 2405.20233

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · baseline 1

citation-polarity summary

background 2 · baseline 1

representative citing papers

The Geometric Structure of Models Learning Sparse Data

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

Normal alignment is the rank-one Jacobian structure that lets classifiers minimize loss and maximize local robustness in sparse regimes; the paper proves its optimality and uses it to create GrokAlign and RFAMs.

ILDR: Geometric Early Detection of Grokking

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.
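
The check described in this summary can be sketched roughly as follows. This is an illustration only, not the paper's implementation: the function names `separation_ratio` and `first_crossing`, the use of Euclidean distance, and the choice of the first checkpoint as the baseline are all assumptions, and the paper's exact definition of ILDR may differ.

```python
from itertools import combinations
from math import dist
from statistics import mean


def separation_ratio(features, labels):
    """Mean inter-class centroid distance divided by mean intra-class scatter.

    `features` is a sequence of equal-length numeric vectors (for example,
    penultimate-layer activations) and `labels` holds the class of each one.
    Requires at least two classes and non-zero scatter.
    """
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    # Per-class centroid: coordinate-wise mean of that class's vectors.
    centroids = {y: [mean(col) for col in zip(*xs)] for y, xs in by_class.items()}
    inter = mean(dist(a, b) for a, b in combinations(centroids.values(), 2))
    intra = mean(dist(x, centroids[y]) for x, y in zip(features, labels))
    return inter / intra


def first_crossing(ratios, factor=2.5):
    """Index of the first checkpoint whose ratio exceeds factor * ratios[0].

    Treating ratios[0] as the baseline is an assumption of this sketch.
    Returns None if the threshold is never crossed.
    """
    baseline = ratios[0]
    for step, ratio in enumerate(ratios):
        if ratio > factor * baseline:
            return step
    return None
```

In use, `separation_ratio` would be evaluated on held-out representations at each saved checkpoint, and `first_crossing` applied to the resulting series to flag the candidate reorganization point.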

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

citing papers explorer

The Geometric Structure of Models Learning Sparse Data

cs.LG · 2026-05-08 · unverdicted · none · ref 32 · 2 links

ILDR: Geometric Early Detection of Grokking

cs.LG · 2026-04-22 · unverdicted · none · ref 2

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

cs.LG · 2025-10-06 · unverdicted · none · ref 4

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

cs.LG · 2026-05-12 · unverdicted · none · ref 9 · 2 links

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

cs.LG · 2026-05-10 · unverdicted · none · ref 35
