Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.
Distribute the Kactive frequencies evenly acrossHheads (each head handlesK/Hfrequencies)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization
Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.