Distribute the K active frequencies evenly across H heads (each head handles K/H frequencies)

computes cos(2πk(a + b)/p) = ⟨e_a^{(k)}, e_b^{(k)}⟩ via a dot-product attention over the two-token sequence [E[:, a], E[:, b]] · 2023

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

cs.AI · 2026-03-05 · conditional · novelty 7.0

Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.

citing papers explorer

Showing 1 of 1 citing paper.

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization cs.AI · 2026-03-05 · conditional · none · ref 16
