The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology

2026 · cs.LG · arXiv 2603.05228

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology and observing the resulting training dynamics. We study grokking, the delayed generalization of Transformers trained on cyclic modular addition (Z_p), and investigate whether specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology that enforces L2 normalization throughout the residual stream and uses an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration reflects a task-specific geometric alignment rather than generic optimization stabilization, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests that eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
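The two interventions named in the abstract can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the single-layer layout, the `a b =` token format, the absence of positional embeddings and masking, the row-normalized unembedding, and the temperature value of 10 are all hypothetical choices made here for clarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project vectors onto the unit sphere along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def uniform_attention(values):
    """Replace softmax(QK^T) with uniform weights: every position
    receives the mean of all value vectors (a CBOW-style aggregator)."""
    seq_len = values.shape[0]
    weights = np.full((seq_len, seq_len), 1.0 / seq_len)
    return weights @ values

def spherical_logits(hidden, unembed, temperature=10.0):
    """Cosine-similarity logits with a fixed temperature (assumed form).
    Magnitudes of hidden and unembed cannot change the output."""
    return temperature * (l2_normalize(hidden) @ l2_normalize(unembed).T)

def forward(tokens, embed, w_value, unembed, temperature=10.0):
    """One hypothetical layer: normalized embeddings, uniform mixing,
    residual add, renormalization, then bounded logits at the last token."""
    x = l2_normalize(embed[tokens])           # (seq, d), unit norm
    mixed = uniform_attention(x @ w_value)    # (seq, d), no query-key routing
    x = l2_normalize(x + mixed)               # keep the residual stream on the sphere
    return spherical_logits(x[-1], unembed, temperature)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, d = 7, 16                              # modulus and width (illustrative)
    embed = rng.normal(size=(p + 1, d))       # p residues plus an "=" token
    w_value = rng.normal(size=(d, d))
    unembed = rng.normal(size=(p, d))
    logits = forward(np.array([3, 5, p]), embed, w_value, unembed)
    print(logits.shape)
```

In this sketch every logit is bounded in absolute value by the temperature, so the network cannot lower the loss simply by growing weight norms. Because the uniform mixing step ignores token order, the prediction read out at the `=` position is the same for `a b =` and `b a =`, which is one way to picture an architectural prior that matches the symmetry of a commutative task.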

representative citing papers

Topological Signatures of Grokking

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Persistent homology detects a sharp increase in maximum and total H1 persistence during grokking on modular arithmetic, offering a topological diagnostic that links representation geometry to generalization.

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

cs.AI · 2026-03-05 · conditional · novelty 7.0 · 2 refs

Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.
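A toy calculation shows why a delay of this form is plausible. It is not the cited paper's derivation: it assumes that weight decay alone shrinks the parameters by a factor (1 - γ_eff) per step, so the squared norm decays roughly as exp(-2 γ_eff t) and the time to fall from ‖θ_mem‖² to ‖θ_post‖² is log(‖θ_mem‖² / ‖θ_post‖²) / (2 γ_eff). The function names and numeric values below are hypothetical.

```python
import math

def closed_form_delay(norm_sq_mem, norm_sq_post, gamma_eff):
    """Continuous-time estimate under pure exponential norm decay (assumption)."""
    return math.log(norm_sq_mem / norm_sq_post) / (2.0 * gamma_eff)

def simulated_delay(norm_sq_mem, norm_sq_post, gamma_eff):
    """Count discrete steps of theta <- (1 - gamma_eff) * theta until the
    squared norm drops to the post-grokking level."""
    norm_sq, steps = norm_sq_mem, 0
    while norm_sq > norm_sq_post:
        norm_sq *= (1.0 - gamma_eff) ** 2
        steps += 1
    return steps

if __name__ == "__main__":
    # Illustrative values only: a 100x drop in squared norm, gamma_eff = 1e-3.
    print(closed_form_delay(100.0, 1.0, 1e-3))
    print(simulated_delay(100.0, 1.0, 1e-3))
```

Under these assumptions the simulated step count and the closed-form estimate agree closely, and halving γ_eff roughly doubles the delay, consistent with the γ_eff^{-1} scaling in the stated law.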

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

cs.LG · 2026-05-19 · conditional · novelty 6.0

Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.

Latent Trajectory Dynamics in Large Language Models: A Manifold Evolution Framework with Empirical Validation

cs.CL · 2025-05-24 · unverdicted · novelty 6.0

DMET models LLM generation as controlled dynamical trajectories on a semantic manifold, with three proxy metrics that predict output quality and support adaptive decoding to lower perplexity.

citing papers explorer

Topological Signatures of Grokking · cs.LG · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization · cs.AI · 2026-03-05 · conditional · none · ref 14 · 2 links · internal anchor
Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics · cs.LG · 2026-05-19 · conditional · none · ref 41 · internal anchor
Latent Trajectory Dynamics in Large Language Models: A Manifold Evolution Framework with Empirical Validation · cs.CL · 2025-05-24 · unverdicted · none · ref 18 · internal anchor
