Recognition: 2 theorem links
Grokking as Dimensional Phase Transition in Neural Networks
Pith reviewed 2026-05-10 19:29 UTC · model grok-4.3
The pith
Grokking occurs when the effective dimensionality of the gradient field crosses from below 1 to above 1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Finite-size scaling analysis of gradient avalanche dynamics across eight model scales shows that effective dimensionality D crosses from subcritical values below 1 to supercritical values above 1 precisely when generalization begins. This crossing exhibits self-organized criticality. Experiments with synthetic i.i.d. Gaussian gradients keep D near 1 independent of network topology, while real training produces dimensional excess from backpropagation correlations, confirming that D tracks gradient field geometry.
What carries the argument
Effective dimensionality D obtained via finite-size scaling of gradient avalanche dynamics, which quantifies the geometry of the gradient field and marks the sub- to super-diffusive transition.
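The text does not give the formula by which D is extracted. One common reading, consistent with the finite-size scaling relation s_max ∼ N^D quoted later on this page, is a log-log fit of the largest avalanche size against model scale. A minimal sketch under that assumption (the function name `estimate_D` and the sample numbers are illustrative, not the paper's code):

```python
import numpy as np

def estimate_D(sizes, s_max):
    """Fit the finite-size scaling exponent D in s_max ~ N^D
    by linear regression in log-log space."""
    logN = np.log(np.asarray(sizes, dtype=float))
    logS = np.log(np.asarray(s_max, dtype=float))
    D, _intercept = np.polyfit(logN, logS, 1)  # slope is the exponent
    return D

# Synthetic self-check: data generated with D = 1.3 exactly
N = np.array([1e3, 1e4, 1e5, 1e6])
s_max = 2.0 * N ** 1.3
print(round(estimate_D(N, s_max), 3))  # → 1.3
```

On real training runs the fit window and the avalanche definition would matter, which is exactly the robustness question the referee raises below.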
If this is right
- D(t) crossing 1 can serve as a predictor of when grokking will occur, independent of specific network topology.
- Overparameterized networks achieve generalization once backpropagation correlations drive D above the critical value of 1.
- The architecture independence of D implies that altering gradient correlations alone can control the memorization-to-generalization shift.
- Synthetic gradients remaining at D approximately 1 isolate the role of real training correlations in enabling the supercritical regime.
Where Pith is reading between the lines
- The same D-threshold view could be tested on sequence tasks to check whether language-model grokking follows identical dimensional scaling.
- Training methods that deliberately enhance or suppress specific gradient correlations might be designed to shift the transition point earlier or later.
- This geometric framing links grokking to other critical phenomena in high-dimensional optimization where correlation structure governs abrupt changes in behavior.
Load-bearing premise
The finite-size scaling of gradient avalanche dynamics identifies a genuine phase transition whose critical point coincides with generalization onset, and D depends only on gradient correlations once architecture effects are removed.
What would settle it
An experiment in which generalization onset occurs while D stays below 1, or D crosses 1 without any generalization improvement, across additional scales or architectures.
Figures
Original abstract
Neural network grokking -- the abrupt memorization-to-generalization transition -- challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a dimensional phase transition: effective dimensionality D crosses from sub-diffusive (subcritical, D < 1) to super-diffusive (supercritical, D > 1) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, D reflects gradient field geometry, not network architecture: synthetic i.i.d. Gaussian gradients maintain D ≈ 1 regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized D(t) crossing -- robust across topologies -- offers new insight into the trainability of overparameterized networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that grokking is a dimensional phase transition in neural networks: the effective dimensionality D of the gradient field crosses from sub-diffusive (D < 1) to super-diffusive (D > 1) precisely at generalization onset, exhibiting self-organized criticality. D is argued to reflect gradient-field geometry (arising from back-propagation correlations) rather than architecture, supported by finite-size scaling across eight model scales and a contrast between real gradients and i.i.d. Gaussian synthetics that remain at D ≈ 1.
Significance. If the scaling analysis and architecture-independence claims hold, the work supplies a concrete, physics-inspired mechanism for the abrupt memorization-to-generalization transition and for the trainability of over-parameterized networks. It also supplies a falsifiable link between gradient avalanche statistics and SOC, which could be tested in other optimization settings.
Major comments (3)
- [Abstract] The central claim requires that finite-size scaling of gradient avalanche dynamics yields a D(t) that crosses 1 exactly at generalization onset. The manuscript reports this crossing but does not state the explicit scaling ansatz, the functional form used for collapse, or the fitting window; without these, it is impossible to judge whether the reported critical point is robust or an artifact of avalanche-definition choices (threshold, temporal binning).
- The synthetic-control experiment assumes i.i.d. Gaussian gradients adequately proxy the non-stationary, loss-dependent statistics of real back-propagated gradients. If the synthetics fail to reproduce the actual correlation structure, the conclusion that dimensional excess is due solely to back-propagation (and therefore architecture-independent) does not follow.
- The architecture-independence claim is load-bearing for the interpretation that D is a property of gradient geometry alone. The manuscript states the result is robust across topologies, yet provides no quantitative table or figure showing D values (or collapse quality) for the eight scales under varied architectures once correlations are isolated.
Minor comments (2)
- Clarify how the avalanche observable is defined (size, duration, threshold) and whether results are sensitive to these choices; a short robustness appendix would suffice.
- The notation D(t) is introduced without an explicit formula for its extraction from avalanche statistics; an equation or pseudocode block would improve reproducibility.
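As an illustration of the kind of pseudocode the second minor comment requests, here is a minimal avalanche segmentation under one common convention (supra-threshold runs of a per-step gradient-norm series; size as summed excess over threshold, duration as run length). Both conventions are assumptions for the sketch, not the paper's stated definitions:

```python
import numpy as np

def avalanches(signal, threshold):
    """Segment a per-step gradient-norm series into avalanches:
    maximal runs of consecutive steps above `threshold`.
    Returns (sizes, durations); size = summed excess over threshold."""
    sizes, durations = [], []
    size, duration = 0.0, 0
    for x in signal:
        if x > threshold:
            size += x - threshold
            duration += 1
        elif duration:                # a run just ended
            sizes.append(size)
            durations.append(duration)
            size, duration = 0.0, 0
    if duration:                      # a run reaching the end of the series
        sizes.append(size)
        durations.append(duration)
    return np.array(sizes), np.array(durations)

g = np.array([0.1, 0.9, 1.4, 0.2, 0.8, 0.05])
s, d = avalanches(g, threshold=0.5)
print(d.tolist())  # → [2, 1]
```

The referee's point is that results should be shown to be insensitive to the `threshold` and binning choices in such a definition.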
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We agree that additional methodological details and quantitative evidence will strengthen the presentation of our results. We address each of the major comments below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Abstract] The central claim requires that finite-size scaling of gradient avalanche dynamics yields a D(t) that crosses 1 exactly at generalization onset. The manuscript reports this crossing but does not state the explicit scaling ansatz, the functional form used for collapse, or the fitting window; without these, it is impossible to judge whether the reported critical point is robust or an artifact of avalanche-definition choices (threshold, temporal binning).
  Authors: We concur that the finite-size scaling procedure requires more explicit documentation. In the revised version, we will specify the scaling ansatz used to collapse the D(t) curves across model scales, the functional form assumed for the collapse (including any power-law or scaling exponents), and the precise fitting window employed around the transition point, and we will perform sensitivity analyses with respect to the avalanche detection threshold and temporal binning. These clarifications will substantiate that the D = 1 crossing is robust and aligned with the onset of generalization. (Revision: yes)
- Referee: The synthetic-control experiment assumes i.i.d. Gaussian gradients adequately proxy the non-stationary, loss-dependent statistics of real back-propagated gradients. If the synthetics fail to reproduce the actual correlation structure, the conclusion that dimensional excess is due solely to back-propagation (and therefore architecture-independent) does not follow.
  Authors: The i.i.d. Gaussian synthetics are intended as a minimal control to highlight the role of back-propagation-induced correlations in producing dimensional excess. Although they do not fully replicate the non-stationary and loss-dependent aspects of real gradients, the key observation is that only the real gradients exhibit the sub-to-super-diffusive transition, while the synthetics remain at D ≈ 1 irrespective of topology. We will augment the text with a more detailed justification of this control and its limitations, but we maintain that it supports the architecture-independent nature of the gradient-geometry effect. (Revision: partial)
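The authors' control code is not shown. As a minimal sketch of the contrast they describe, assuming it can be illustrated on one-dimensional surrogate "gradient" series, an AR(1) process stands in here for backpropagation-induced memory, and the helper `lag1_autocorr` is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

T = 20_000
iid = rng.normal(size=T)        # i.i.d. Gaussian surrogate gradients (control)
corr = np.empty(T)              # AR(1) surrogate with memory coefficient rho
corr[0], rho = 0.0, 0.9
for t in range(1, T):
    corr[t] = rho * corr[t - 1] + rng.normal()

# The i.i.d. control carries no temporal correlation; the AR(1) series does.
print(lag1_autocorr(iid), lag1_autocorr(corr))
```

The referee's objection amounts to asking whether such a memoryless control is a fair stand-in for the richer, non-stationary correlation structure of real back-propagated gradients.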
- Referee: The architecture-independence claim is load-bearing for the interpretation that D is a property of gradient geometry alone. The manuscript states the result is robust across topologies, yet provides no quantitative table or figure showing D values (or collapse quality) for the eight scales under varied architectures once correlations are isolated.
  Authors: We acknowledge the need for more explicit quantitative support. The revised manuscript will include a table (or supplementary figure) presenting the effective dimensionality D and the associated finite-size collapse metrics for each of the eight model scales across the varied network topologies, with the synthetic controls used to isolate correlation effects. This will provide the requested evidence that the dimensional transition is independent of architecture. (Revision: yes)
Circularity Check
No significant circularity; derivation relies on empirical finite-size scaling of observed dynamics.
Full rationale
The paper's central claim rests on applying finite-size scaling to gradient avalanche dynamics measured across model scales, then observing that the resulting effective dimensionality D crosses 1 precisely at the memorization-to-generalization transition. This is presented as an empirical finding supported by contrasts with i.i.d. Gaussian synthetic gradients. No equations or definitions in the provided text reduce D or the transition point to a fitted parameter or self-referential construction; the scaling analysis is described as identifying the crossing rather than presupposing it. No self-citations, imported uniqueness theorems, or ansatzes are invoked in the abstract or summary to bear the load of the phase-transition identification. The derivation therefore remains self-contained against the measured avalanche statistics and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption Gradient avalanche dynamics admit a well-defined effective dimensionality D that can be extracted via finite-size scaling.
- domain assumption Self-organized criticality applies to the gradient field during neural network training.
Invented entities (1)
- effective dimensionality D of the gradient field (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Linked passage: "effective dimensionality D—the FSS exponent in s_max ∼ N^D ... crosses from sub-diffusive (D < 1) to super-diffusive (D > 1) at generalization onset"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
Reference graph
Works this paper leans on
- [1] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, arXiv preprint arXiv:2201.02177 (2022).
- [2] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Communications of the ACM 64, 107 (2021).
- [3] Progress measures for grokking via mechanistic interpretability. N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, arXiv preprint arXiv:2301.05217 (2023).
- [4] Z. Liu, O. Kitouni, N. Nolte, E. Michaud, M. Tegmark, and M. Williams, Advances in Neural Information Processing Systems 35 (2022).
- [5] V. Varma, R. Shah, Z. Kenton, J. Kramár, and R. Kumar, arXiv preprint arXiv:2309.02390 (2023).
- [6] N. Rubin, I. Seroussi, and Z. Ringel, in The Twelfth International Conference on Learning Representations (2024).
- [7] P. Bak, C. Tang, and K. Wiesenfeld, Physical Review Letters 59, 381 (1987).
- [8] P. Bak, C. Tang, and K. Wiesenfeld, Physical Review A 38, 364 (1988).
- [9] J. M. Beggs and D. Plenz, Journal of Neuroscience 23, 11167 (2003).
- [10] D. R. Chialvo, Nature Physics 6, 744 (2010).
- [11] P. Wang (2026), submitted to APS OPEN SCIENCE.
- [12] L. Onsager, Physical Review 65, 117 (1944).
- [13] N. Goldenfeld, Lectures on Phase Transitions and the Renormalization Group (Addison-Wesley, 1992).
- [14] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv preprint arXiv:1312.6120 (2014).
- [15] Neural tangent kernel: Convergence and generalization in neural networks. A. Jacot, F. Gabriel, and C. Hongler, in Advances in Neural Information Processing Systems, Vol. 31 (2018), arXiv:1806.07572.
- [16] K. Christensen and Z. Olami, Physical Review A 46, 1829 (1992).
- [17] Z. Olami, H. J. S. Feder, and K. Christensen, Physical Review Letters 68, 1244 (1992).
- [18] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
- [19] J. P. Sethna, K. A. Dahmen, and C. R. Myers, Nature 410, 242 (2001).
- [20] G. Pruessner, Self-Organised Criticality: Theory, Models and Characterisation (Cambridge University Press, 2012).
- [21] A. Clauset, C. R. Shalizi, and M. E. Newman, SIAM Review 51, 661 (2009).
- [22] Gradient Descent Happens in a Tiny Subspace. G. Gur-Ari, D. A. Roberts, and E. Dyer, arXiv preprint arXiv:1812.04754 (2018).
- [23] A. Ghavasieh, M. Vila-Minana, A. Khurd, J. Beggs, G. Ortiz, and S. Fortunato, arXiv preprint arXiv:2509.22649 (2025).
- [24] A. Ghavasieh, arXiv preprint arXiv:2512.00168 (2025).
- [25]