Recognition: 2 theorem links
Grokking as Dimensional Phase Transition in Neural Networks
Pith reviewed 2026-05-10 19:29 UTC · model grok-4.3
The pith
Grokking occurs when the effective dimensionality of the gradient field crosses from below 1 to above 1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Finite-size scaling analysis of gradient avalanche dynamics across eight model scales shows that effective dimensionality D crosses from subcritical values below 1 to supercritical values above 1 precisely when generalization begins. This crossing exhibits self-organized criticality. Experiments with synthetic i.i.d. Gaussian gradients keep D near 1 independent of network topology, while real training produces dimensional excess from backpropagation correlations, confirming that D tracks gradient field geometry.
What carries the argument
Effective dimensionality D obtained via finite-size scaling of gradient avalanche dynamics, which quantifies the geometry of the gradient field and marks the sub- to super-diffusive transition.
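The text does not give the formula by which D is extracted. One common reading, consistent with the finite-size scaling relation s_max ∼ N^D quoted later on this page, is a log-log fit of the largest avalanche size against model scale. A minimal sketch under that assumption (the function name `estimate_D` and the sample numbers are illustrative, not the paper's code):

```python
import numpy as np

def estimate_D(sizes, s_max):
    """Fit the finite-size scaling exponent D in s_max ~ N^D
    by linear regression in log-log space."""
    logN = np.log(np.asarray(sizes, dtype=float))
    logS = np.log(np.asarray(s_max, dtype=float))
    D, _intercept = np.polyfit(logN, logS, 1)  # slope is the exponent
    return D

# Synthetic self-check: data generated with D = 1.3 exactly
N = np.array([1e3, 1e4, 1e5, 1e6])
s_max = 2.0 * N ** 1.3
print(round(estimate_D(N, s_max), 3))  # → 1.3
```

On real training runs the fit window and the avalanche definition would matter, which is exactly the robustness question the referee raises below.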
If this is right
- D(t) crossing 1 can serve as a predictor of when grokking will occur, independent of specific network topology.
- Overparameterized networks achieve generalization once backpropagation correlations drive D above the critical value of 1.
- The architecture independence of D implies that altering gradient correlations alone can control the memorization-to-generalization shift.
- Synthetic gradients remaining at D approximately 1 isolate the role of real training correlations in enabling the supercritical regime.
Where Pith is reading between the lines
- The same D-threshold view could be tested on sequence tasks to check whether language-model grokking follows identical dimensional scaling.
- Training methods that deliberately enhance or suppress specific gradient correlations might be designed to shift the transition point earlier or later.
- This geometric framing links grokking to other critical phenomena in high-dimensional optimization where correlation structure governs abrupt changes in behavior.
Load-bearing premise
The finite-size scaling of gradient avalanche dynamics identifies a genuine phase transition whose critical point coincides with generalization onset, and D depends only on gradient correlations once architecture effects are removed.
What would settle it
An experiment in which generalization onset occurs while D stays below 1, or D crosses 1 without any generalization improvement, across additional scales or architectures.
Figures
Original abstract
Neural network grokking -- the abrupt memorization-to-generalization transition -- challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a dimensional phase transition: effective dimensionality D crosses from sub-diffusive (subcritical, D < 1) to super-diffusive (supercritical, D > 1) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, D reflects gradient field geometry, not network architecture: synthetic i.i.d. Gaussian gradients maintain D ≈ 1 regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized D(t) crossing -- robust across topologies -- offers new insight into the trainability of overparameterized networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that grokking is a dimensional phase transition in neural networks: the effective dimensionality D of the gradient field crosses from sub-diffusive (D < 1) to super-diffusive (D > 1) precisely at generalization onset, exhibiting self-organized criticality. D is argued to reflect gradient-field geometry (arising from back-propagation correlations) rather than architecture, supported by finite-size scaling across eight model scales and a contrast between real gradients and i.i.d. Gaussian synthetics that remain at D ≈ 1.
Significance. If the scaling analysis and architecture-independence claims hold, the work supplies a concrete, physics-inspired mechanism for the abrupt memorization-to-generalization transition and for the trainability of over-parameterized networks. It also supplies a falsifiable link between gradient avalanche statistics and SOC, which could be tested in other optimization settings.
Major comments (3)
- [Abstract] The central claim requires that finite-size scaling of gradient avalanche dynamics yields a D(t) that crosses 1 exactly at generalization onset. The manuscript reports this crossing but does not state the explicit scaling ansatz, the functional form used for collapse, or the fitting window; without these, it is impossible to judge whether the reported critical point is robust or an artifact of avalanche-definition choices (threshold, temporal binning).
- The synthetic-control experiment assumes i.i.d. Gaussian gradients adequately proxy the non-stationary, loss-dependent statistics of real back-propagated gradients. If the synthetics fail to reproduce the actual correlation structure, the conclusion that dimensional excess is due solely to back-propagation (and therefore architecture-independent) does not follow.
- The architecture-independence claim is load-bearing for the interpretation that D is a property of gradient geometry alone. The manuscript states the result is robust across topologies, yet provides no quantitative table or figure showing D values (or collapse quality) for the eight scales under varied architectures once correlations are isolated.
Minor comments (2)
- Clarify how the avalanche observable is defined (size, duration, threshold) and whether results are sensitive to these choices; a short robustness appendix would suffice.
- The notation D(t) is introduced without an explicit formula for its extraction from avalanche statistics; an equation or pseudocode block would improve reproducibility.
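As an illustration of the kind of pseudocode the second minor comment requests, here is a minimal avalanche segmentation under one common convention (supra-threshold runs of a per-step gradient-norm series; size as summed excess over threshold, duration as run length). Both conventions are assumptions for the sketch, not the paper's stated definitions:

```python
import numpy as np

def avalanches(signal, threshold):
    """Segment a per-step gradient-norm series into avalanches:
    maximal runs of consecutive steps above `threshold`.
    Returns (sizes, durations); size = summed excess over threshold."""
    sizes, durations = [], []
    size, duration = 0.0, 0
    for x in signal:
        if x > threshold:
            size += x - threshold
            duration += 1
        elif duration:                # a run just ended
            sizes.append(size)
            durations.append(duration)
            size, duration = 0.0, 0
    if duration:                      # a run reaching the end of the series
        sizes.append(size)
        durations.append(duration)
    return np.array(sizes), np.array(durations)

g = np.array([0.1, 0.9, 1.4, 0.2, 0.8, 0.05])
s, d = avalanches(g, threshold=0.5)
print(d.tolist())  # → [2, 1]
```

The referee's point is that results should be shown to be insensitive to the `threshold` and binning choices in such a definition.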
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We agree that additional methodological details and quantitative evidence will strengthen the presentation of our results. We address each of the major comments below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Abstract] The central claim requires that finite-size scaling of gradient avalanche dynamics yields a D(t) that crosses 1 exactly at generalization onset. The manuscript reports this crossing but does not state the explicit scaling ansatz, the functional form used for collapse, or the fitting window; without these, it is impossible to judge whether the reported critical point is robust or an artifact of avalanche-definition choices (threshold, temporal binning).
  Authors: We concur that the finite-size scaling procedure requires more explicit documentation. In the revised version, we will specify the scaling ansatz used to collapse the D(t) curves across model scales, the functional form assumed for the collapse (including any power-law or scaling exponents), and the precise fitting window employed around the transition point, and we will perform sensitivity analyses with respect to the avalanche detection threshold and temporal binning. These clarifications will substantiate that the D = 1 crossing is robust and aligned with the onset of generalization. (Revision: yes)
- Referee: The synthetic-control experiment assumes i.i.d. Gaussian gradients adequately proxy the non-stationary, loss-dependent statistics of real back-propagated gradients. If the synthetics fail to reproduce the actual correlation structure, the conclusion that dimensional excess is due solely to back-propagation (and therefore architecture-independent) does not follow.
  Authors: The i.i.d. Gaussian synthetics are intended as a minimal control to highlight the role of back-propagation-induced correlations in producing dimensional excess. Although they do not fully replicate the non-stationary and loss-dependent aspects of real gradients, the key observation is that only the real gradients exhibit the sub-to-super-diffusive transition, while the synthetics remain at D ≈ 1 irrespective of topology. We will augment the text with a more detailed justification of this control and its limitations, but we maintain that it supports the architecture-independent nature of the gradient-geometry effect. (Revision: partial)
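The authors' control code is not shown. As a minimal sketch of the contrast they describe, assuming it can be illustrated on one-dimensional surrogate "gradient" series, an AR(1) process stands in here for backpropagation-induced memory, and the helper `lag1_autocorr` is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

T = 20_000
iid = rng.normal(size=T)        # i.i.d. Gaussian surrogate gradients (control)
corr = np.empty(T)              # AR(1) surrogate with memory coefficient rho
corr[0], rho = 0.0, 0.9
for t in range(1, T):
    corr[t] = rho * corr[t - 1] + rng.normal()

# The i.i.d. control carries no temporal correlation; the AR(1) series does.
print(lag1_autocorr(iid), lag1_autocorr(corr))
```

The referee's objection amounts to asking whether such a memoryless control is a fair stand-in for the richer, non-stationary correlation structure of real back-propagated gradients.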
- Referee: The architecture-independence claim is load-bearing for the interpretation that D is a property of gradient geometry alone. The manuscript states the result is robust across topologies, yet provides no quantitative table or figure showing D values (or collapse quality) for the eight scales under varied architectures once correlations are isolated.
  Authors: We acknowledge the need for more explicit quantitative support. The revised manuscript will include a table (or supplementary figure) presenting the effective dimensionality D and the associated finite-size collapse metrics for each of the eight model scales across the varied network topologies, with the synthetic controls used to isolate correlation effects. This will provide the requested evidence that the dimensional transition is independent of architecture. (Revision: yes)
Circularity Check
No significant circularity; derivation relies on empirical finite-size scaling of observed dynamics.
Full rationale
The paper's central claim rests on applying finite-size scaling to gradient avalanche dynamics measured across model scales, then observing that the resulting effective dimensionality D crosses 1 precisely at the memorization-to-generalization transition. This is presented as an empirical finding supported by contrasts with i.i.d. Gaussian synthetic gradients. No equations or definitions in the provided text reduce D or the transition point to a fitted parameter or self-referential construction; the scaling analysis is described as identifying the crossing rather than presupposing it. No self-citations, imported uniqueness theorems, or ansatzes are invoked in the abstract or summary to bear the load of the phase-transition identification. The derivation therefore remains self-contained against the measured avalanche statistics and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption Gradient avalanche dynamics admit a well-defined effective dimensionality D that can be extracted via finite-size scaling.
- domain assumption Self-organized criticality applies to the gradient field during neural network training.
Invented entities (1)
- effective dimensionality D of the gradient field (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Linked passage: "effective dimensionality D—the FSS exponent in s_max ∼ N^D ... crosses from sub-diffusive (D < 1) to super-diffusive (D > 1) at generalization onset"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
Reference graph
Works this paper leans on
- [1] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra, arXiv preprint arXiv:2201.02177 (2022).
- [2] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Communications of the ACM 64, 107 (2021).
- [3] Progress measures for grokking via mechanistic interpretability. N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, arXiv preprint arXiv:2301.05217 (2023).
- [4] Z. Liu, O. Kitouni, N. Nolte, E. Michaud, M. Tegmark, and M. Williams, Advances in Neural Information Processing Systems 35 (2022).
- [5] V. Varma, R. Shah, Z. Kenton, J. Kramár, and R. Kumar, arXiv preprint arXiv:2309.02390 (2023).
- [6] N. Rubin, I. Seroussi, and Z. Ringel, in The Twelfth International Conference on Learning Representations (2024).
- [7] P. Bak, C. Tang, and K. Wiesenfeld, Physical Review Letters 59, 381 (1987).
- [8] P. Bak, C. Tang, and K. Wiesenfeld, Physical Review A 38, 364 (1988).
- [9] J. M. Beggs and D. Plenz, Journal of Neuroscience 23, 11167 (2003).
- [10] D. R. Chialvo, Nature Physics 6, 744 (2010).
- [11] P. Wang (2026), submitted to APS OPEN SCIENCE.
- [12] L. Onsager, Physical Review 65, 117 (1944).
- [13] N. Goldenfeld, Lectures on Phase Transitions and the Renormalization Group (Addison-Wesley, 1992).
- [14] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv preprint arXiv:1312.6120 (2014).
- [15] Neural tangent kernel: Convergence and generalization in neural networks. A. Jacot, F. Gabriel, and C. Hongler, in Advances in Neural Information Processing Systems, Vol. 31 (2018), arXiv:1806.07572.
- [16] K. Christensen and Z. Olami, Physical Review A 46, 1829 (1992).
- [17] Z. Olami, H. J. S. Feder, and K. Christensen, Physical Review Letters 68, 1244 (1992).
- [18] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
- [19] J. P. Sethna, K. A. Dahmen, and C. R. Myers, Nature 410, 242 (2001).
- [20] G. Pruessner, Self-Organised Criticality: Theory, Models and Characterisation (Cambridge University Press, 2012).
- [21] A. Clauset, C. R. Shalizi, and M. E. Newman, SIAM Review 51, 661 (2009).
- [22] Gradient Descent Happens in a Tiny Subspace. G. Gur-Ari, D. A. Roberts, and E. Dyer, arXiv preprint arXiv:1812.04754 (2018).
- [23] A. Ghavasieh, M. Vila-Minana, A. Khurd, J. Beggs, G. Ortiz, and S. Fortunato, arXiv preprint arXiv:2509.22649 (2025).
- [24] A. Ghavasieh, arXiv preprint arXiv:2512.00168 (2025).
- [25]