Recognition: no theorem link
Dimensional Criticality at Grokking Across MLPs and Transformers
Pith reviewed 2026-05-10 19:19 UTC · model grok-4.3
The pith
The effective cascade dimension D(t) crosses the Gaussian diffusion baseline D=1 at the grokking generalization transition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying the TDU-OFC probe to gradient data from Transformers on modular addition and MLPs on XOR, we find a localized crossing of the effective cascade dimension D(t) through the value 1 precisely at the generalization transition. For modular addition the approach is from above while for XOR it is from below, both consistent with convergence onto a candidate shared critical manifold. Controls show that runs without grokking remain supercritical and that the probe is non-invasive.
What carries the argument
The TDU-OFC avalanche probe, which offline-maps gradient snapshots to cascade statistics and extracts the effective dimension D(t) via finite-size scaling tuned to the grokking point.
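The probe's exact update rule is not reproduced on this page, so the following is only a minimal sketch of a generic Olami-Feder-Christensen-style relaxation, assuming gradient magnitudes are laid out on a 1-D ring as a stress field. The function names, the ring topology, and the `threshold`, `alpha`, and `scale` values are illustrative assumptions, not the paper's TDU-OFC definition.

```python
import numpy as np

def gradient_to_stress(grad, scale):
    """Hypothetical mapping: absolute gradient entries in units of `scale`."""
    return np.abs(np.asarray(grad, dtype=float)).ravel() / scale

def ofc_avalanche_size(stress, threshold=1.0, alpha=0.25, max_topplings=100_000):
    """Relax a 1-D ring of stress values with an OFC-style toppling rule.

    Every site at or above `threshold` resets to zero and passes a fraction
    `alpha` of its stress to each of its two neighbours. The return value is
    the total number of topplings, i.e. the avalanche size.
    """
    s = np.asarray(stress, dtype=float).copy()  # work on a copy: offline probe
    n = s.size
    size = 0
    while size < max_topplings:
        unstable = np.flatnonzero(s >= threshold)
        if unstable.size == 0:
            break  # the field is fully relaxed
        for i in unstable:
            load = s[i]
            s[i] = 0.0
            s[(i - 1) % n] += alpha * load
            s[(i + 1) % n] += alpha * load
            size += 1
    return size

# One gradient snapshot (made-up numbers) mapped to one avalanche size.
snapshot = [0.9, -1.2, 0.0, 0.0]
print(ofc_avalanche_size(gradient_to_stress(snapshot, scale=1.0)))
```

Because the routine copies its input and never writes back to the model, it mirrors the "offline" character described above; with `alpha < 0.5` each toppling dissipates stress, so the relaxation terminates.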
If this is right
- D(t) diverges between grokked and ungrokked training trajectories roughly 100–200 epochs before the accuracy transition.
- Avalanche distributions display heavy tails whose scaling matches the dimensional exponent derived from D(t).
- Ungrokked networks remain at D>1 and never reach the post-grokking regime.
- Shadow-probe tests with α_train=0 confirm that the measurement leaves the training dynamics unchanged.
- Task-dependent crossing directions favor a shared manifold over mere proximity to D=1.
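The page does not state the scaling ansatz behind D(t). As a hedged illustration, the sketch below assumes the textbook finite-size scaling form in which a characteristic cascade size grows with system size L as s_c ∝ L^D, so that D is the slope of log s_c against log L. The function name and the synthetic numbers are assumptions for illustration only.

```python
import numpy as np

def effective_dimension(system_sizes, characteristic_sizes):
    """Estimate D from s_c ~ L**D by a least-squares fit in log-log space."""
    log_l = np.log(np.asarray(system_sizes, dtype=float))
    log_s = np.log(np.asarray(characteristic_sizes, dtype=float))
    slope, _intercept = np.polyfit(log_l, log_s, 1)
    return float(slope)

# Synthetic check: cascade sizes generated with a known exponent.
L = np.array([16, 32, 64, 128])
print(effective_dimension(L, 3.0 * L ** 1.3))  # recovers a slope close to 1.3
```

Under this assumed ansatz, D>1 and D<1 correspond to cascades that grow faster or slower than linearly with system size, which is one way to read the "supercritical" label used for ungrokked runs.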
Where Pith is reading between the lines
- Early monitoring of D(t) might predict whether a run will grok before test accuracy improves.
- The idea of a common critical manifold could unify grokking observations across different problems and models.
- Applying the same probe to other sudden-learning phenomena could reveal whether dimensional criticality is a general feature of generalization transitions.
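The early-monitoring idea in the first bullet can be sketched as a simple separation test between two D(t) trajectories. The tolerance, the patience window, and the function name are hypothetical choices; the paper's own divergence criterion is not given on this page.

```python
import numpy as np

def divergence_onset(d_a, d_b, tol=0.05, patience=5):
    """First epoch index at which |d_a - d_b| > tol holds for `patience`
    consecutive epochs; returns None if the trajectories never separate."""
    gap = np.abs(np.asarray(d_a, dtype=float) - np.asarray(d_b, dtype=float))
    run = 0
    for t, exceeded in enumerate(gap > tol):
        run = run + 1 if exceeded else 0  # isolated spikes reset the counter
        if run == patience:
            return t - patience + 1
    return None

# Toy trajectories: identical until epoch 8, then a sustained gap.
reference = np.ones(20)
candidate = np.ones(20)
candidate[8:] = 1.2
print(divergence_onset(reference, candidate))
```

Requiring a sustained gap rather than a single excursion is a design choice meant to keep noisy single-epoch fluctuations from being read as an early-warning signal.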
Load-bearing premise
The thresholded mapping of gradients to avalanche cascades in TDU-OFC reflects the genuine dynamical evolution of the network rather than probe-induced biases or task artifacts.
What would settle it
A new experiment in which D(t) shows no crossing of 1 at the measured generalization epoch, or in which the crossing direction is not reproducible for a given task, would falsify the central claim.
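A falsification test of this kind needs an explicit rule for what counts as a crossing and in which direction. The following is a minimal sketch, assuming D(t) is available as one value per epoch; the function name and the toy series are illustrative and not taken from the paper.

```python
import numpy as np

def find_crossings(d_t, baseline=1.0):
    """Return (index, direction) pairs where the series crosses `baseline`.

    `index` is the first sample on the far side of the baseline; `direction`
    is 'descending' (arriving from above) or 'ascending' (from below).
    """
    sides = np.sign(np.asarray(d_t, dtype=float) - baseline)
    crossings = []
    previous = 0.0
    for i, side in enumerate(sides):
        if side == 0:
            continue  # samples exactly on the baseline do not define a side
        if previous != 0 and side != previous:
            crossings.append((i, "descending" if side < 0 else "ascending"))
        previous = side
    return crossings

print(find_crossings([1.4, 1.2, 1.05, 0.95, 0.9]))  # one descending crossing
print(find_crossings([0.7, 0.9, 1.1, 1.2]))         # one ascending crossing
```

The claim would then be checked by comparing the returned index with the independently measured generalization epoch, and by confirming that the direction is the same across repeated runs of a given task.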
Original abstract
Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example (an abrupt transition from memorization to generalization long after training accuracy saturates), yet robust macroscopic signatures of this transition remain elusive. Here we introduce TDU-OFC (Thresholded Diffusion Update–Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a macroscopic observable, the time-resolved effective cascade dimension $D(t)$, via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline $D=1$ precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through $D=1$ (approaching from $D>1$), while XOR ascends (from $D<1$). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near $D \approx 1$. Negative controls confirm this picture: ungrokked runs remain supercritical ($D>1$) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from $D(t)$. Shadow-probe controls ($\alpha_{\mathrm{train}}=0$) confirm that $D(t)$ is non-invasive, and grokked trajectories diverge from ungrokked ones in $D(t)$ some 100–200 epochs before the behavioral transition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the TDU-OFC (Thresholded Diffusion Update–Olami-Feder-Christensen) offline avalanche probe that converts gradient snapshots into cascade statistics. From these it extracts a macroscopic observable, the time-resolved effective cascade dimension D(t), via grokking-aligned finite-size scaling. Across Transformers on modular addition and MLPs on XOR, D(t) is reported to cross the Gaussian diffusion baseline D=1 precisely at the generalization transition, with task-dependent directions (descending for modular addition, ascending for XOR). Negative controls show ungrokked runs remain supercritical (D>1), shadow probes confirm non-invasiveness, and avalanche distributions exhibit heavy tails consistent with the extracted D.
Significance. If the reported D(t) crossing proves robust to probe-parameter variation and independent of epoch-alignment choices, the work would supply a concrete macroscopic signature linking grokking to critical phenomena, together with a candidate shared critical manifold across architectures and tasks. The consistency between the extracted dimensional exponent and the avalanche statistics, plus the divergence of grokked versus ungrokked trajectories 100–200 epochs before the behavioral transition, would constitute falsifiable, reproducible evidence of this picture.
major comments (3)
- [§3.1] §3.1 (TDU-OFC definition): the avalanche threshold is introduced as a single fixed hyper-parameter without any reported sensitivity analysis or variation around the scale of gradient magnitudes at the generalization point. Because the probe maps gradients to cascades offline, modest threshold shifts could alter the extracted D(t) trajectory and the apparent crossing, directly affecting the central claim that the crossing reflects intrinsic dynamics rather than probe construction.
- [§4.2] §4.2 (finite-size scaling for D(t)): the scaling windows and collapse procedure are described as 'grokking-aligned,' i.e., centered on the epoch at which test accuracy rises. This alignment makes the detection of a D=1 crossing partly dependent on the same behavioral marker used to define the transition, raising the risk that the reported localized crossing is at least partly tautological with the probe and scaling choices rather than an independent prediction.
- [§5] §5 (negative and shadow controls): while ungrokked runs and α_train=0 shadow probes are shown, the manuscript does not report whether D(t) remains invariant under (i) small changes to the offline 'update' definition or (ii) re-derivation of the scaling exponent from pre-transition data only. These tests are load-bearing for the claim that the opposite-direction crossings indicate attraction to a shared critical manifold rather than task-specific probe artifacts.
minor comments (2)
- [Figures 3, 4] Figures 3 and 4: error bars or bootstrap intervals on the D(t) curves are not shown, making it difficult to judge the statistical significance of the reported crossings at D=1.
- [Abstract] The abstract states the crossing occurs but supplies no implementation details, error bars on D(t), or explicit finite-size scaling procedure; these should be added to the main text or supplementary material for reproducibility.
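The bootstrap intervals requested in the first minor comment could be computed along the following lines. This is a sketch under the assumption that several independent D estimates (for example, one per random seed) are available at a given epoch; the function name, the resample count, and the synthetic values are illustrative.

```python
import numpy as np

def bootstrap_ci(samples, stat=np.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `samples`."""
    rng = np.random.default_rng(seed)
    x = np.asarray(samples, dtype=float)
    # Each row is one resample of the data, drawn with replacement.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boots = np.apply_along_axis(stat, 1, x[idx])
    lo, hi = np.quantile(boots, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Synthetic per-seed D values at a single epoch (made-up numbers).
d_values = np.linspace(0.9, 1.1, 50)
low, high = bootstrap_ci(d_values)
print(low, high)
```

A crossing would be easier to defend statistically if the interval lies entirely on one side of D=1 shortly before the transition and entirely on the other side shortly after it.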
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which identify key areas for strengthening the robustness of the TDU-OFC probe and the interpretation of D(t). We address each major comment point by point below, committing to revisions where the manuscript requires additional evidence or clarification.
- Referee: [§3.1] §3.1 (TDU-OFC definition): the avalanche threshold is introduced as a single fixed hyper-parameter without any reported sensitivity analysis or variation around the scale of gradient magnitudes at the generalization point. Because the probe maps gradients to cascades offline, modest threshold shifts could alter the extracted D(t) trajectory and the apparent crossing, directly affecting the central claim that the crossing reflects intrinsic dynamics rather than probe construction.
  Authors: We agree that explicit sensitivity analysis is needed to rule out probe artifacts. The manuscript selects the threshold based on typical gradient magnitudes near the transition, but does not vary it. We will add a sensitivity study varying the threshold by factors of 0.5–2 around this scale, recomputing D(t) trajectories and confirming the D=1 crossing persists. These results and an updated figure will be incorporated into the revised §3.1 and supplementary material.
  Revision: yes
- Referee: [§4.2] §4.2 (finite-size scaling for D(t)): the scaling windows and collapse procedure are described as 'grokking-aligned,' i.e., centered on the epoch at which test accuracy rises. This alignment makes the detection of a D=1 crossing partly dependent on the same behavioral marker used to define the transition, raising the risk that the reported localized crossing is at least partly tautological with the probe and scaling choices rather than an independent prediction.
  Authors: The potential dependence on behavioral alignment is a substantive concern. While D=1 remains an independent theoretical baseline from Gaussian diffusion and the crossing is a localized feature in the D(t) time series, the window centering does rely on the accuracy-defined transition. We will revise §4.2 to include a supplementary analysis using fixed, non-aligned scaling windows applied uniformly across all epochs; this will demonstrate that the D=1 crossing remains detectable near the transition without behavioral centering. A clarifying discussion of this independence will also be added.
  Revision: partial
- Referee: [§5] §5 (negative and shadow controls): while ungrokked runs and α_train=0 shadow probes are shown, the manuscript does not report whether D(t) remains invariant under (i) small changes to the offline 'update' definition or (ii) re-derivation of the scaling exponent from pre-transition data only. These tests are load-bearing for the claim that the opposite-direction crossings indicate attraction to a shared critical manifold rather than task-specific probe artifacts.
  Authors: We concur that these invariance tests are necessary to support the shared-manifold interpretation. We will compute and report D(t) under modest variations to the offline update definition (e.g., alternative gradient thresholding or update aggregation rules) and will re-derive the scaling exponent using only pre-transition data. Both sets of results will be added to the revised §5 and appendix to confirm that the opposite-direction crossings and pre-transition divergence are robust.
  Revision: yes
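The threshold sensitivity study promised in the first response (factors of 0.5–2) might be organised as in the sketch below. `compute_d_t` is a hypothetical stand-in for the full probe pipeline, which is not specified on this page; here it is replaced by a toy function so the example runs on its own.

```python
import numpy as np

def first_crossing(d_t, baseline=1.0):
    """Index of the first sample past `baseline`, or None if no sign flip.

    Samples lying exactly on the baseline are ignored by this simple rule.
    """
    x = np.asarray(d_t, dtype=float) - baseline
    flips = np.flatnonzero(np.sign(x[:-1]) * np.sign(x[1:]) < 0)
    return int(flips[0]) + 1 if flips.size else None

def threshold_sweep(compute_d_t, base_threshold,
                    factors=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Map each threshold multiplier to the crossing index it produces."""
    return {f: first_crossing(compute_d_t(f * base_threshold)) for f in factors}

def toy_probe(threshold, n_epochs=100):
    """Toy stand-in (assumption): D(t) decays linearly, shifted by threshold."""
    t = np.arange(n_epochs)
    return 1.505 - 0.01 * t + 0.02 * (threshold - 1.0)

result = threshold_sweep(toy_probe, base_threshold=1.0)
print(result)
```

If the crossing indices returned for the different multipliers stay within a few epochs of each other and of the generalization epoch, that would support the claim that the crossing is not an artifact of one threshold choice; a wide spread would point the other way.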
Circularity Check
D(t) crossing reported at grokking transition partly follows from grokking-aligned finite-size scaling in TDU-OFC probe definition
specific steps
- fitted input called prediction [Abstract]
"extracts a macroscopic observable -- the time-resolved effective cascade dimension D(t) -- via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline D=1 precisely at the generalization transition."
The finite-size scaling used to obtain D(t) is described as 'grokking-aligned,' meaning its parameters or windowing are chosen with reference to the known grokking epochs. The paper then presents the crossing of D=1 at exactly those epochs as a discovery. This reduces the reported coincidence to a consequence of the alignment step in the definition of D(t) rather than an a-priori prediction from the probe applied to unaligned data.
full rationale
The paper defines the key observable D(t) by applying the newly introduced TDU-OFC avalanche construction to gradient snapshots and then performing finite-size scaling that is explicitly aligned to grokking epochs. The claimed discovery is that this D(t) crosses the D=1 baseline precisely at the independently measured generalization transition (with task-dependent direction). Because the scaling step is tuned to the grokking time scale, the reported coincidence is at least partially forced by the alignment choice rather than emerging as an independent prediction from the raw dynamics. Negative controls (ungrokked runs stay supercritical) provide some separation, but do not eliminate the dependence on the alignment. This matches the fitted-input-called-prediction pattern at the level of the central macroscopic claim.
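One way to probe this circularity is to estimate the observable on windows that are laid out without any reference to the grokking epoch. The sketch below assumes a per-epoch series and a generic window estimator; the function names, window width, and stride are hypothetical and do not come from the paper.

```python
import numpy as np

def fixed_windows(n_epochs, width, stride):
    """Half-open (start, stop) windows tiled from epoch 0, blind to grokking."""
    return [(s, s + width) for s in range(0, n_epochs - width + 1, stride)]

def windowed_estimate(series, width, stride, estimator=np.mean):
    """Apply `estimator` to each fixed window; returns (center, value) pairs."""
    x = np.asarray(series, dtype=float)
    out = []
    for start, stop in fixed_windows(x.size, width, stride):
        center = (start + stop - 1) / 2
        out.append((center, float(estimator(x[start:stop]))))
    return out

print(fixed_windows(10, 4, 3))
print(windowed_estimate(np.arange(10), width=4, stride=3))
```

Because the window positions depend only on `width` and `stride`, any crossing found this way cannot have been placed at the transition by the windowing itself, which is the property the circularity concern asks for.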
Axiom & Free-Parameter Ledger
free parameters (1)
- avalanche threshold in TDU
axioms (2)
- domain assumption: Finite-size scaling from statistical physics applies directly to avalanche distributions extracted from neural network gradient updates.
- domain assumption: The offline TDU-OFC mapping does not alter or bias the underlying training trajectory.
invented entities (1)
- effective cascade dimension D(t): no independent evidence
Forward citations
Cited by 1 Pith paper
- Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.