arXiv: 2604.16431 · v1 · submitted 2026-04-06 · cs.LG · cond-mat.dis-nn · cs.AI · nlin.AO

Dimensional Criticality at Grokking Across MLPs and Transformers

Ping Wang

Pith reviewed 2026-05-10 19:19 UTC · model grok-4.3

keywords: grokking · criticality · avalanche dynamics · effective dimension · generalization · neural networks · cascade statistics · finite-size scaling

The pith

The effective cascade dimension D(t) crosses the Gaussian baseline exactly at the grokking generalization transition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TDU-OFC, a probe that converts gradient snapshots into avalanche cascades and computes a time-resolved effective dimension D(t). It establishes that this dimension crosses D=1 at the point where memorization gives way to generalization. The direction of crossing depends on the task, descending for modular addition and ascending for XOR, pointing to a common critical state. This provides a dynamical signature for grokking that appears before the behavioral change and is absent in non-grokking runs, offering a way to understand the transition as criticality rather than an unexplained jump.

Core claim

By applying the TDU-OFC probe to gradient data from Transformers on modular addition and MLPs on XOR, we find a localized crossing of the effective cascade dimension D(t) through the value 1 precisely at the generalization transition. For modular addition the approach is from above while for XOR it is from below, both consistent with convergence onto a candidate shared critical manifold. Controls show that runs without grokking remain supercritical and that the probe is non-invasive.

What carries the argument

The TDU-OFC avalanche probe, which offline-maps gradient snapshots to cascade statistics and extracts the effective dimension D(t) via finite-size scaling tuned to the grokking point.
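
The page does not describe how TDU-OFC is implemented, so the following is only a toy sketch of the general idea of turning a sequence of gradient snapshots into avalanche sizes. It assumes a one-dimensional ring of sites with an Olami-Feder-Christensen-style toppling rule; the names `relax` and `avalanche_sizes`, the ring geometry, and the `threshold` and `alpha` values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def relax(stress, threshold=1.0, alpha=0.2, max_topplings=100_000):
    """Topple sites at or above threshold until the ring is stable.

    Mutates `stress` in place and returns the avalanche size,
    counted as the total number of topplings.
    Assumption: alpha < 0.5, so each toppling dissipates stress.
    """
    n = stress.size
    size = 0
    while size < max_topplings:
        unstable = np.flatnonzero(stress >= threshold)
        if unstable.size == 0:
            break
        for i in unstable:
            s = stress[i]
            stress[i] = 0.0                      # site resets
            stress[(i - 1) % n] += alpha * s     # left neighbour on the ring
            stress[(i + 1) % n] += alpha * s     # right neighbour on the ring
            size += 1
    return size

def avalanche_sizes(snapshots, threshold=1.0, alpha=0.2):
    """Drive the ring with |gradient| snapshots and record one avalanche size each.

    `snapshots` has shape (n_snapshots, n_sites). This offline driving rule is
    an illustrative assumption, not the paper's mapping.
    """
    snapshots = np.asarray(snapshots, dtype=float)
    stress = np.zeros(snapshots.shape[1])
    sizes = []
    for g in snapshots:
        stress += np.abs(g)
        sizes.append(relax(stress, threshold, alpha))
    return sizes
```

Because the sketch only reads stored snapshots, it never touches the training run, which is the sense in which such a probe can be called offline.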

If this is right

D(t) diverges between successful and unsuccessful training trajectories hundreds of epochs prior to the accuracy transition.
Avalanche distributions display heavy tails whose scaling matches the dimensional exponent derived from D(t).
Ungrokked networks remain at D>1 and never reach the post-grokking regime.
Shadow-probe tests with α_train = 0 confirm that the measurement leaves the training dynamics unchanged.
Task-dependent crossing directions favor a shared manifold over mere proximity to D=1.
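
The heavy-tail claim in the list above can be checked with a standard maximum-likelihood estimate of a power-law exponent. The sketch below uses the continuous-variable estimator, which is only an approximation for integer avalanche sizes; the function name `powerlaw_exponent` and the cutoff `s_min` are illustrative choices, and nothing here is taken from the paper's own fitting procedure.

```python
import numpy as np

def powerlaw_exponent(sizes, s_min):
    """Continuous MLE for tau in p(s) ~ s^(-tau), using only s >= s_min.

    Returns (tau, approximate standard error).
    """
    tail = np.asarray(sizes, dtype=float)
    tail = tail[tail >= s_min]
    if tail.size < 2:
        raise ValueError("need at least two avalanches above s_min")
    log_sum = np.log(tail / s_min).sum()
    if log_sum == 0.0:
        raise ValueError("all tail values equal s_min; exponent is undefined")
    tau = 1.0 + tail.size / log_sum
    stderr = (tau - 1.0) / np.sqrt(tail.size)
    return tau, stderr
```

The estimate is sensitive to `s_min`, so in practice one would report it for several cutoffs rather than a single value.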

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Early monitoring of D(t) might predict whether a run will grok before test accuracy improves.
The idea of a common critical manifold could unify grokking observations across different problems and models.
Applying the same probe to other sudden-learning phenomena could reveal whether dimensional criticality is a general feature of generalization transitions.

Load-bearing premise

The thresholded mapping of gradients to avalanche cascades in TDU-OFC reflects the genuine dynamical evolution of the network rather than probe-induced biases or task artifacts.

What would settle it

A new experiment in which D(t) shows no crossing of 1 at the measured generalization epoch, or in which the crossing direction is not reproducible for a given task, would falsify the central claim.
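
A falsification test of this kind needs an explicit rule for what counts as a crossing and in which direction. One minimal way to state it, assuming D(t) is available as a plain sequence of per-epoch values, is sketched below; `find_crossings` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def find_crossings(d, baseline=1.0):
    """Return (index, direction) for each crossing of `baseline`.

    `index` is the first position on the new side of the baseline.
    Values exactly equal to the baseline are skipped, so a touch
    without a change of side is not reported.
    """
    signs = np.sign(np.asarray(d, dtype=float) - baseline)
    crossings = []
    prev = 0.0
    for i, s in enumerate(signs):
        if s == 0:
            continue
        if prev != 0 and s != prev:
            crossings.append((i, "descending" if s < 0 else "ascending"))
        prev = s
    return crossings
```

For a modular-addition run the claim predicts a single "descending" entry near the generalization epoch, and for XOR a single "ascending" one; an empty list or a direction that varies between seeds would count against it.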

Figures

Figures reproduced from arXiv: 2604.16431 by Ping Wang.

**Figure 3.** [image: figures/full_fig_p004_3.png]

**Figure 2.** [image: figures/full_fig_p004_2.png]

**Figure 4.** [image: figures/full_fig_p005_4.png]

**Figure 5.** [image: figures/full_fig_p006_5.png]

Original abstract

Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example -- an abrupt transition from memorization to generalization long after training accuracy saturates -- yet robust macroscopic signatures of this transition remain elusive. Here we introduce \textbf{TDU--OFC} (Thresholded Diffusion Update--Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a \emph{macroscopic observable} -- the time-resolved effective cascade dimension $D(t)$ -- via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline $D=1$ precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through $D=1$ (approaching from $D>1$), while XOR ascends (from $D<1$). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near $D \approx 1$. Negative controls confirm this picture: ungrokked runs remain supercritical ($D>1$) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from $D(t)$. Shadow-probe controls ($\alpha_{\mathrm{train}}=0$) confirm that $D(t)$ is non-invasive, and grokked trajectories diverge from ungrokked ones in $D(t)$ some $100$--$200$ epochs before the behavioral transition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extracts a time-resolved effective cascade dimension D(t) from gradient snapshots that crosses 1 at the grokking transition in both MLPs and transformers, but the supporting methods are too thin to rule out probe artifacts.

The main new thing is the TDU-OFC probe that turns gradient snapshots into avalanche cascades and then pulls out D(t) via finite-size scaling. They report D(t) hitting the Gaussian baseline of 1 right at the generalization point, with modular addition descending from above and XOR ascending from below, plus ungrokked runs staying supercritical and some pre-transition divergence between the two groups. Avalanche distributions also show heavy tails consistent with the extracted exponent.

This is a genuine attempt to find a macroscopic dynamical marker that works across architectures rather than just tracking loss or accuracy curves. The opposite crossing directions and the early separation are the parts that stand out if they hold up. They at least mention negative controls and shadow probes with α_train = 0, which is better than nothing.

The soft spots are substantial and mostly in the missing details. The abstract and description give no error bars on D(t), no sensitivity tests on the avalanche threshold, and no explicit description of how the offline cascade mapping or the grokking-aligned scaling is implemented. Because the scaling parameters are chosen to align with the behavioral transition, the reported crossing carries a real risk of being partly tautological with the probe construction. The stress-test point about the threshold potentially interacting with gradient magnitude changes at the transition is worth taking seriously; without direct checks that D(t) stays stable under small threshold shifts, the task-dependent directions could be induced rather than discovered.

This is aimed at people working on phase transitions and generalization in neural nets. A reader hunting for new observables might pick up the avalanche idea as a starting point, but the current version is too preliminary to cite or build on directly. It deserves a serious referee because the core observation is novel enough to test, provided the authors supply code, full implementation details, and robustness checks on the threshold and scaling choices.

Referee Report

3 major / 2 minor

Summary. The paper introduces the TDU-OFC (Thresholded Diffusion Update–Olami-Feder-Christensen) offline avalanche probe that converts gradient snapshots into cascade statistics. From these it extracts a macroscopic observable, the time-resolved effective cascade dimension D(t), via grokking-aligned finite-size scaling. Across Transformers on modular addition and MLPs on XOR, D(t) is reported to cross the Gaussian diffusion baseline D=1 precisely at the generalization transition, with task-dependent directions (descending for modular addition, ascending for XOR). Negative controls show ungrokked runs remain supercritical (D>1), shadow probes confirm non-invasiveness, and avalanche distributions exhibit heavy tails consistent with the extracted D.

Significance. If the reported D(t) crossing proves robust to probe-parameter variation and independent of epoch-alignment choices, the work would supply a concrete macroscopic signature linking grokking to critical phenomena, together with a candidate shared critical manifold across architectures and tasks. The consistency between the extracted dimensional exponent and the avalanche statistics, plus the divergence of grokked versus ungrokked trajectories 100–200 epochs before the behavioral transition, would constitute falsifiable, reproducible evidence of this picture.

major comments (3)

[§3.1] §3.1 (TDU-OFC definition): the avalanche threshold is introduced as a single fixed hyper-parameter without any reported sensitivity analysis or variation around the scale of gradient magnitudes at the generalization point. Because the probe maps gradients to cascades offline, modest threshold shifts could alter the extracted D(t) trajectory and the apparent crossing, directly affecting the central claim that the crossing reflects intrinsic dynamics rather than probe construction.
[§4.2] §4.2 (finite-size scaling for D(t)): the scaling windows and collapse procedure are described as 'grokking-aligned,' i.e., centered on the epoch at which test accuracy rises. This alignment makes the detection of a D=1 crossing partly dependent on the same behavioral marker used to define the transition, raising the risk that the reported localized crossing is at least partly tautological with the probe and scaling choices rather than an independent prediction.
[§5] §5 (negative and shadow controls): while ungrokked runs and α_train=0 shadow probes are shown, the manuscript does not report whether D(t) remains invariant under (i) small changes to the offline 'update' definition or (ii) re-derivation of the scaling exponent from pre-transition data only. These tests are load-bearing for the claim that the opposite-direction crossings indicate attraction to a shared critical manifold rather than task-specific probe artifacts.

minor comments (2)

[Figures 3, 4] Figures 3 and 4: error bars or bootstrap intervals on the D(t) curves are not shown, making it difficult to judge the statistical significance of the reported crossings at D=1.
[Abstract] The abstract states the crossing occurs but supplies no implementation details, error bars on D(t), or explicit finite-size scaling procedure; these should be added to the main text or supplementary material for reproducibility.
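
The bootstrap intervals requested in the first minor comment could be produced along the following lines. This is a sketch under the assumption that D(t) has been computed for several independent training runs (for example, different seeds) on a common epoch grid; `bootstrap_band` and its parameters are hypothetical names.

```python
import numpy as np

def bootstrap_band(d_runs, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap band for the across-run mean of D(t).

    d_runs: array of shape (n_runs, n_epochs), one D(t) curve per run.
    Returns (mean, lower, upper), each of shape (n_epochs,).
    """
    d_runs = np.asarray(d_runs, dtype=float)
    rng = np.random.default_rng(seed)
    n_runs = d_runs.shape[0]
    # Resample whole runs with replacement so temporal structure is preserved.
    idx = rng.integers(0, n_runs, size=(n_boot, n_runs))
    boot_means = d_runs[idx].mean(axis=1)          # (n_boot, n_epochs)
    lower, upper = np.quantile(
        boot_means, [(1.0 - level) / 2.0, (1.0 + level) / 2.0], axis=0
    )
    return d_runs.mean(axis=0), lower, upper
```

A crossing would then be judged significant only at epochs where the whole band lies on one side of D = 1 shortly before and on the other side shortly after.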

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas for strengthening the robustness of the TDU-OFC probe and the interpretation of D(t). We address each major comment point by point below, committing to revisions where the manuscript requires additional evidence or clarification.

Referee: [§3.1] §3.1 (TDU-OFC definition): the avalanche threshold is introduced as a single fixed hyper-parameter without any reported sensitivity analysis or variation around the scale of gradient magnitudes at the generalization point. Because the probe maps gradients to cascades offline, modest threshold shifts could alter the extracted D(t) trajectory and the apparent crossing, directly affecting the central claim that the crossing reflects intrinsic dynamics rather than probe construction.

Authors: We agree that explicit sensitivity analysis is needed to rule out probe artifacts. The manuscript selects the threshold based on typical gradient magnitudes near the transition, but does not vary it. We will add a sensitivity study varying the threshold by factors of 0.5–2 around this scale, recomputing D(t) trajectories and confirming the D=1 crossing persists. These results and an updated figure will be incorporated into the revised §3.1 and supplementary material.
Revision: yes
Referee: [§4.2] §4.2 (finite-size scaling for D(t)): the scaling windows and collapse procedure are described as 'grokking-aligned,' i.e., centered on the epoch at which test accuracy rises. This alignment makes the detection of a D=1 crossing partly dependent on the same behavioral marker used to define the transition, raising the risk that the reported localized crossing is at least partly tautological with the probe and scaling choices rather than an independent prediction.

Authors: The potential dependence on behavioral alignment is a substantive concern. While D=1 remains an independent theoretical baseline from Gaussian diffusion and the crossing is a localized feature in the D(t) time series, the window centering does rely on the accuracy-defined transition. We will revise §4.2 to include a supplementary analysis using fixed, non-aligned scaling windows applied uniformly across all epochs; this will demonstrate that the D=1 crossing remains detectable near the transition without behavioral centering. A clarifying discussion of this independence will also be added.
Revision: partial
Referee: [§5] §5 (negative and shadow controls): while ungrokked runs and α_train=0 shadow probes are shown, the manuscript does not report whether D(t) remains invariant under (i) small changes to the offline 'update' definition or (ii) re-derivation of the scaling exponent from pre-transition data only. These tests are load-bearing for the claim that the opposite-direction crossings indicate attraction to a shared critical manifold rather than task-specific probe artifacts.

Authors: We concur that these invariance tests are necessary to support the shared-manifold interpretation. We will compute and report D(t) under modest variations to the offline update definition (e.g., alternative gradient thresholding or update aggregation rules) and will re-derive the scaling exponent using only pre-transition data. Both sets of results will be added to the revised §5 and appendix to confirm that the opposite-direction crossings and pre-transition divergence are robust.
Revision: yes
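
The threshold sensitivity study promised in the first response (factors of 0.5–2 around the chosen scale) could be organized as a small harness like the one below. It is a sketch only: `estimate_d` stands for whatever routine maps a threshold to a per-epoch D(t) curve, which this page does not specify, and `threshold_sweep` is a hypothetical name.

```python
import numpy as np

def threshold_sweep(estimate_d, base_threshold, factors=(0.5, 1.0, 2.0), baseline=1.0):
    """Recompute D(t) at scaled thresholds and record the first baseline crossing.

    estimate_d: callable, threshold -> sequence of D values per epoch
                (a user-supplied placeholder for the real probe).
    Returns {factor: first epoch index on the new side of baseline, or None}.
    """
    results = {}
    for f in factors:
        d = np.asarray(estimate_d(f * base_threshold), dtype=float)
        above = d > baseline
        flips = np.flatnonzero(above[1:] != above[:-1]) + 1
        results[f] = int(flips[0]) if flips.size else None
    return results
```

If the crossing epoch moves by much more than the width of the accuracy transition as the factor changes, or disappears for some factors, the referee's concern about probe-induced crossings would be supported.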

Circularity Check

1 step flagged

D(t) crossing reported at grokking transition partly follows from grokking-aligned finite-size scaling in TDU-OFC probe definition

specific steps

fitted input called prediction [Abstract]
"extracts a macroscopic observable -- the time-resolved effective cascade dimension D(t) -- via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline D=1 precisely at the generalization transition."

The finite-size scaling used to obtain D(t) is described as 'grokking-aligned,' meaning its parameters or windowing are chosen with reference to the known grokking epochs. The paper then presents the crossing of D=1 at exactly those epochs as a discovery. This reduces the reported coincidence to a consequence of the alignment step in the definition of D(t) rather than an a-priori prediction from the probe applied to unaligned data.

full rationale

The paper defines the key observable D(t) by applying the newly introduced TDU-OFC avalanche construction to gradient snapshots and then performing finite-size scaling that is explicitly aligned to grokking epochs. The claimed discovery is that this D(t) crosses the D=1 baseline precisely at the independently measured generalization transition (with task-dependent direction). Because the scaling step is tuned to the grokking time scale, the reported coincidence is at least partially forced by the alignment choice rather than emerging as an independent prediction from the raw dynamics. Negative controls (ungrokked runs stay supercritical) provide some separation, but do not eliminate the dependence on the alignment. This matches the fitted-input-called-prediction pattern at the level of the central macroscopic claim.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on adapting sandpile avalanche models to neural gradient dynamics and assuming finite-size scaling extracts a meaningful effective dimension. The probe introduces at least one tunable threshold whose value is not independently justified.

free parameters (1)

avalanche threshold in TDU
Used to convert gradient snapshots into discrete cascades; its specific value determines which updates participate in statistics and is not derived from first principles.

axioms (2)

domain assumption: Finite-size scaling from statistical physics applies directly to avalanche distributions extracted from neural network gradient updates.
Invoked to define the time-resolved D(t) and to interpret the crossing of D=1.
domain assumption: The offline TDU-OFC mapping does not alter or bias the underlying training trajectory.
Required for the claim that D(t) is a non-invasive macroscopic observable.
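
For orientation, the textbook finite-size scaling ansatz for avalanche-size distributions in the self-organized criticality literature is written out below. Whether the paper's effective cascade dimension D(t) is extracted from exactly this form is an assumption, since this page does not give the procedure. The symbols s (avalanche size), L (system size), τ (size exponent), and the cutoff function 𝒢 are introduced here for illustration.

```latex
P(s; L) \simeq s^{-\tau}\, \mathcal{G}\!\left(\frac{s}{L^{D}}\right),
\qquad
\langle s^{n} \rangle \propto L^{D\,(1 + n - \tau)} \quad \text{for } n > \tau - 1 .
```

Under this form, D controls how the avalanche cutoff grows with system size, so the first axiom amounts to assuming that cascades built from gradient updates admit such a collapse at all.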

invented entities (1)

effective cascade dimension D(t) (no independent evidence)
purpose: Macroscopic observable that detects the grokking transition via avalanche statistics.
New quantity constructed from the probe; no external falsifiable prediction (e.g., predicted mass or measurable quantity outside the training run) is supplied.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
cs.LG · 2026-05 · unverdicted · novelty 6.0

A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
