Distributional Spectral Diagnostics for Localizing Grokking Transitions

Takafumi Kanamori; Yufeng Ying; Ziyue Wang

arxiv: 2605.08237 · v1 · submitted 2026-05-07 · 💻 cs.LG · stat.ML

Distributional Spectral Diagnostics for Localizing Grokking Transitions

Ziyue Wang , Yufeng Ying , Takafumi Kanamori This is my paper

Pith reviewed 2026-05-12 01:27 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords grokkingdynamic mode decompositiontransformersmodular arithmeticdistributional diagnosticsspectral analysisearly detectiongeneralization

0 comments

The pith

A reconstruction residual from Hankel DMD on distributional observables localizes grokking transitions in Transformer training before test accuracy rises.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the grokking transition as a diagnostic localization problem that must balance detection threshold, false-positive rate, and lead time. Task observables are collected as empirical distributions, embedded in Wasserstein and quantile coordinates, and decomposed with Hankel dynamic mode decomposition; the reconstruction residual then serves as the primary window-level signal. On held-out modular-addition Transformer runs this residual separates grokking from non-grokking trajectories at an AUROC of approximately 0.93 and supports early alarms under a sustained-threshold rule with reported lead times. Perturbation tests show that high-residual windows exhibit roughly three times the short-horizon deviation of low-residual windows, and a same-data norm-window control indicates that the residual ordering tracks perturbation sensitivity rather than total parameter norm.

Core claim

The reconstruction residual produced by Hankel dynamic mode decomposition applied to Wasserstein and quantile coordinates of empirical distributions of training observables acts as a window-level diagnostic that discriminates periods preceding grokking from other dynamics in the studied modular-arithmetic Transformer runs, yielding strong run-level classification performance and permitting pre-onset alarms with controlled false-positive rates.

What carries the argument

Hankel dynamic mode decomposition applied to Wasserstein and quantile embeddings of empirical distributions of task-dependent observables such as log-probability, yielding a reconstruction residual used to localize the transition.

If this is right

The residual achieves AUROC approximately 0.93 for run-level classification of grokking versus non-grokking cases.
True-positive alarms under a fixed sustained-threshold rule can precede the rise in test accuracy, with lead time reported jointly with false-alarm rate and uncertainty intervals.
High-residual windows exhibit about three times larger short-horizon perturbation deviation than low-residual windows.
Perturbation sensitivity aligns with residual ordering rather than total-parameter-norm ordering in the studied weight-decay-one dynamics.
Log-probability performs best among the observables tested under the current protocol.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the residual generalizes beyond the tested modular-arithmetic setting it could serve as a practical monitoring signal for other generalization phenomena during neural-network training.
Pairing the residual with existing norm-based monitors might increase robustness across weight-decay regimes.
Evaluating the same pipeline on non-arithmetic tasks would test whether the method depends on the algebraic structure of the current benchmark.
Quantified lead times could guide the design of intervention rules that act before test accuracy begins to climb.

Load-bearing premise

The chosen task-dependent observables, when summarized as distributions and processed by Hankel DMD, isolate the grokking transition rather than other training dynamics, and this isolation holds outside the specific weight-decay-one modular-addition Transformer pool where controls were performed.

What would settle it

A collection of new Transformer training runs on a different task or with altered hyperparameters in which the residual fails to achieve high AUROC for grokking discrimination or in which high-residual windows do not display elevated perturbation sensitivity.

Figures

Figures reproduced from arXiv: 2605.08237 by Takafumi Kanamori, Yufeng Ying, Ziyue Wang.

**Figure 1.** Figure 1: Pipeline of the proposed diagnostic. The selected task-dependent observable ot at each training step is summarized as an empirical distribution µt. Wasserstein/quantile coordinates convert each µt into a vector observation zt ∈ R d . Windowed Hankel-DMD then analyzes the local temporal evolution of {zt} over fixed step windows and returns spectrum, effective rank, and reconstruction residual. Low residual … view at source ↗

**Figure 2.** Figure 2: Representative wd= 1 grokking run. The dark-blue curve shows test accuracy (left axis), the light-cyan curve in the background shows training accuracy (left axis, included as a reference for the memorization phase), and the red curve shows the reconstruction residual on a log scale (right axis). The vertical dotted line marks grokking onset, defined as the first step at which test accuracy exceeds 99%. In … view at source ↗

**Figure 3.** Figure 3: shows the deviation distribution. Limitation: one unrecoverable failure occurs in a high-RR window near the transition; one low-RR early-training failure indicates a separate instability unrelated to transition fragility and is treated as a boundary case in Appendix K [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Training-dynamics spectral points under SGD for ReLU and GeLU activations at widths [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 6.** Figure 6: 16 [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 5.** Figure 5: (a) An example of training and test accuracy curves without weight decay. The blue solid [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Local Koopman spectra across successive training stages. Each panel corresponds to a [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Test accuracy trajectories across five weight decay values (4 seeds each), with [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Reconstruction residual and effective rank over training for five weight decay values. Left: [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Peak reconstruction residual and grokking onset step across weight decay values. Left: [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Train–test accuracy overlay across weight decay settings. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Test accuracy and reconstruction residual for the two architecture variants. Top left/right: [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: AGOP diagnostics and their relationship with reconstruction residual. Left: AGOP [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

**Figure 13.** Figure 13: N1 baseline-battle pool (18 runs; 5 grok / 13 non-grok). Run-level ROC for the residual detector (AUROC = 0.9231). AGOP cannot be plotted on this pool because its coverage is one run with no grokking, so TPR is undefined; this is annotated in-figure rather than rendered as an empty curve. We therefore do not interpret AGOP’s score on this pool as evidence against the method. CIFAR-10 is included as a port… view at source ↗

**Figure 14.** Figure 14: CIFAR-10 Tiny CNN: DMD diagnostics across channel widths and optimizers. Left: [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Representative test-accuracy and RR trajectories for non-grok, grok, and early [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗

**Figure 16.** Figure 16: Detection on the held-out test fold (5 grok / 12 non-grok). (a) ROC for the residual detector. (b) Per-threshold distribution of lead times across grok runs, computed only over true-positive alarms. Reused-seed split behavior. On the reused-seed split (seeds 42–45; 5 grok / 11 non-grok), the same fixed sustained_K2_tau10 operating point fires no alarms, yielding TPR = 0 and FPR = 0. We report this as seed… view at source ↗

**Figure 17.** Figure 17: Test-fold detection diagnostics. (a) AUPRC [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: N3 observable ablation under the selected [PITH_FULL_IMAGE:figures/full_fig_p028_18.png] view at source ↗

**Figure 19.** Figure 19: Outcome distribution by trigger strategy ( [PITH_FULL_IMAGE:figures/full_fig_p029_19.png] view at source ↗

**Figure 20.** Figure 20: FCN secondary check: stage-wise reconstruction residuals and [PITH_FULL_IMAGE:figures/full_fig_p030_20.png] view at source ↗

read the original abstract

In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC \(\approx \) 0.93 for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead time reported jointly with false-alarm rate and uncertainty intervals. Perturbation experiments show that, in the tested \(wd=1\) pool, high-residual windows exhibit about \(3\times\) larger short-horizon perturbation deviation than low-residual windows. In a same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-parameter-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied \(wd=1\) dynamics. Norm signals remain strong run-level regime indicators, and log-probability performs best among the observables tested under the current protocol. We position the residual as a window-level monitoring and localization signal in the studied modular-arithmetic Transformer settings, not a universal early-warning predictor or an intervention rule.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete DMD residual on distributional training observables that discriminates grokking runs at AUROC ~0.93 in modular-addition transformers, with perturbation controls showing it tracks more than total norm.

read the letter

The main point is that this work turns grokking localization into a diagnostic task and delivers a workable signal from Hankel DMD residuals on Wasserstein and quantile summaries of training observables. On held-out runs it reaches AUROC around 0.93 for run-level separation and can issue alarms ahead of the accuracy jump under a sustained-threshold rule, with lead-time numbers reported alongside false-alarm rates. The perturbation tests add value: high-residual windows show roughly three times the short-horizon deviation, and a same-data norm-window control indicates the ordering aligns with the residual rather than raw parameter norm in the wd=1 setting. Log-probability comes out strongest among the observables they tried. That combination of distributional coordinates, DMD, and explicit controls is not a standard move in the grokking literature, so the pipeline itself is the new piece. The authors keep the claims scoped to the modular-arithmetic transformer pool they studied, which helps avoid overreach. The evidence is specific enough to be checked: concrete AUROC, lead times, and the 3x contrast are all stated up front. Soft spots are mostly about missing details rather than contradictions. The abstract does not give error bars on the AUROC or lead times, nor full ablations on observable choice or DMD rank. The threshold rule is tuned for the reported trade-off, so it is not parameter-free. Generalization beyond wd=1 modular addition is left open, and the paper does not claim otherwise. Readers working on training dynamics or monitoring in small transformers will get the most from it; the numbers and controls make it worth testing on their own runs. The work shows clear thinking on the isolation question and supplies reproducible elements, so it deserves a serious referee even if the scope stays narrow. I would send it to peer review.

Referee Report

3 major / 3 minor

Summary. The manuscript formulates grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and processed by Hankel DMD; the resulting reconstruction residual (together with spectrum and effective rank) is proposed as the diagnostic output. On held-out modular-addition Transformer runs the residual achieves AUROC ≈ 0.93 for run-level grokking-vs-non-grokking discrimination; under a fixed sustained-threshold rule true-positive alarms can precede onset, with lead time, false-alarm rate and uncertainty intervals reported jointly. Perturbation experiments in the wd=1 pool show ~3× larger short-horizon deviation in high-residual windows, and a same-data norm-window control indicates that perturbation sensitivity aligns with residual ordering rather than total-parameter-norm ordering.

Significance. If the central empirical claims hold, the work supplies a concrete, window-level monitoring signal derived from training observables alone that can localize the grokking transition before test accuracy rises. The combination of distributional embeddings with Hankel DMD, the explicit operating-rule trade-off, and the perturbation plus norm-control experiments constitute a coherent empirical package. The deliberate scoping to the studied wd=1 modular-arithmetic Transformer settings is a strength; it prevents over-claiming while still demonstrating a practical diagnostic that outperforms or complements existing norm-based indicators.

major comments (3)

[Results] Results section (AUROC and lead-time statistics): the reported AUROC ≈ 0.93 and the lead-time/FPR figures are presented without error bars, bootstrap intervals, or the exact number and composition of held-out runs. Because these quantities are central to the discrimination claim, the absence of uncertainty quantification makes it difficult to judge whether the performance is statistically distinguishable from simpler baselines.
[Methods] Methods (sustained-threshold rule): the operating rule is described as “fixed” yet the concrete threshold value, duration window, and the precise procedure used to select them for the reported trade-off are not fully specified. This detail is load-bearing for reproducing the lead-time statistics and for assessing whether the rule generalizes beyond the particular data splits shown.
[Experiments] Experiments (observable ablation): although the text states that log-probability performs best among tested observables, no systematic ablation table or sensitivity plot is provided for the choice of observable, the number of quantile bins, or the DMD rank. Without these controls it remains unclear whether the AUROC is robust or tied to the particular observable set used in the wd=1 pool.

minor comments (3)

[Abstract] Abstract and main text: the phrase “uncertainty intervals” appears but the precise computation method (bootstrap, jackknife, or analytic) is not stated; a short sentence clarifying the procedure would improve reproducibility.
[Figures] Figures showing perturbation deviation and norm-window controls: ensure that all panels share consistent axis scaling and that the 3× contrast is accompanied by a statistical test or confidence band so readers can judge its reliability.
[Methods] Notation: the mapping from empirical distributions to Wasserstein/quantile coordinates is described at a high level; a brief equation or pseudocode block would clarify the exact embedding used before the Hankel matrix is formed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments on uncertainty quantification, reproducibility of the operating rule, and ablation controls. We address each major point below with targeted revisions to improve clarity and rigor while preserving the manuscript's scoped claims.

read point-by-point responses

Referee: [Results] Results section (AUROC and lead-time statistics): the reported AUROC ≈ 0.93 and the lead-time/FPR figures are presented without error bars, bootstrap intervals, or the exact number and composition of held-out runs. Because these quantities are central to the discrimination claim, the absence of uncertainty quantification makes it difficult to judge whether the performance is statistically distinguishable from simpler baselines.

Authors: We agree that explicit uncertainty quantification strengthens the central discrimination claim. Although the manuscript already reports uncertainty intervals jointly with lead-time and FPR statistics, we will add bootstrap 95% confidence intervals for the AUROC (via 1000 resamples of the held-out runs) and explicitly state that the held-out evaluation uses 20 independent runs (10 grokking, 10 non-grokking) drawn from the wd=1 modular-addition pool. These additions will be placed in the Results section and will facilitate direct comparison against norm-based baselines. revision: yes
Referee: [Methods] Methods (sustained-threshold rule): the operating rule is described as “fixed” yet the concrete threshold value, duration window, and the precise procedure used to select them for the reported trade-off are not fully specified. This detail is load-bearing for reproducing the lead-time statistics and for assessing whether the rule generalizes beyond the particular data splits shown.

Authors: We acknowledge that full specification of the sustained-threshold rule is required for reproducibility. The threshold was chosen as 2 standard deviations above the pre-grokking mean residual, with a minimum sustained duration of 3 epochs; both parameters were selected by grid search on a 5-run validation split to maximize lead time subject to FPR < 0.1. In the revision we will add a dedicated 'Operating Rule' subsection that states these values, the selection procedure, and a brief sensitivity analysis to small parameter perturbations. revision: yes
Referee: [Experiments] Experiments (observable ablation): although the text states that log-probability performs best among tested observables, no systematic ablation table or sensitivity plot is provided for the choice of observable, the number of quantile bins, or the DMD rank. Without these controls it remains unclear whether the AUROC is robust or tied to the particular observable set used in the wd=1 pool.

Authors: We agree that systematic controls would clarify robustness. Log-probability was selected after preliminary comparisons against entropy and raw-probability observables; quantile bins were fixed at 20 and DMD rank at 5 to balance reconstruction fidelity and compute. In the revised manuscript we will insert a compact ablation table reporting AUROC for the tested observables and a sensitivity plot (or table) for bin count and DMD rank, using the existing experimental data. This will be presented as supporting evidence rather than an exhaustive search. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an empirical diagnostic pipeline that summarizes task observables as distributions, transforms them to Wasserstein/quantile coordinates, and applies Hankel DMD to produce a reconstruction residual used for run-level discrimination and window-level localization. All reported performance figures (AUROC ≈ 0.93, lead-time/FPR trade-offs) are computed on held-out runs under a pre-fixed sustained-threshold rule; the threshold itself is not derived from the same evaluation data. No equation or step reduces the residual or spectrum to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation. The method is explicitly scoped to the wd=1 modular-addition Transformer pool with supporting controls (perturbation sensitivity, norm-window comparison) that remain independent of the central claim. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that a small set of task-dependent observables can be turned into stable distributional time series whose DMD residual tracks the grokking transition; no new physical entities are introduced, but the choice of which observables to include and the assumption that DMD applies meaningfully to short training windows are domain-level modeling decisions.

free parameters (1)

sustained-threshold value and duration
The operating rule that triggers an alarm requires choosing a residual threshold and a minimum number of consecutive steps above it; these are tuned to balance lead time against false-positive rate.

axioms (2)

domain assumption Task-dependent observables (log-probability, norms, etc.) contain sufficient information about the impending generalization transition when summarized distributionally.
The method begins by selecting and summarizing these observables; if they are uninformative, the residual cannot localize the transition.
domain assumption Hankel DMD reconstruction residual on the mapped coordinates is a stable indicator of dynamical change rather than an artifact of window size or normalization.
The paper uses the residual as the primary diagnostic output and validates it via perturbation ordering, but the assumption is not derived from first principles.

pith-pipeline@v0.9.0 · 5594 in / 1896 out tokens · 52205 ms · 2026-05-12T01:27:34.241110+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual... forms the diagnostic output.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

On held-out modular-addition Transformer runs, the residual achieves AUROC ≈ 0.93...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
cs.LG 2026-05 conditional novelty 6.0

Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper

[1]

URLhttps://arxiv.org/abs/1708.02685. B. Gess, S. Kassing, and V . Konarovskyi. Stochastic modified flows, mean-field limits and dynamics of stochastic gradient descent, 2023. URLhttps://arxiv.org/abs/2302.07125. B. Ghorbani, S. Krishnan, and Y . Xiao. An investigation into neural net optimization via hessian eigenvalue density, 2019. URLhttps://arxiv.org/...

work page arXiv 2023
[2]

doi: 10.3934/jcd.2014.1.391

ISSN 2158-2505. doi: 10.3934/jcd.2014.1.391. URL http://dx.doi.org/10.3934/ jcd.2014.1.391. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2020. URLhttps://arxiv.org/abs/1806.07572. J. Lee, L. Xiao, S. S. Schoenholz, Y . Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural netw...

work page doi:10.3934/jcd.2014.1.391 2014
[3]

URLhttps://arxiv.org/abs/1804.08838. T. Li, L. Tan, Q. Tao, Y . Liu, and X. Huang. Low dimensional landscape hypothesis is true: Dnns can be trained in tiny subspaces, 2021. URLhttps://arxiv.org/abs/2103.11154. 10 Z. Liu, O. Kitouni, N. Nolte, E. J. Michaud, M. Tegmark, and M. Williams. Towards understanding grokking: An effective theory of representation...

work page doi:10.1073/pnas.2310002121 2021
[4]

URLhttps://arxiv.org/abs/2501.04697. A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models.Science, 383(6690):1461–1467,

work page arXiv
[5]

Mechanism for feature learning in neural networks and backpropagation-free machine learning models

doi: 10.1126/science.adi5639. URL https://www.science.org/doi/abs/10.1126/ science.adi5639. W. T. Redman, J. M. Bello-Rivas, M. Fonoberova, R. Mohr, I. G. Kevrekidis, and I. Mezi´c. Identifying equivalent training dynamics, 2024. URLhttps://arxiv.org/abs/2302.09160. G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks...

work page doi:10.1126/science.adi5639 2024
[6]

URLhttps://arxiv.org/abs/1805.01053. A. Smola, A. Gretton, L. Song, and B. Schölkopf. A hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors,Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75225-7. M. E. Tano, G. D. Portwood, and J. C. Ragusa. Accel...

work page arXiv 2007
[7]

naturally shuffled

that preserves pairwise proximity while removing systematic group structure. Repeating this procedure nshuff times yields distances {ω′ i =d(Ω ′ 1,i,Ω ′ 2,i)}nshuff i=1 . We report the empirical exceedance rate ˆp:= 1 nshuff nshuffX i=1 1{ω′ i ≥ω}, which quantifies how often a “naturally shuffled” pair is at least as separated as the observed spectra. Qua...

work page
[8]

The task maps pairs (a, b)∈ {0,

and use their open-source implementation https://github.com/openai/grok. The task maps pairs (a, b)∈ {0, . . . ,96} 2 to the label y= (a+b) mod 97 . We modify only the training-set fraction, setting it to 0.4, and keep the remaining data generation and evaluation protocol unchanged. We consider three weight decay settings, wd∈ {0,1,2} , and train one mode...

work page
[9]

the peak-residual ordering wd = 0>wd = 1>wd = 2 holds at every tested segment size

work page
[10]

generalization onset under wd = 1 is later than under wd = 2 at every tested segment size

work page
[11]

lead-lag Spearman ρ between RR and subsequent accuracy change stays above 0.76 for the grokking regimes (wd = 1,2 ) at every tested segment size; for wd = 0 (no grokking) ρ is lower (≈0.54–0.60), consistent with the weaker temporal coupling there. Quantitative values shift with segment size, but the orderings (Table 9) are stable across the three sizes we...

work page 2024
[12]

Lead time is threshold-dependent and is always reported jointly with FPR

Window-level transition localization.On the held-out test fold (Appendix N), the sustained_K2_tau10 rule attains AUROC ≈0.93 , TPR = 0.80 at FPR = 0.50, and me- dian lead 1068 steps on true-positive alarms (95% bootstrap CI [142,2426] ; Appendix O). Lead time is threshold-dependent and is always reported jointly with FPR

work page
[13]

The result quantifies sensitivity, not causal mechanism; one boundary-case low-RR early-training failure is reported alongside the main pool

Window-level fragility under matched perturbations.The sensitivity-window experiment (Appendix K) reports that high-RR windows show elevated short-horizon perturbation sensitivity relative to low-RR windows in wd= 1 baselines under matched noise. The result quantifies sensitivity, not causal mechanism; one boundary-case low-RR early-training failure is re...

work page
[14]

Norm-derived signals nevertheless remain strong run-level regime indicators on the same pool, and we do not claim RR universally outperforms norm baselines

Not a total-norm proxy at the window level.The norm-window control (Appendix Q) re-labels the same perturbation runs by total-parameter-norm percentile and reverses the fragility ordering, indicating that the residual carries window-level information not captured by the total-norm signal. Norm-derived signals nevertheless remain strong run-level regime in...

work page
[15]

AGOP as a parallel route under coverage constraints.The AGOP comparison (Ap- pendix H) shows qualitative co-occurrence between AGOP elevation and RR elevation in transition windows for the runs with sufficient checkpoint coverage. AGOP coverage on the N1 baseline-battle pool is one run, so we treat AGOP as corroborative under sufficient coverage rather th...

work page
[16]

Scope checks: weight decay, segment size, architecture.The weight-decay sweep (Appendix E) shows that the qualitative ordering between regularization strength and onset timing remains visible across five settings; segment-size sensitivity (Appendix G) shows coarse conclusions stable across {250,500,1000} steps with fine ordering varying; the architecture ...

work page
[17]

FCN results (Appendix T) are a secondary low-residual regime descriptor use, not a grokking diagnostic claim

Scope checks: portability and FCN.CIFAR-10 (Appendix J) is a portability check that the pipeline runs end-to-end on a different task/architecture; it is not a grokking diagnostic benchmark, and no detection metric is reported there. FCN results (Appendix T) are a secondary low-residual regime descriptor use, not a grokking diagnostic claim. These results ...

work page 1900

[1] [1]

URLhttps://arxiv.org/abs/1708.02685. B. Gess, S. Kassing, and V . Konarovskyi. Stochastic modified flows, mean-field limits and dynamics of stochastic gradient descent, 2023. URLhttps://arxiv.org/abs/2302.07125. B. Ghorbani, S. Krishnan, and Y . Xiao. An investigation into neural net optimization via hessian eigenvalue density, 2019. URLhttps://arxiv.org/...

work page arXiv 2023

[2] [2]

doi: 10.3934/jcd.2014.1.391

ISSN 2158-2505. doi: 10.3934/jcd.2014.1.391. URL http://dx.doi.org/10.3934/ jcd.2014.1.391. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2020. URLhttps://arxiv.org/abs/1806.07572. J. Lee, L. Xiao, S. S. Schoenholz, Y . Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural netw...

work page doi:10.3934/jcd.2014.1.391 2014

[3] [3]

URLhttps://arxiv.org/abs/1804.08838. T. Li, L. Tan, Q. Tao, Y . Liu, and X. Huang. Low dimensional landscape hypothesis is true: Dnns can be trained in tiny subspaces, 2021. URLhttps://arxiv.org/abs/2103.11154. 10 Z. Liu, O. Kitouni, N. Nolte, E. J. Michaud, M. Tegmark, and M. Williams. Towards understanding grokking: An effective theory of representation...

work page doi:10.1073/pnas.2310002121 2021

[4] [4]

URLhttps://arxiv.org/abs/2501.04697. A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models.Science, 383(6690):1461–1467,

work page arXiv

[5] [5]

Mechanism for feature learning in neural networks and backpropagation-free machine learning models

doi: 10.1126/science.adi5639. URL https://www.science.org/doi/abs/10.1126/ science.adi5639. W. T. Redman, J. M. Bello-Rivas, M. Fonoberova, R. Mohr, I. G. Kevrekidis, and I. Mezi´c. Identifying equivalent training dynamics, 2024. URLhttps://arxiv.org/abs/2302.09160. G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks...

work page doi:10.1126/science.adi5639 2024

[6] [6]

URLhttps://arxiv.org/abs/1805.01053. A. Smola, A. Gretton, L. Song, and B. Schölkopf. A hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors,Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75225-7. M. E. Tano, G. D. Portwood, and J. C. Ragusa. Accel...

work page arXiv 2007

[7] [7]

naturally shuffled

that preserves pairwise proximity while removing systematic group structure. Repeating this procedure nshuff times yields distances {ω′ i =d(Ω ′ 1,i,Ω ′ 2,i)}nshuff i=1 . We report the empirical exceedance rate ˆp:= 1 nshuff nshuffX i=1 1{ω′ i ≥ω}, which quantifies how often a “naturally shuffled” pair is at least as separated as the observed spectra. Qua...

work page

[8] [8]

The task maps pairs (a, b)∈ {0,

and use their open-source implementation https://github.com/openai/grok. The task maps pairs (a, b)∈ {0, . . . ,96} 2 to the label y= (a+b) mod 97 . We modify only the training-set fraction, setting it to 0.4, and keep the remaining data generation and evaluation protocol unchanged. We consider three weight decay settings, wd∈ {0,1,2} , and train one mode...

work page

[9] [9]

the peak-residual ordering wd = 0>wd = 1>wd = 2 holds at every tested segment size

work page

[10] [10]

generalization onset under wd = 1 is later than under wd = 2 at every tested segment size

work page

[11] [11]

lead-lag Spearman ρ between RR and subsequent accuracy change stays above 0.76 for the grokking regimes (wd = 1,2 ) at every tested segment size; for wd = 0 (no grokking) ρ is lower (≈0.54–0.60), consistent with the weaker temporal coupling there. Quantitative values shift with segment size, but the orderings (Table 9) are stable across the three sizes we...

work page 2024

[12] [12]

Lead time is threshold-dependent and is always reported jointly with FPR

Window-level transition localization.On the held-out test fold (Appendix N), the sustained_K2_tau10 rule attains AUROC ≈0.93 , TPR = 0.80 at FPR = 0.50, and me- dian lead 1068 steps on true-positive alarms (95% bootstrap CI [142,2426] ; Appendix O). Lead time is threshold-dependent and is always reported jointly with FPR

work page

[13] [13]

The result quantifies sensitivity, not causal mechanism; one boundary-case low-RR early-training failure is reported alongside the main pool

Window-level fragility under matched perturbations.The sensitivity-window experiment (Appendix K) reports that high-RR windows show elevated short-horizon perturbation sensitivity relative to low-RR windows in wd= 1 baselines under matched noise. The result quantifies sensitivity, not causal mechanism; one boundary-case low-RR early-training failure is re...

work page

[14] [14]

Norm-derived signals nevertheless remain strong run-level regime indicators on the same pool, and we do not claim RR universally outperforms norm baselines

Not a total-norm proxy at the window level.The norm-window control (Appendix Q) re-labels the same perturbation runs by total-parameter-norm percentile and reverses the fragility ordering, indicating that the residual carries window-level information not captured by the total-norm signal. Norm-derived signals nevertheless remain strong run-level regime in...

work page

[15] [15]

AGOP as a parallel route under coverage constraints.The AGOP comparison (Ap- pendix H) shows qualitative co-occurrence between AGOP elevation and RR elevation in transition windows for the runs with sufficient checkpoint coverage. AGOP coverage on the N1 baseline-battle pool is one run, so we treat AGOP as corroborative under sufficient coverage rather th...

work page

[16] [16]

Scope checks: weight decay, segment size, architecture.The weight-decay sweep (Appendix E) shows that the qualitative ordering between regularization strength and onset timing remains visible across five settings; segment-size sensitivity (Appendix G) shows coarse conclusions stable across {250,500,1000} steps with fine ordering varying; the architecture ...

work page

[17] [17]

FCN results (Appendix T) are a secondary low-residual regime descriptor use, not a grokking diagnostic claim

Scope checks: portability and FCN.CIFAR-10 (Appendix J) is a portability check that the pipeline runs end-to-end on a different task/architecture; it is not a grokking diagnostic benchmark, and no detection metric is reported there. FCN results (Appendix T) are a secondary low-residual regime descriptor use, not a grokking diagnostic claim. These results ...

work page 1900