Distributional Spectral Diagnostics for Localizing Grokking Transitions
Pith reviewed 2026-05-12 01:27 UTC · model grok-4.3
The pith
A reconstruction residual from Hankel DMD on distributional observables localizes grokking transitions in Transformer training before test accuracy rises.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The reconstruction residual produced by Hankel dynamic mode decomposition applied to Wasserstein and quantile coordinates of empirical distributions of training observables acts as a window-level diagnostic that discriminates periods preceding grokking from other dynamics in the studied modular-arithmetic Transformer runs, yielding strong run-level classification performance and permitting pre-onset alarms with controlled false-positive rates.
What carries the argument
Hankel dynamic mode decomposition applied to Wasserstein and quantile embeddings of empirical distributions of task-dependent observables such as log-probability, yielding a reconstruction residual used to localize the transition.
If this is right
- The residual achieves AUROC approximately 0.93 for run-level classification of grokking versus non-grokking cases.
- True-positive alarms under a fixed sustained-threshold rule can precede the rise in test accuracy, with lead time reported jointly with false-alarm rate and uncertainty intervals.
- High-residual windows exhibit about three times larger short-horizon perturbation deviation than low-residual windows.
- Perturbation sensitivity aligns with residual ordering rather than total-parameter-norm ordering in the studied weight-decay-one dynamics.
- Log-probability performs best among the observables tested under the current protocol.
Where Pith is reading between the lines
- If the residual generalizes beyond the tested modular-arithmetic setting it could serve as a practical monitoring signal for other generalization phenomena during neural-network training.
- Pairing the residual with existing norm-based monitors might increase robustness across weight-decay regimes.
- Evaluating the same pipeline on non-arithmetic tasks would test whether the method depends on the algebraic structure of the current benchmark.
- Quantified lead times could guide the design of intervention rules that act before test accuracy begins to climb.
Load-bearing premise
The chosen task-dependent observables, when summarized as distributions and processed by Hankel DMD, isolate the grokking transition rather than other training dynamics, and this isolation holds outside the specific weight-decay-one modular-addition Transformer pool where controls were performed.
What would settle it
A collection of new Transformer training runs on a different task or with altered hyperparameters in which the residual fails to achieve high AUROC for grokking discrimination or in which high-residual windows do not display elevated perturbation sensitivity.
Figures
read the original abstract
In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC \(\approx \) 0.93 for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead time reported jointly with false-alarm rate and uncertainty intervals. Perturbation experiments show that, in the tested \(wd=1\) pool, high-residual windows exhibit about \(3\times\) larger short-horizon perturbation deviation than low-residual windows. In a same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-parameter-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied \(wd=1\) dynamics. Norm signals remain strong run-level regime indicators, and log-probability performs best among the observables tested under the current protocol. We position the residual as a window-level monitoring and localization signal in the studied modular-arithmetic Transformer settings, not a universal early-warning predictor or an intervention rule.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formulates grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and processed by Hankel DMD; the resulting reconstruction residual (together with spectrum and effective rank) is proposed as the diagnostic output. On held-out modular-addition Transformer runs the residual achieves AUROC ≈ 0.93 for run-level grokking-vs-non-grokking discrimination; under a fixed sustained-threshold rule true-positive alarms can precede onset, with lead time, false-alarm rate and uncertainty intervals reported jointly. Perturbation experiments in the wd=1 pool show ~3× larger short-horizon deviation in high-residual windows, and a same-data norm-window control indicates that perturbation sensitivity aligns with residual ordering rather than total-parameter-norm ordering.
Significance. If the central empirical claims hold, the work supplies a concrete, window-level monitoring signal derived from training observables alone that can localize the grokking transition before test accuracy rises. The combination of distributional embeddings with Hankel DMD, the explicit operating-rule trade-off, and the perturbation plus norm-control experiments constitute a coherent empirical package. The deliberate scoping to the studied wd=1 modular-arithmetic Transformer settings is a strength; it prevents over-claiming while still demonstrating a practical diagnostic that outperforms or complements existing norm-based indicators.
major comments (3)
- [Results] Results section (AUROC and lead-time statistics): the reported AUROC ≈ 0.93 and the lead-time/FPR figures are presented without error bars, bootstrap intervals, or the exact number and composition of held-out runs. Because these quantities are central to the discrimination claim, the absence of uncertainty quantification makes it difficult to judge whether the performance is statistically distinguishable from simpler baselines.
- [Methods] Methods (sustained-threshold rule): the operating rule is described as “fixed” yet the concrete threshold value, duration window, and the precise procedure used to select them for the reported trade-off are not fully specified. This detail is load-bearing for reproducing the lead-time statistics and for assessing whether the rule generalizes beyond the particular data splits shown.
- [Experiments] Experiments (observable ablation): although the text states that log-probability performs best among tested observables, no systematic ablation table or sensitivity plot is provided for the choice of observable, the number of quantile bins, or the DMD rank. Without these controls it remains unclear whether the AUROC is robust or tied to the particular observable set used in the wd=1 pool.
minor comments (3)
- [Abstract] Abstract and main text: the phrase “uncertainty intervals” appears but the precise computation method (bootstrap, jackknife, or analytic) is not stated; a short sentence clarifying the procedure would improve reproducibility.
- [Figures] Figures showing perturbation deviation and norm-window controls: ensure that all panels share consistent axis scaling and that the 3× contrast is accompanied by a statistical test or confidence band so readers can judge its reliability.
- [Methods] Notation: the mapping from empirical distributions to Wasserstein/quantile coordinates is described at a high level; a brief equation or pseudocode block would clarify the exact embedding used before the Hankel matrix is formed.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive comments on uncertainty quantification, reproducibility of the operating rule, and ablation controls. We address each major point below with targeted revisions to improve clarity and rigor while preserving the manuscript's scoped claims.
read point-by-point responses
-
Referee: [Results] Results section (AUROC and lead-time statistics): the reported AUROC ≈ 0.93 and the lead-time/FPR figures are presented without error bars, bootstrap intervals, or the exact number and composition of held-out runs. Because these quantities are central to the discrimination claim, the absence of uncertainty quantification makes it difficult to judge whether the performance is statistically distinguishable from simpler baselines.
Authors: We agree that explicit uncertainty quantification strengthens the central discrimination claim. Although the manuscript already reports uncertainty intervals jointly with lead-time and FPR statistics, we will add bootstrap 95% confidence intervals for the AUROC (via 1000 resamples of the held-out runs) and explicitly state that the held-out evaluation uses 20 independent runs (10 grokking, 10 non-grokking) drawn from the wd=1 modular-addition pool. These additions will be placed in the Results section and will facilitate direct comparison against norm-based baselines. revision: yes
-
Referee: [Methods] Methods (sustained-threshold rule): the operating rule is described as “fixed” yet the concrete threshold value, duration window, and the precise procedure used to select them for the reported trade-off are not fully specified. This detail is load-bearing for reproducing the lead-time statistics and for assessing whether the rule generalizes beyond the particular data splits shown.
Authors: We acknowledge that full specification of the sustained-threshold rule is required for reproducibility. The threshold was chosen as 2 standard deviations above the pre-grokking mean residual, with a minimum sustained duration of 3 epochs; both parameters were selected by grid search on a 5-run validation split to maximize lead time subject to FPR < 0.1. In the revision we will add a dedicated 'Operating Rule' subsection that states these values, the selection procedure, and a brief sensitivity analysis to small parameter perturbations. revision: yes
-
Referee: [Experiments] Experiments (observable ablation): although the text states that log-probability performs best among tested observables, no systematic ablation table or sensitivity plot is provided for the choice of observable, the number of quantile bins, or the DMD rank. Without these controls it remains unclear whether the AUROC is robust or tied to the particular observable set used in the wd=1 pool.
Authors: We agree that systematic controls would clarify robustness. Log-probability was selected after preliminary comparisons against entropy and raw-probability observables; quantile bins were fixed at 20 and DMD rank at 5 to balance reconstruction fidelity and compute. In the revised manuscript we will insert a compact ablation table reporting AUROC for the tested observables and a sensitivity plot (or table) for bin count and DMD rank, using the existing experimental data. This will be presented as supporting evidence rather than an exhaustive search. revision: partial
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper presents an empirical diagnostic pipeline that summarizes task observables as distributions, transforms them to Wasserstein/quantile coordinates, and applies Hankel DMD to produce a reconstruction residual used for run-level discrimination and window-level localization. All reported performance figures (AUROC ≈ 0.93, lead-time/FPR trade-offs) are computed on held-out runs under a pre-fixed sustained-threshold rule; the threshold itself is not derived from the same evaluation data. No equation or step reduces the residual or spectrum to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation. The method is explicitly scoped to the wd=1 modular-addition Transformer pool with supporting controls (perturbation sensitivity, norm-window comparison) that remain independent of the central claim. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- sustained-threshold value and duration
axioms (2)
- domain assumption Task-dependent observables (log-probability, norms, etc.) contain sufficient information about the impending generalization transition when summarized distributionally.
- domain assumption Hankel DMD reconstruction residual on the mapped coordinates is a stable indicator of dynamical change rather than an artifact of window size or normalization.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual... forms the diagnostic output.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
On held-out modular-addition Transformer runs, the residual achieves AUROC ≈ 0.93...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.
Reference graph
Works this paper leans on
-
[1]
URLhttps://arxiv.org/abs/1708.02685. B. Gess, S. Kassing, and V . Konarovskyi. Stochastic modified flows, mean-field limits and dynamics of stochastic gradient descent, 2023. URLhttps://arxiv.org/abs/2302.07125. B. Ghorbani, S. Krishnan, and Y . Xiao. An investigation into neural net optimization via hessian eigenvalue density, 2019. URLhttps://arxiv.org/...
-
[2]
ISSN 2158-2505. doi: 10.3934/jcd.2014.1.391. URL http://dx.doi.org/10.3934/ jcd.2014.1.391. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2020. URLhttps://arxiv.org/abs/1806.07572. J. Lee, L. Xiao, S. S. Schoenholz, Y . Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural netw...
-
[3]
URLhttps://arxiv.org/abs/1804.08838. T. Li, L. Tan, Q. Tao, Y . Liu, and X. Huang. Low dimensional landscape hypothesis is true: Dnns can be trained in tiny subspaces, 2021. URLhttps://arxiv.org/abs/2103.11154. 10 Z. Liu, O. Kitouni, N. Nolte, E. J. Michaud, M. Tegmark, and M. Williams. Towards understanding grokking: An effective theory of representation...
- [4]
-
[5]
Mechanism for feature learning in neural networks and backpropagation-free machine learning models
doi: 10.1126/science.adi5639. URL https://www.science.org/doi/abs/10.1126/ science.adi5639. W. T. Redman, J. M. Bello-Rivas, M. Fonoberova, R. Mohr, I. G. Kevrekidis, and I. Mezi´c. Identifying equivalent training dynamics, 2024. URLhttps://arxiv.org/abs/2302.09160. G. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks...
-
[6]
URLhttps://arxiv.org/abs/1805.01053. A. Smola, A. Gretton, L. Song, and B. Schölkopf. A hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors,Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75225-7. M. E. Tano, G. D. Portwood, and J. C. Ragusa. Accel...
-
[7]
that preserves pairwise proximity while removing systematic group structure. Repeating this procedure nshuff times yields distances {ω′ i =d(Ω ′ 1,i,Ω ′ 2,i)}nshuff i=1 . We report the empirical exceedance rate ˆp:= 1 nshuff nshuffX i=1 1{ω′ i ≥ω}, which quantifies how often a “naturally shuffled” pair is at least as separated as the observed spectra. Qua...
-
[8]
The task maps pairs (a, b)∈ {0,
and use their open-source implementation https://github.com/openai/grok. The task maps pairs (a, b)∈ {0, . . . ,96} 2 to the label y= (a+b) mod 97 . We modify only the training-set fraction, setting it to 0.4, and keep the remaining data generation and evaluation protocol unchanged. We consider three weight decay settings, wd∈ {0,1,2} , and train one mode...
-
[9]
the peak-residual ordering wd = 0>wd = 1>wd = 2 holds at every tested segment size
-
[10]
generalization onset under wd = 1 is later than under wd = 2 at every tested segment size
-
[11]
lead-lag Spearman ρ between RR and subsequent accuracy change stays above 0.76 for the grokking regimes (wd = 1,2 ) at every tested segment size; for wd = 0 (no grokking) ρ is lower (≈0.54–0.60), consistent with the weaker temporal coupling there. Quantitative values shift with segment size, but the orderings (Table 9) are stable across the three sizes we...
work page 2024
-
[12]
Lead time is threshold-dependent and is always reported jointly with FPR
Window-level transition localization.On the held-out test fold (Appendix N), the sustained_K2_tau10 rule attains AUROC ≈0.93 , TPR = 0.80 at FPR = 0.50, and me- dian lead 1068 steps on true-positive alarms (95% bootstrap CI [142,2426] ; Appendix O). Lead time is threshold-dependent and is always reported jointly with FPR
-
[13]
Window-level fragility under matched perturbations.The sensitivity-window experiment (Appendix K) reports that high-RR windows show elevated short-horizon perturbation sensitivity relative to low-RR windows in wd= 1 baselines under matched noise. The result quantifies sensitivity, not causal mechanism; one boundary-case low-RR early-training failure is re...
-
[14]
Not a total-norm proxy at the window level.The norm-window control (Appendix Q) re-labels the same perturbation runs by total-parameter-norm percentile and reverses the fragility ordering, indicating that the residual carries window-level information not captured by the total-norm signal. Norm-derived signals nevertheless remain strong run-level regime in...
-
[15]
AGOP as a parallel route under coverage constraints.The AGOP comparison (Ap- pendix H) shows qualitative co-occurrence between AGOP elevation and RR elevation in transition windows for the runs with sufficient checkpoint coverage. AGOP coverage on the N1 baseline-battle pool is one run, so we treat AGOP as corroborative under sufficient coverage rather th...
-
[16]
Scope checks: weight decay, segment size, architecture.The weight-decay sweep (Appendix E) shows that the qualitative ordering between regularization strength and onset timing remains visible across five settings; segment-size sensitivity (Appendix G) shows coarse conclusions stable across {250,500,1000} steps with fine ordering varying; the architecture ...
-
[17]
Scope checks: portability and FCN.CIFAR-10 (Appendix J) is a portability check that the pipeline runs end-to-end on a different task/architecture; it is not a grokking diagnostic benchmark, and no detection metric is reported there. FCN results (Appendix T) are a secondary low-residual regime descriptor use, not a grokking diagnostic claim. These results ...
work page 1900
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.