pith. machine review for the scientific record.

arxiv: 2604.20923 · v1 · submitted 2026-04-22 · 💻 cs.LG


ILDR: Geometric Early Detection of Grokking


Pith reviewed 2026-05-10 01:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords grokking · ILDR · early detection · representation geometry · phase transition · modular arithmetic · generalization · Fisher discriminant

The pith

A geometric ratio on second-to-last layer representations rises to 2.5 times its baseline and signals grokking before validation accuracy improves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that grokking, where networks memorize training data long before they generalize, can be detected early by tracking geometric changes in the second-to-last layer. ILDR measures the ratio of separation between class centroids to the scatter within each class, and it crosses a threshold at 2.5 times its baseline ahead of the accuracy jump. The signal is computed only on held-out data, runs in O(|C|^2 + N) time, and stays stable across random seeds, while prior signals such as weight norms lag or fluctuate. A reader cares because a lead time of 9 to 73 percent of the training budget allows early stopping that cuts total training by 18.6 percent on average, and optimizer interventions at the threshold can even steer when generalization occurs.

Core claim

ILDR is the ratio of inter-class centroid separation to intra-class scatter in second-to-last layer representations, grounded in Fisher's linear discriminant criterion and computed without eigendecomposition in O(|C|^2 + N) time. It rises and crosses a threshold at 2.5 times its baseline before the grokking transition appears in validation accuracy, leading by 9 to 73 percent of the training budget on modular arithmetic and S5 tasks, with lead time growing with algebraic complexity. Across eight seeds it leads by 950 ± 250 steps with a 26 percent coefficient of variation, post-transition variance drops by a factor of 1696, and interventions at the threshold allow bidirectional control over when the generalization transition occurs.
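
For orientation, the Fisher criterion the paper invokes, and the scalar ratio ILDR plausibly instantiates, can be written as follows. The ILDR expression is our reconstruction from the stated definition and the O(|C|^2 + N) cost, not a formula quoted from the paper.

```latex
% Fisher's linear discriminant criterion (classical form):
%   J(w) = (w^T S_B w) / (w^T S_W w),
% with S_B the between-class and S_W the within-class scatter matrix.
% A plausible eigendecomposition-free scalar analogue on penultimate-layer
% features h_i with class centroids mu_c (our reconstruction):
\mathrm{ILDR} \;=\;
  \frac{\tfrac{2}{|C|(|C|-1)} \sum_{c < c'} \lVert \mu_c - \mu_{c'} \rVert_2}
       {\tfrac{1}{N} \sum_{i=1}^{N} \lVert h_i - \mu_{y_i} \rVert_2}
```

The numerator costs O(|C|^2) (pairwise centroid distances) and the denominator O(N) (one pass over samples), which matches the stated complexity.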

What carries the argument

The Inter/Intra-class Distance Ratio (ILDR), the ratio of inter-class centroid separation to intra-class scatter on held-out second-to-last layer representations, which tracks geometric reorganization ahead of accuracy changes.
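
A minimal sketch of how this quantity might be computed, assuming the mean-pairwise-centroid-distance over mean-within-class-scatter reading reconstructed above; the paper's exact normalization may differ.

```python
import numpy as np

def ildr(features: np.ndarray, labels: np.ndarray) -> float:
    """Inter/Intra-class Distance Ratio on held-out penultimate-layer features.

    A sketch under one natural reading of the paper's definition; the exact
    normalization may differ. Cost is O(|C|^2 + N), matching the paper.
    """
    classes = np.unique(labels)  # sorted unique class ids
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # Inter-class separation: mean pairwise centroid distance, O(|C|^2).
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pair = np.linalg.norm(diffs, axis=-1)
    inter = pair[np.triu_indices(len(classes), k=1)].mean()

    # Intra-class scatter: mean distance of each sample to its centroid, O(N).
    idx = np.searchsorted(classes, labels)  # map each label to its centroid row
    intra = np.linalg.norm(features - centroids[idx], axis=-1).mean()

    return float(inter / intra)
```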

If this is right

  • ILDR leads the grokking transition by 9 to 73 percent of the training budget, with longer leads on tasks of higher algebraic complexity.
  • Early stopping triggered at the ILDR threshold reduces total training by 18.6 percent on average (a trigger of this kind is sketched after this list).
  • Optimizer interventions at the ILDR crossing allow either advancing or delaying the generalization transition.
  • After the transition, ILDR variance drops by a factor of 1696, consistent with a sharp phase change in representation space.
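
A monitor implementing that early stop might establish an ILDR baseline early in training and fire once the running value crosses 2.5 times it. The baseline window below is our assumption, not the paper's protocol; only the 2.5 multiplier comes from the paper.

```python
class ILDRTrigger:
    """Fire once held-out ILDR crosses a multiple of its early-training baseline.

    Minimal sketch: the 2.5x default follows the paper's stated threshold,
    but the baseline window (`warmup` measurements) is our assumption.
    """

    def __init__(self, multiplier: float = 2.5, warmup: int = 10):
        self.multiplier = multiplier
        self.warmup = warmup
        self.history: list[float] = []

    def update(self, value: float) -> bool:
        """Record one ILDR measurement; return True once the threshold is crossed."""
        self.history.append(value)
        if len(self.history) <= self.warmup:
            return False  # still establishing the baseline
        baseline = sum(self.history[:self.warmup]) / self.warmup
        return value >= self.multiplier * baseline
```

In a training loop one would call trigger.update(ildr(features, labels)) at each evaluation step and stop, or switch optimizers, when it returns True.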

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If ILDR captures the necessary conditions for generalization, similar geometric ratios could be applied to detect other abrupt capability jumps during training.
  • Bidirectional control through interventions implies grokking depends on maintaining specific representation geometries that can be monitored and adjusted in real time.
  • The method's use of only held-out data suggests it could generalize to settings where memorization effects must be avoided entirely.

Load-bearing premise

The second-to-last layer representations reliably encode the geometric conditions for generalization rather than downstream correlates, and the 2.5 times baseline threshold works without per-task retuning.

What would settle it

A grokking event where validation accuracy improves without ILDR having reached 2.5 times its baseline value beforehand, or where lead time is consistently zero or negative across seeds and tasks.

original abstract

Grokking describes a delayed generalization phenomenon in which a neural network achieves perfect training accuracy long before validation accuracy improves, followed by an abrupt transition to strong generalization. Existing detection signals are indirect: weight norm reflects parameter-space regularization and consistently lags the transition, while GrokFast's slow gradient EMA, used without gradient amplification, is unstable across seeds with standard deviation exceeding mean lead time. We propose the Inter/Intra-class Distance Ratio (ILDR), a geometric metric computed on second-to-last layer representations as the ratio of inter-class centroid separation to intra-class scatter. ILDR provides an early detection signal: it rises and crosses a threshold at 2.5 times its baseline before the grokking transition appears in validation accuracy, indicating early geometric reorganization in representation space. Grounded in Fisher's linear discriminant criterion, ILDR requires no eigendecomposition and runs in O(|C|^2 + N). It is evaluated exclusively on held-out data, making it robust to memorization effects. Across modular arithmetic and permutation group composition (S5), ILDR leads the grokking transition by 9 to 73 percent of the training budget, with lead time increasing with task algebraic complexity. Over eight random seeds, ILDR leads by 950 +/- 250 steps with a coefficient of variation of 26 percent, and post-grokking variance drops by 1696 times, consistent with a sharp phase transition in representation space. Using ILDR as an early stopping trigger reduces training by 18.6 percent on average. Optimizer interventions triggered at the ILDR threshold demonstrate bidirectional control over the transition, suggesting ILDR tracks representational conditions underlying generalization rather than a downstream correlate.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that the Inter/Intra-class Distance Ratio (ILDR), computed as the ratio of inter-class centroid separation to intra-class scatter on second-to-last layer representations and grounded in Fisher's linear discriminant, serves as an early geometric signal for grokking. Specifically, ILDR crosses a 2.5× baseline threshold before validation accuracy improves, leading the transition by 9 to 73 percent of the training budget across modular arithmetic and S5 tasks, with an average lead of 950 ± 250 steps over eight seeds (CV 26%), enabling 18.6% training reduction when used as a trigger and allowing bidirectional control via optimizer interventions at the threshold.

Significance. Should the central claims hold, ILDR would represent a significant advance in detecting the onset of generalization in grokking by providing a direct, efficient geometric measure from representation space rather than indirect parameter-space signals. The explicit reporting of lead times with statistics across seeds, the use of held-out data to avoid memorization, the O(|C|^2 + N) runtime, and the intervention experiments demonstrating causal relevance are particular strengths that enhance the paper's contribution to understanding phase transitions in neural network training.

major comments (2)
  1. [Abstract and §3] The choice of the 2.5× multiplier for the ILDR threshold is presented as an observed value without a derivation from the underlying inter/intra-class geometry or Fisher's criterion. This fixed threshold is load-bearing for the reported lead times (950 ± 250 steps) and training reduction (18.6%), yet no sensitivity analysis or task-independent justification is provided; uniform application to only two task families leaves open whether per-task retuning would be needed elsewhere.
  2. [§4 (Experimental Results)] There is no ablation study on the choice of the second-to-last layer for ILDR computation; the assumption that these representations reliably encode the geometric conditions for generalization (rather than downstream correlates) is central to the early detection claim but untested against other layers, which could affect the reliability of the signal across tasks.
minor comments (3)
  1. [Abstract] Clarify the 'post-grokking variance drops by 1696 times' statement by specifying the quantity whose variance is measured (ILDR or accuracy) and providing the actual pre- and post-transition variance values.
  2. [§2 (Related Work)] The comparison to GrokFast notes instability (std > mean lead time), but additional details on the exact implementation of GrokFast without amplification would aid in reproducing the baseline results.
  3. [Figures] Ensure that all figures showing ILDR curves include the 2.5× threshold line and error bands across seeds for visual assessment of the lead time consistency.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for the constructive major comments. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.

point-by-point responses
  1. Referee: [Abstract and §3] The choice of the 2.5× multiplier for the ILDR threshold is presented as an observed value without a derivation from the underlying inter/intra-class geometry or Fisher's criterion; this fixed threshold is load-bearing for the reported lead times (950 ± 250 steps) and training reduction (18.6%), yet no sensitivity analysis or task-independent justification is provided, potentially requiring per-task retuning as suggested by the uniform application to both task families.

    Authors: We agree that the 2.5× threshold was selected empirically from initial modular arithmetic runs and applied uniformly, without a closed-form derivation from Fisher's criterion. In revision we will add a sensitivity analysis over multipliers 1.5×–3.5×, reporting lead times, variance, and training reduction separately for each task family. This will demonstrate robustness and remove any implication of per-task retuning (a sketch of such a sweep appears after these responses). revision: yes

  2. Referee: [§4 (Experimental Results)] There is no ablation study on the choice of the second-to-last layer for ILDR computation; the assumption that these representations reliably encode the geometric conditions for generalization (rather than downstream correlates) is central to the early detection claim but untested against other layers, which could affect the reliability of the signal across tasks.

    Authors: We acknowledge the lack of layer ablation. The penultimate layer was chosen because it encodes high-level features immediately before the classifier, consistent with the geometric reorganization we measure. We will add an ablation computing ILDR on the last three layers across all tasks and seeds, confirming that the second-to-last layer yields the earliest, most stable signal (a per-layer sketch appears after these responses). revision: yes
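
To make the committed revisions concrete, a sensitivity sweep of the kind promised in response 1 might look like the following; the warmup baseline window is our assumption, and the inputs are a recorded per-step ILDR curve and the observed transition step from a finished run.

```python
import numpy as np

def threshold_sensitivity(ildr_curve, transition_step, multipliers=None, warmup=10):
    """Lead time (in steps) as a function of the ILDR threshold multiplier.

    Hypothetical sketch of the promised analysis: `ildr_curve` is a per-step
    array of held-out ILDR values from a finished run, `transition_step` the
    step where validation accuracy jumps; the warmup window is assumed.
    """
    ildr_curve = np.asarray(ildr_curve)
    if multipliers is None:
        multipliers = np.arange(1.5, 3.75, 0.25)  # the 1.5x-3.5x range above
    baseline = ildr_curve[:warmup].mean()
    leads = {}
    for m in multipliers:
        hits = np.nonzero(ildr_curve >= m * baseline)[0]
        # Positive lead: the crossing precedes the accuracy transition.
        leads[round(float(m), 2)] = (transition_step - int(hits[0])) if hits.size else None
    return leads
```

The layer ablation from response 2 reduces to running the same metric over each candidate layer's held-out activations, reusing the ildr sketch from earlier; the activations dictionary and its layer names are hypothetical.

```python
def ildr_by_layer(activations, labels):
    """ILDR per layer, for the ablation proposed in response 2.

    Hypothetical sketch: `activations` maps layer name to an (N, d) array of
    held-out features; `ildr` is the function from the earlier sketch.
    """
    return {name: ildr(feats, labels) for name, feats in activations.items()}
```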

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines ILDR directly from geometric properties of held-out representations (inter-class centroid separation over intra-class scatter) and motivates it by reference to Fisher's criterion without deriving the specific 2.5× multiplier from that criterion or fitting it to validation accuracy. Lead times are measured independently after the fact, optimizer interventions are reported as external tests, and no self-citations or ansatzes reduce the central claim to its inputs by construction. The derivation chain is self-contained and validated against external measurements rather than its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that penultimate-layer geometry tracks generalization conditions, plus one free parameter (the 2.5x multiplier) chosen to precede the observed transition.

free parameters (1)
  • 2.5x baseline multiplier
    Threshold at which ILDR is declared to have risen; stated as observed but functions as a tunable cutoff for early detection.
axioms (1)
  • domain assumption: Fisher's linear discriminant criterion applied to second-to-last layer representations predicts the onset of generalization
    Invoked to justify why the inter/intra ratio should rise before validation accuracy improves.

pith-pipeline@v0.9.0 · 5595 in / 1404 out tokens · 38976 ms · 2026-05-10T01:18:33.744335+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 4 canonical work pages · 2 internal anchors

  1. Power, A., et al. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177, 2022.
  2. Liu, Y., et al. Grokfast: Accelerated Grokking by Amplifying Slow Gradients. arXiv preprint arXiv:2405.20233, 2024.
  3. Vaswani, A., et al. Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 2017.
  4. Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. International Conference on Learning Representations (ICLR), 2019.
  5. Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress Measures for Grokking via Mechanistic Interpretability. arXiv preprint arXiv:2301.05217, 2023.
  6. Papyan, V., Han, X.Y., and Donoho, D.L. Prevalence of Neural Collapse during the Terminal Phase of Deep Learning Training. arXiv preprint arXiv:2008.08186, 2020.