
arxiv: 2606.18465 · v1 · pith:WJHRH7QFnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Pith reviewed 2026-06-27 00:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: grokking, weight norm, logit scale, cross-entropy, softmax saturation, generalization, memorization, temperature scaling

The pith

Weight norm affects grokking delay only by setting the logit scale that controls softmax saturation under cross-entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the known link between smaller weight norms and earlier grokking is direct or mediated. By clamping the norm and varying only an output temperature, the delay can be shifted across its full range; restoring the original logit scale recovers most of the original timing. Across many norm-temperature pairs the delay variance collapses almost entirely onto the logit scale, leaving the norm with little independent explanatory power. The same pattern holds in controls that rule out rescaling artifacts and loss-function specifics. This reframes the norm as an indirect lever rather than the proximal cause.

Core claim

Across a grid of norms and temperatures, the grokking delay collapses onto the logit scale alone (R² = 0.97), with the norm adding 1-2% of explained variance beyond it. Matching the effective logit scale back to baseline recovers about 85% of the delay at two moduli. The effect is loss-dependent: under mean-squared error the logit scale is pinned and the norm acts through a different route. A memorization control, a float64 softmax-collapse audit, and a no-LayerNorm transformer all point to the same channel. Arms forked from one identical state show that the delay follows the held norm value and not the clamp operation itself.

What carries the argument

Logit-scale mediation: the effective scale of the output logits before the softmax, which determines saturation and thereby the timing of the memorization-to-generalization transition.
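A minimal NumPy sketch of the saturation effect named above. It assumes, purely for illustration, that the "effective logit scale" can be proxied by the per-example standard deviation of the temperature-divided logits; the paper's exact definition is not given on this page. The random logits, the class count of 97, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def saturation_stats(logits, tau):
    # Temperature divides the logits before the softmax.
    scaled = logits / tau
    probs = softmax(scaled)
    return {
        # ASSUMPTION: std of scaled logits as a proxy for "effective logit scale".
        "logit_scale": float(scaled.std(axis=-1).mean()),
        # Mean top-class probability: close to 1.0 means a saturated softmax.
        "mean_max_prob": float(probs.max(axis=-1).mean()),
    }

# Synthetic logits (not the paper's data): 256 examples, 97 classes.
rng = np.random.default_rng(0)
logits = 8.0 * rng.standard_normal((256, 97))

taus = (1.0, 2.0, 4.0, 8.0)
stats = [saturation_stats(logits, tau) for tau in taus]
for tau, s in zip(taus, stats):
    print(f"tau={tau:4.1f}  scale={s['logit_scale']:.3f}  max_prob={s['mean_max_prob']:.3f}")
```

With the weights held fixed, raising the temperature lowers both the proxy scale and the top-class probability, which is the lever the fixed-norm experiment pulls.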

If this is right

  • Under cross-entropy the weight norm functions mainly as an upstream controller of logit magnitude rather than acting directly on generalization.
  • Temperature rescaling can be used to slide grokking timing across the range normally produced by norm changes.
  • The mediation disappears under mean-squared error because the logit scale becomes fixed by the loss itself.
  • The forking-arm result rules out the clamp operation as a source of the observed timing shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interventions that directly modulate output scale (temperature, final-layer scaling) may offer finer control over generalization timing than weight-norm penalties.
  • The finding suggests testing whether other regularization effects in transformers are similarly routed through logit saturation rather than through weight magnitudes.
  • If the mediation holds, then models trained with different norms but identical effective logit scales should exhibit statistically indistinguishable grokking curves.

Load-bearing premise

Clamping the weight norm while varying temperature cleanly isolates logit-scale effects without introducing other changes to optimization dynamics or model internals.
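One way to picture the intervention is the sketch below. It assumes the clamp is a rescaling of all parameters to a fixed global L2 norm and that the temperature divides the output logits; the page does not spell out the paper's actual procedure, so `clamp_global_norm`, `logits_with_temperature`, the toy shapes, and the dose `rho = 3.0` are all hypothetical.

```python
import numpy as np

def global_norm(params):
    # L2 norm over all parameter arrays taken together.
    return float(np.sqrt(sum(np.sum(p ** 2) for p in params)))

def clamp_global_norm(params, target_norm):
    # ASSUMPTION: the clamp rescales every array by one common factor.
    current = global_norm(params)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero parameter set")
    scale = target_norm / current
    return [p * scale for p in params]

def logits_with_temperature(hidden, w_out, tau):
    # Linear readout followed by division by the output temperature.
    return (hidden @ w_out) / tau

rng = np.random.default_rng(0)
params = [rng.standard_normal((16, 32)), rng.standard_normal((32, 10))]
n0 = global_norm(params)
rho = 3.0
clamped = clamp_global_norm(params, rho * n0)

# Toy check on the readout only: hidden activations are held fixed here,
# which a real network would not guarantee once earlier layers are rescaled.
hidden = rng.standard_normal((4, 32))
baseline = logits_with_temperature(hidden, params[-1], tau=1.0)
norm_up = logits_with_temperature(hidden, clamped[-1], tau=1.0)
matched = logits_with_temperature(hidden, clamped[-1], tau=rho)
```

In this toy setting `matched` reproduces `baseline` exactly. The premise above is that nothing else relevant to training changes when the same trade is made inside a full network.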

What would settle it

An experiment that matches logit scale across different clamped norms but still observes a large residual difference in grokking delay would falsify the claim that the scale is the dominant proximal variable.
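The falsification test can be phrased as an incremental-variance check, sketched below under stated assumptions: delay is regressed on logit scale alone, then on logit scale plus norm, and the gain in R² is reported. The log-log form, the function names, and the grid are hypothetical, and the numbers are synthetic placeholders built so that delay depends on scale only. They are not the paper's data.

```python
import numpy as np

def r_squared(features, y):
    # Ordinary least squares with an intercept; returns R^2.
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def norm_increment(log_scale, log_norm, log_delay):
    # R^2 from scale alone, and the extra R^2 gained by adding the norm.
    r2_scale = r_squared(log_scale[:, None], log_delay)
    r2_both = r_squared(np.column_stack([log_scale, log_norm]), log_delay)
    return r2_scale, r2_both - r2_scale

# SYNTHETIC grid (illustration only): norm doses rho and temperatures tau.
rho, tau = np.meshgrid([1.0, 2.0, 3.0], [1.0, 2.0, 4.0, 8.0], indexing="ij")
log_norm = np.log(rho).ravel()
log_scale = np.log(rho / tau).ravel()   # ASSUMPTION: scale proportional to rho / tau
log_delay = 2.0 + 1.5 * log_scale       # made-up law: delay depends on scale only

r2_scale, gain = norm_increment(log_scale, log_norm, log_delay)
print(f"R^2 (scale only) = {r2_scale:.4f}, gain from adding norm = {gain:.2e}")
```

On real measurements, a large gain from the norm term at matched logit scale would be the residual that the paragraph above describes.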

Figures

Figures reproduced from arXiv: 2606.18465 by Truong Xuan Khanh.

Figure 1: At a fixed clamped norm, varying only the output temperature 𝜏. (a) Under cross-entropy the cells trace one curve of delay against effective logit scale: the baseline (star) and the norm-up point (𝜏 = 1, top right) lie on it, and increasing 𝜏 slides the delay back down toward baseline. About 83–89% of the norm-up delay is recovered by matching the logit scale (two moduli). (b) Under mean-squared error the…
Figure 2: Data collapse across the 𝜌 × 𝜏 grid (12 cells per modulus, all grok 12/12). 𝑇grok against the effective logit scale at grokking, colored by the norm dose 𝜌. Cells of different 𝜌 but matched effective logit scale share the same delay, so the delay collapses onto the logit scale; the norm dose adds 1–2% of explained variance beyond it.
Figure 3: Under cross-entropy, the temperature changes the delayed-generalization phase, not memorization. 𝑇mem (bottom) is flat across 𝜏 while 𝑇grok and the delay 𝑇grok − 𝑇mem fall together. The 𝑦-axis is log scale.
4.4 The effect is on the delay, not on memorization
A temperature that divides the logits changes both the softmax saturation and the gradient magnitude, so a skeptic can ask whether 𝜏 simply speeds up tr…
Figure 4: Four arms forked from one identical post-memorization state (𝑡 = 800). The grokking delay (from the fork) rises monotonically with the held norm value across all twelve seeds. The labelled contrast, clamp-at-𝑁0 versus clamp-at-𝜌𝑁0, applies the identical clamp operation and differs only in the held value (3.3–3.4×), so the delay tracks the value, not the rescaling operation. Free (dotted) groks faster becau…
Abstract

Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output temperature, we slide the grokking delay across its entire norm-induced range under cross-entropy; matching the effective logit scale back to baseline recovers about 85% of the delay at two moduli. Across a grid of norms and temperatures the delay collapses onto the logit scale alone (R² = 0.97), with the norm adding 1-2% beyond it. The effect is loss-dependent: under mean-squared error the logit scale is pinned and the norm acts through a different route. A memorization control, a float64 softmax-collapse audit, and a no-LayerNorm transformer point to the same channel. Forking arms from one identical state, the delay follows the held norm value and not the clamp operation, which closes a rescaling-artifact concern. The proximal variable is the logit scale and the softmax saturation it drives; the weight norm is only an upstream handle. All numbers, tables, and figures reproduce from released code and data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that under cross-entropy the weight norm influences grokking delay primarily via the effective logit scale and resulting softmax saturation. Clamping the norm and varying temperature slides the delay across its full norm-induced range; across a grid the delay collapses onto logit scale alone (R²=0.97) with norm adding only 1-2%. The effect is loss-dependent (absent under MSE). Controls include a memorization test, float64 softmax audit, no-LayerNorm transformer, and a forking-arms experiment from identical states showing delay tracks the held norm value rather than the clamp. All results are reproducible from released code and data.

Significance. If the mediation result holds, the work clarifies that weight norm is merely an upstream handle rather than the direct cause of delayed generalization, redirecting attention to logit-scale saturation mechanisms. The direct experimental manipulation at fixed norm, high R², multiple orthogonal controls, forking-arms artifact test, and full reproducibility via released code and data are notable strengths that would make this a solid contribution to the grokking literature.

major comments (1)
  1. [Abstract, forking-arms test] The experiment rules out rescaling artifacts from the clamp itself, but does not directly test whether the joint clamping-plus-temperature intervention alters gradient magnitudes, effective learning rates, or internal activation statistics in ways that could independently affect grokking delay. Because the central claim requires that the intervention cleanly isolates logit-scale effects, this leaves open the possibility that the R²=0.97 collapse is partly driven by correlated side effects rather than logit scale alone.
minor comments (1)
  1. [Abstract] The statement that matching logit scale recovers 'about 85% of the delay at two moduli' would be clearer if the specific moduli and the exact matching procedure were stated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive overall assessment and for identifying this specific concern about whether the joint clamping-plus-temperature intervention cleanly isolates logit-scale effects. We respond point-by-point below.

Point-by-point responses
  1. Referee: [Abstract, forking-arms test] The experiment rules out rescaling artifacts from the clamp itself, but does not directly test whether the joint clamping-plus-temperature intervention alters gradient magnitudes, effective learning rates, or internal activation statistics in ways that could independently affect grokking delay. Because the central claim requires that the intervention cleanly isolates logit-scale effects, this leaves open the possibility that the R²=0.97 collapse is partly driven by correlated side effects rather than logit scale alone.

    Authors: The forking-arms experiment forks models from an identical pre-intervention state and shows that subsequent grokking delay tracks the held norm value rather than any property of the clamping operation. Temperature scaling is a post-activation output adjustment that directly rescales logits without changing internal activations, parameter updates, or gradient flow through the network body. The observed collapse of delay onto logit scale (R²=0.97) across a full grid of clamped norms and temperatures, together with the complete absence of the effect under MSE (where logit scale remains pinned), makes it improbable that unmeasured side effects on gradients or activations are the primary driver; any such confounds would need to correlate almost perfectly with the computed logit scale. The no-LayerNorm control and float64 softmax audit provide further orthogonal support for the same channel. We will add a short clarifying paragraph in the discussion section acknowledging this isolation argument.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical mediation via direct parameter variation

full rationale

The paper's claims rest on controlled experiments that clamp weight norm while varying output temperature, then measure grokking delay and regress it against logit scale (R²=0.97). No derivation chain, fitted parameter renamed as prediction, or self-citation is invoked to justify the central result; the forking-arms control and code reproducibility provide independent empirical grounding. The analysis is therefore self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters are fitted to support the central claim. The work relies on standard properties of the softmax and cross-entropy.

axioms (1)
  • [standard math] Softmax saturation and cross-entropy loss depend on the magnitude of the input logits.
    Invoked to explain why logit scale controls the timing of the generalization transition.
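The axiom can be written out as a short worked form. The notation is assumed here rather than taken from the paper: logits z, output temperature τ, softmax probabilities p, and a one-hot target y.

```latex
p_k = \frac{\exp(z_k/\tau)}{\sum_j \exp(z_j/\tau)}, \qquad
\mathcal{L}_{\mathrm{CE}} = -\sum_k y_k \log p_k, \qquad
\frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_k} = \frac{1}{\tau}\,\bigl(p_k - y_k\bigr).
```

Multiplying z by a factor c > 1 at fixed τ pushes p toward a one-hot vector on the largest logit, so on a training example that is already fit, p approaches y and the gradient shrinks toward zero. Dividing by a larger τ undoes that saturation, and the 1/τ prefactor shows that the temperature also rescales the gradient magnitude.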

pith-pipeline@v0.9.1-grok · 5744 in / 1157 out tokens · 56016 ms · 2026-06-27T00:41:28.027678+00:00 · methodology

