
arxiv: 2605.06152 · v3 · pith:L7NYTNSTnew · submitted 2026-05-07 · 💻 cs.LG · cs.CL · math.OC · stat.ML

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Pith reviewed 2026-05-13 07:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · math.OC · stat.ML
keywords slingshot mechanism · numerical feature inflation · floating-point precision · loss spikes · gradient absorption · deep neural networks · positive feedback loop · parameter norm growth

The pith

Floating-point precision limits trigger slingshot loss spikes by creating numerical feature inflation in neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that periodic loss spikes during long unregularized training arise from the limits of floating-point arithmetic rather than from optimization dynamics alone. Once training reaches a high-confidence regime, the difference between the correct-class logit and the others can exceed the threshold at which the correct-class gradient rounds exactly to zero in backpropagation while the incorrect-class gradients do not. The resulting imbalance violates the zero-sum property that gradients across classes should satisfy, producing a systematic drift in the classifier-layer parameters. This drift enters a positive feedback loop with the learned features, driving exponential growth in both the global classifier mean and the global feature mean. The mechanism, termed Numerical Feature Inflation, accounts for the observed pre-spike norm growth, the later reappearance of gradients, and the loss jump itself, while also explaining why similar drift can occur without visible spikes in practical tasks.
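The rounding asymmetry described above can be reproduced on a single example. The following is a minimal sketch, not the paper's code: it assumes a max-subtracted softmax, emulates float32 by rounding every intermediate through the standard-library `struct` module, and uses an arbitrary three-class logit vector with a gap of 30. The names `f32` and `softmax_ce_grad` are illustrative.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def softmax_ce_grad(logits, target, rnd):
    """Gradient p - y of cross-entropy w.r.t. the logits.

    `rnd` is applied after every arithmetic step, so passing `f32`
    emulates float32 and passing `float` keeps native float64.
    """
    zmax = max(logits)
    exps = [rnd(math.exp(rnd(z - zmax))) for z in logits]
    total = 0.0
    for e in exps:                      # sequential accumulation (an assumption)
        total = rnd(total + e)
    probs = [rnd(e / total) for e in exps]
    return [rnd(p - (1.0 if k == target else 0.0)) for k, p in enumerate(probs)]


logits = [30.0, 0.0, 0.0]               # hypothetical high-confidence sample
g32 = softmax_ce_grad(logits, 0, f32)
g64 = softmax_ce_grad(logits, 0, float)

print("float32 grad:", g32, "sum:", sum(g32))
print("float64 grad:", g64, "sum:", sum(g64))
```

Under these assumptions the emulated float32 gradient has an exactly zero correct-class entry and a strictly positive sum, while the float64 gradient keeps a negative correct-class entry and sums to a value far closer to zero.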

Core claim

The central claim is that the slingshot loss spike is produced by Numerical Feature Inflation (NFI). In the high-confidence stage, logit differences surpass the absorption-error threshold of floating-point arithmetic, so that the gradient for the correct class is rounded to zero while gradients for incorrect classes remain nonzero. This breaks the zero-sum constraint on class gradients and introduces a net drift in the parameter update of the classifier layer. The drift couples with the feature vectors to form a positive feedback loop, causing the global means of both the classifier weights and the features to grow exponentially. The resulting inflation produces the rapid norm increase that,

What carries the argument

Numerical Feature Inflation (NFI), the exponential growth of classifier and feature means that follows from selective rounding of the correct-class gradient to zero once logit differences exceed the floating-point absorption threshold.

If this is right

  • Rapid growth in classifier and feature norms before each spike follows directly from the unbalanced parameter drift.
  • Gradients reappear and produce the loss spike once the inflated values push the logit differences back into the representable range.
  • Partial absorption errors can drive abnormal parameter-norm growth without producing a visible loss spike in many practical training settings.
  • The slingshot phenomenon is a numerical artifact of finite-precision computation rather than an intrinsic property of the loss landscape.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Using higher-precision arithmetic or adding logit-difference-aware gradient clipping could suppress the onset of these spikes.
  • Tracking the growth of logit divergence during late-stage training could provide an early indicator of impending norm inflation.
  • The same rounding imbalance may underlie other late-training instabilities, such as sudden divergence or exploding activations, in very deep models.
  • Direct comparison of trajectories under float32 versus exact-arithmetic simulation would isolate the contribution of precision limits.
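The early-indicator idea in the list above could be prototyped as a per-batch monitor. This is a hedged sketch under stated assumptions: it measures, for each sample, the sum of exp(z_k − z_c) over the incorrect classes (which approximates 1 − p_c when small) and flags the sample once that sum is at or below the float32 unit roundoff 2⁻²⁴. The function names, the clamp value, and the example batch are all hypothetical.

```python
import math

U32 = 2.0 ** -24  # float32 unit roundoff: 1 + s rounds to 1 when s <= U32


def residual_sum(logits, target):
    """Sum over k != target of exp(z_k - z_target), computed in float64."""
    zt = logits[target]
    # Clamp the exponent so a badly misclassified sample cannot overflow math.exp.
    return sum(math.exp(min(z - zt, 700.0))
               for k, z in enumerate(logits) if k != target)


def absorbed_fraction(batch_logits, targets, unit_roundoff=U32):
    """Fraction of samples whose correct-class gradient would be absorbed."""
    flags = [residual_sum(z, t) <= unit_roundoff
             for z, t in zip(batch_logits, targets)]
    return sum(flags) / len(flags)


# Hypothetical batch: two saturated samples and one moderately confident one.
batch = [[30.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 40.0, 0.0]]
targets = [0, 0, 1]
print("absorbed fraction:", absorbed_fraction(batch, targets))
```

Logging this fraction alongside the parameter norms would show whether norm growth accelerates once most samples are flagged, which is the pattern the Figure 1 caption describes for Softmax Collapse.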

Load-bearing premise

Once the logit difference exceeds the absorption threshold, the correct-class gradient rounds exactly to zero while the incorrect-class gradients stay nonzero, and this imbalance necessarily creates an exponential positive feedback loop with the features.
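A worked estimate makes the premise concrete. The derivation below is a sketch under assumptions that are not taken from the paper: a max-subtracted softmax, round-to-nearest-even arithmetic with t significand bits (t = 24 for float32, t = 53 for float64), K classes, and correct class c. The paper's own definition of the absorption threshold may differ.

```latex
\[
\hat{y}_c = \frac{1}{1 + S}, \qquad
S = \sum_{k \neq c} e^{-\Delta_k}, \qquad
\Delta_k = z_c - z_k \ge 0 .
\]
\[
S \le 2^{-t}
\;\Longrightarrow\;
\operatorname{fl}(1 + S) = 1
\;\Longrightarrow\;
\operatorname{fl}(\hat{y}_c - y_c) = 0,
\qquad\text{while}\qquad
\hat{y}_k = e^{-\Delta_k} > 0 \quad (k \neq c).
\]
\[
\Delta_k = \Delta \ \text{for all } k \neq c:
\quad (K-1)\,e^{-\Delta} \le 2^{-t}
\iff
\Delta \ge t \ln 2 + \ln (K-1)
\approx
\begin{cases}
16.64 + \ln(K-1), & t = 24,\\
36.74 + \ln(K-1), & t = 53.
\end{cases}
\]
```

This is a sufficient condition rather than a sharp one. If the accumulator already holds the leading 1 when each small term is added, every term is absorbed individually and the correct-class gradient can vanish at about Δ ≥ t ln 2. The incorrect-class terms stay positive only as long as e^{−Δ_k} does not underflow.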

What would settle it

Training identical models with higher-precision arithmetic (such as float64) or with enforced logit differences kept below the absorption threshold and observing whether the periodic loss spikes and rapid parameter-norm growth still appear.
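Before running full training comparisons, the precision dependence of the threshold can be checked in isolation. The sketch below is an illustration under assumptions: two classes with logits (gap, 0), a max-subtracted softmax, integer gaps only, and float32 emulated by rounding each intermediate through `struct`. The helper names are hypothetical.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def correct_class_grad(gap, rnd):
    """p_c - 1 for two classes with logits (gap, 0), rounding after each step."""
    e_wrong = rnd(math.exp(-gap))       # exp(z_wrong - z_max)
    total = rnd(1.0 + e_wrong)          # exp(0) + exp(-gap)
    p_correct = rnd(1.0 / total)
    return rnd(p_correct - 1.0)


def first_absorbing_gap(rnd, max_gap=100):
    """Smallest integer gap at which the correct-class gradient is exactly 0."""
    for gap in range(1, max_gap + 1):
        if correct_class_grad(float(gap), rnd) == 0.0:
            return gap
    return None


print("float32:", first_absorbing_gap(f32))    # 17
print("float64:", first_absorbing_gap(float))  # 37
```

The two values match the ceilings of 24 ln 2 ≈ 16.64 and 53 ln 2 ≈ 36.74, so in this toy setting moving to float64 roughly doubles the logit gap that training must reach before the correct-class gradient is absorbed.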

Figures

Figures reproduced from arXiv: 2605.06152 by Jianjun Cao, Liu Hanqing, Yuanze Li, Zijian Zhou.

Figure 1
Figure 1: Precision-induced NFI dynamics. (a) Slingshot loss spikes disappear when training is performed in float64. Casting only the logits/loss computation to float64 is also sufficient to remove the spikes, showing that the instability originates from the loss computation. (b) Before most samples enter Softmax Collapse, the global feature mean grows slowly. Once most samples collapse, ∥μ_G∥ enters a rapid-growth …
Figure 2
Figure 2: Mechanistic evidence for Slingshot spikes. (a) Distribution of classifier-layer parameter updates around a loss spike. Before the spike, updates concentrate near zero. At the spike step, the distribution forms two sharp modes near −4 × 10⁻⁴ and 4 × 10⁻⁴. After the spike, update magnitudes become dispersed across parameters. (b) Evolution of the residual probability mass ϵ across architectures. Before a sp…
Figure 3
Figure 3: Mitigation Study. (a) Adam's ε. Increasing the optimizer's ε parameter mitigates instability: while ε = 10⁻⁶ reduces the frequency of spikes, setting ε = 10⁻⁵ completely eliminates them. (b) Layer Norm. Applying LN changes the evolution of the last layer norm from a continuous trajectory to a distinct stepwise pattern. Notably, LN significantly increases the magnitude of the last layer norm. (c) Batch Norm…
Figure 4
Figure 4: (a) Training Loss with different learning rates. (b) Train Loss with label smoothing.
A.1.3 Bias
We find that including a bias term in the classification layer accelerates the occurrence of Slingshots. This instability stems from a significant scale discrepancy between the updates of weights and biases. The gradient with respect to the weight W_k is scaled by the feature vector h: $\nabla_{W_k} L = (\hat{y}_k - y_k)\,h$ (9). In…
Figure 5
Figure 5: Coexistence of EOS and Slingshot Phenomena. (a) Training Loss showing early EOS oscillations versus late-stage numerical spikes. (b) Evolution of Maximum Hessian Eigenvalue λ_max.
Cohen et al. [3] identified the phenomenon of “Progressive Sharpening” in neural network training, where the maximum eigenvalue of the Hessian, λ_max, steadily increases until it reaches the stability threshold 2/η. Upon breaching …
Figure 6
Figure 6: Slingshot in Transformer on modular division.
Figure 7
Figure 7: Slingshot in MLP on modular division.
Figure 8
Figure 8: Slingshot in MLP on CIFAR-10. [Plot panels: Train Loss (log scale) vs. Epoch; Average Target Logit vs. Epoch.]
Figure 9
Figure 9: Slingshot in VGG11 on CIFAR-10. [Plot panels: Train Loss (log scale) vs. Epoch; Average Target Logit vs. Epoch.]
Figure 10
Figure 10: Slingshot in VGG11 with BN on CIFAR-10. [Plot panels: Train Loss (log scale) vs. Epoch; Average Target Logit vs. Epoch.]
Figure 11
Figure 11: Slingshot in ViT on CIFAR-10.
Figure 12
Figure 12: No Slingshot in ResNet18 on CIFAR-10. [Plot: Average Logit vs. Steps for FP32 and FP64.]
Figure 13
Figure 13: The average logit of different precisions in LLM training.
Original abstract

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that slingshot loss spikes during unregularized long-term DNN training arise from floating-point precision limits rather than intrinsic optimization dynamics. When logit differences exceed an absorption-error threshold in high-confidence regimes, backpropagation through softmax+cross-entropy rounds the correct-class gradient exactly to zero while incorrect-class gradients remain nonzero; this breaks the zero-sum property, induces a systematic drift in classifier weights, and creates a positive feedback loop (Numerical Feature Inflation, NFI) that drives exponential growth in both global classifier means and global feature means. The work further distinguishes NFI from visible loss spikes and reinterprets slingshot events as numerical artifacts of finite-precision training.

Significance. If the central mechanism holds, the paper supplies a concrete, testable numerical account of late-stage parameter-norm growth and logit divergence that is independent of optimizer choice or regularization. It shifts explanatory focus from continuous dynamics to discrete rounding behavior and offers a route to mitigation via precision control or explicit zero-sum enforcement. The absence of free parameters in the core argument and the provision of a falsifiable prediction (spikes vanish under higher precision) are notable strengths.
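The precision-control route can be illustrated with a mixed-precision toy: the logits stay float32, only the loss gradient is computed in float64, and the result is cast back. This mirrors the Figure 1 caption's remark that casting only the logits/loss computation to float64 removes the spikes, but the code is a sketch under assumptions (emulated float32 via `struct`, an arbitrary three-class example, hypothetical helper names), not the authors' implementation.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def grad_f64(logits, target):
    """Softmax cross-entropy gradient p - y, computed entirely in float64."""
    zmax = max(logits)
    exps = [math.exp(z - zmax) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if k == target else 0.0)
            for k, e in enumerate(exps)]


# Logits as a float32 network would produce them (hypothetical values).
logits32 = [f32(30.0), f32(0.0), f32(0.0)]

# Upcast for the loss math, then cast each gradient entry back to float32.
grad32 = [f32(g) for g in grad_f64(logits32, 0)]

print("gradient after cast back to float32:", grad32)
print("sum:", sum(grad32))
```

The cast back to float32 keeps the correct-class entry nonzero because float32 has ample exponent range for values near 1e-13. What it lacks is the significand width to represent 1 − 1e-13, which is why only the subtraction needs the wider format in this example.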

major comments (3)
  1. [§3 (NFI derivation)] The proof that correct-class gradient rounds exactly to zero while incorrect-class gradients remain nonzero once the logit gap exceeds the absorption threshold is load-bearing for the entire NFI claim. In IEEE-754 arithmetic the gradient vector is (p-y) scaled by upstream factors; when one logit dominates, all p_i for i≠correct are already near machine epsilon, so the same rounding that zeros the correct term can also zero or denormalize the incorrect terms. The manuscript must supply the explicit floating-point analysis or simulation (with concrete mantissa/exponent values) showing that the imbalance persists across steps rather than being restored by simultaneous rounding of the incorrect-class terms.
  2. [§4 (positive-feedback analysis)] The exponential-growth derivation for global classifier and feature means assumes the drift compounds without damping from the simultaneous update of the feature extractor. The feedback loop must be written out step-by-step, showing why the feature-mean inflation is not counteracted by the finite dynamic range of activations or by the weight updates themselves. Without this, the claim that NFI produces unbounded exponential growth remains unverified.
  3. [§5 (experiments)] The experimental section should demonstrate that the reported spikes disappear when training is repeated in float64 or with explicit gradient clipping to enforce zero-sum, and should quantify the logit-difference threshold at which absorption begins for the specific model and dataset used. Current results appear to rely on the default float32 behavior without these controls.
minor comments (2)
  1. [§2] The notation for NFI is introduced without a compact mathematical definition; a single equation summarizing the drift term would improve readability.
  2. [Figures 2-4] Figure captions should explicitly state the floating-point format and the presence/absence of any gradient clipping or normalization.
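Major comment 1 asks for concrete exponent values. As a rough orientation only, the sketch below estimates the float32 window of logit gaps in which the correct-class term is absorbed while an incorrect-class probability is still nonzero. It assumes two classes, round-to-nearest-even, and no flush-to-zero of subnormals (hardware may differ). The constants and helper names are illustrative and not taken from the manuscript.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


LN2 = math.log(2.0)

absorb_from = 24 * LN2      # 1 + e^-gap rounds to 1 once e^-gap <= 2^-24
subnormal_from = 126 * LN2  # e^-gap drops below the smallest normal, 2^-126
zero_from = 150 * LN2       # e^-gap rounds to 0 once it is <= 2^-150

print(f"correct-class term absorbed for gap >= {absorb_from:.2f}")
print(f"incorrect-class term subnormal for gap > {subnormal_from:.2f}")
print(f"incorrect-class term zero for gap >= {zero_from:.2f}")

# Spot check at gap = 30: decompose the incorrect-class probability.
p_wrong = f32(math.exp(-30.0))
mantissa, exponent = math.frexp(p_wrong)   # p_wrong = mantissa * 2**exponent
print(f"p_wrong = {mantissa:.4f} * 2^{exponent}")
print("1 + p_wrong rounds to 1:", f32(1.0 + p_wrong) == 1.0)
```

In this two-class setting the imbalance window spans roughly gaps of 17 to 87 before the incorrect-class term even becomes subnormal, so simultaneous rounding of both terms would require a far larger gap than the one that triggers absorption.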

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to strengthen the numerical analysis and experimental validation.

Point-by-point responses
  1. Referee: [§3 (NFI derivation)] The proof that correct-class gradient rounds exactly to zero while incorrect-class gradients remain nonzero once the logit gap exceeds the absorption threshold is load-bearing. In IEEE-754 the same rounding that zeros the correct term can also zero or denormalize the incorrect terms. The manuscript must supply explicit floating-point analysis or simulation with concrete mantissa/exponent values showing the imbalance persists.

    Authors: We thank the referee for this important clarification. Section 3 derives the absorption threshold based on the (p-y) scaling and shows that for logit gaps exceeding ~20 in float32 the correct-class term rounds to zero while incorrect-class terms (scaled by small but nonzero p_i) remain above the denormal threshold in the relevant regime. To make the persistence explicit, we will add a dedicated floating-point error analysis subsection with concrete mantissa/exponent calculations and a minimal simulation demonstrating that the gradient imbalance is not simultaneously restored over successive steps. revision: yes

  2. Referee: [§4 (positive-feedback analysis)] The exponential-growth derivation assumes the drift compounds without damping from the simultaneous update of the feature extractor. The feedback loop must be written out step-by-step, showing why feature-mean inflation is not counteracted by finite dynamic range of activations or weight updates.

    Authors: We agree that a more granular exposition is needed. In the revision we will expand §4 with an explicit per-step breakdown of one full forward-backward-update cycle, deriving the compounded growth factor while bounding the damping from activation saturation and feature-extractor updates. We show that within the high-confidence regime the logit-gap amplification outpaces these damping effects until the spike threshold is reached. revision: yes

  3. Referee: [§5 (experiments)] The experimental section should demonstrate that the reported spikes disappear when training is repeated in float64 or with explicit gradient clipping to enforce zero-sum, and should quantify the logit-difference threshold at which absorption begins for the specific model and dataset used.

    Authors: We will add the requested controls: (i) identical training runs in float64, where spikes are expected to be absent or substantially delayed; (ii) runs with explicit zero-sum enforcement via gradient normalization; and (iii) direct measurement of the logit-difference threshold at which absorption occurs for the models and datasets in the paper. These results will be reported with quantitative thresholds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation follows from floating-point rules and softmax gradient properties

full rationale

The paper's central derivation of Numerical Feature Inflation starts from standard IEEE-754 absorption thresholds and the zero-sum property of softmax+cross-entropy gradients. Once the logit gap exceeds the absorption threshold, the claimed rounding of the correct-class gradient to exactly zero (while incorrect-class terms remain nonzero) is presented as a direct numerical consequence rather than a fitted or self-defined quantity. The subsequent positive-feedback loop between classifier drift and feature-mean growth is then derived algebraically from this imbalance without invoking self-citations, parameter fits to the target phenomenon, or renaming of prior results. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on two standard properties of floating-point arithmetic and classification losses plus the newly introduced concept of numerical feature inflation; no free parameters are stated.

axioms (2)
  • [standard math] Floating-point addition has an absorption threshold: an addend smaller than half a unit in the last place of the larger operand is rounded away, so the computed difference from the larger operand is exactly zero.
    Invoked to explain why the correct-class gradient becomes exactly zero.
  • [domain assumption] Gradients of the softmax cross-entropy loss across classes sum to zero.
    Used to claim that rounding one gradient to zero breaks the zero-sum constraint.
invented entities (1)
  • Numerical Feature Inflation (NFI) [no independent evidence]
    purpose: Describes the positive feedback loop between drifted classifier weights and growing feature norms.
    Newly named mechanism introduced to explain the exponential growth; no independent falsifiable prediction is given in the abstract.
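The zero-sum axiom and its failure mode can be written out in two lines. The sketch below uses the weight-gradient expression quoted in the Figure 4 caption, ∇_{W_k} L = (ŷ_k − y_k) h, for a single sample with a one-hot label y and correct class c. The notation fl(·) for the rounded result is an assumption introduced here.

```latex
\[
\sum_{k=1}^{K} (\hat{y}_k - y_k) = 1 - 1 = 0
\quad\Longrightarrow\quad
\sum_{k=1}^{K} \nabla_{W_k} L
= \Big( \sum_{k=1}^{K} (\hat{y}_k - y_k) \Big)\, h = 0 .
\]
\[
\operatorname{fl}(\hat{y}_c - y_c) = 0, \quad \hat{y}_k > 0 \ (k \neq c)
\quad\Longrightarrow\quad
\sum_{k=1}^{K} \nabla_{W_k} L
= \Big( \sum_{k \neq c} \hat{y}_k \Big)\, h \neq 0 \qquad (h \neq 0).
\]
```

In exact arithmetic the class-summed weight gradient vanishes, so the mean of the classifier rows receives no gradient signal. Once the correct-class term is rounded away, that mean receives a gradient component along the feature h, which is the drift the ledger's second axiom refers to. How strongly the optimizer amplifies it is outside this identity.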

pith-pipeline@v0.9.0 · 5569 in / 1451 out tokens · 69215 ms · 2026-05-13T07:06:05.189778+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.