Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Pith reviewed 2026-05-13 07:06 UTC · model grok-4.3
The pith
Floating-point precision limits trigger slingshot loss spikes by creating numerical feature inflation in neural network training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the slingshot loss spike is produced by Numerical Feature Inflation (NFI). In the high-confidence stage, logit differences surpass the absorption-error threshold of floating-point arithmetic, so that the gradient for the correct class is rounded to zero while gradients for incorrect classes remain nonzero. This breaks the zero-sum constraint on class gradients and introduces a net drift in the parameter update of the classifier layer. The drift couples with the feature vectors to form a positive feedback loop, causing the global means of both the classifier weights and the features to grow exponentially. The resulting inflation produces the rapid norm increase that,
What carries the argument
Numerical Feature Inflation (NFI), the exponential growth of classifier and feature means that follows from selective rounding of the correct-class gradient to zero once logit differences exceed the floating-point absorption threshold.
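The feedback loop can be illustrated with a toy linear recurrence. The sketch below is not the paper's proof. It assumes the net gradient imbalance ε is held constant, replaces the classifier mean and the feature mean with two hypothetical scalars `a` and `b`, and keeps only the coupling through the mean: an SGD step on the classifier rows moves their mean by -(ηε/K)·b, and the step on the feature moves it by -ηε·a. Under those assumptions the combined norm grows by a factor of 1 + ηε/√K per step, which has the same form as the rate quoted from Theorem 3.7 further down this page.

```python
import math

def simulate_mean_drift(eta_eps, num_classes, steps, a0=1.0, b0=0.0):
    """Iterate a toy two-variable recurrence (hypothetical helper).

    a: scalar stand-in for the classifier-weight mean
    b: scalar stand-in for the feature mean
    eta_eps: learning rate times the gradient imbalance, assumed constant
    """
    a, b = a0, b0
    norms = [math.hypot(a, b)]
    for _ in range(steps):
        # Simultaneous updates: each mean is pushed by the other one.
        a, b = a - (eta_eps / num_classes) * b, b - eta_eps * a
        norms.append(math.hypot(a, b))
    return norms

norms = simulate_mean_drift(eta_eps=0.1, num_classes=4, steps=300)
observed = norms[-1] / norms[-2]          # late-time growth factor per step
predicted = 1.0 + 0.1 / math.sqrt(4)      # 1 + eta*eps / sqrt(K)
print(observed, predicted)
```

The update matrix has eigenvalues 1 ± ηε/√K, so after enough steps the growing mode dominates and the per-step ratio settles at the larger eigenvalue. A real network adds damping terms that this sketch ignores.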
If this is right
- Rapid growth in classifier and feature norms before each spike follows directly from the unbalanced parameter drift.
- Gradients reappear and produce the loss spike once the inflated values push the logit differences back into the representable range.
- Partial absorption errors can drive abnormal parameter-norm growth without producing a visible loss spike in many practical training settings.
- The slingshot phenomenon is a numerical artifact of finite-precision computation rather than an intrinsic property of the loss landscape.
Where Pith is reading between the lines
- Using higher-precision arithmetic or adding logit-difference-aware gradient clipping could suppress the onset of these spikes.
- Tracking the growth of logit divergence during late-stage training could provide an early indicator of impending norm inflation.
- The same rounding imbalance may underlie other late-training instabilities, such as sudden divergence or exploding activations, in very deep models.
- Direct comparison of trajectories under float32 versus exact-arithmetic simulation would isolate the contribution of precision limits.
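A logit-gap monitor of the kind suggested above could look like the following sketch. The function names, the mini-batch, and the use of the largest incorrect logit as the reference are all assumptions made for illustration. The threshold of 20 is the approximate float32 figure mentioned in the authors' rebuttal on this page, not a measured value.

```python
def logit_gap(logits, target):
    """Correct-class logit minus the largest incorrect-class logit."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return logits[target] - best_other

def absorbed_fraction(batch_logits, targets, threshold):
    """Fraction of samples whose logit gap exceeds `threshold`."""
    gaps = [logit_gap(z, t) for z, t in zip(batch_logits, targets)]
    return sum(g > threshold for g in gaps) / len(gaps)

# Hypothetical mini-batch: three samples, three classes.
batch = [[25.0, 1.0, 0.5], [3.0, 2.0, -1.0], [0.0, 19.5, -2.0]]
targets = [0, 0, 1]
fraction = absorbed_fraction(batch, targets, threshold=20.0)
print(fraction)
```

A fraction strictly between 0 and 1 corresponds to the partial-absorption case, where only some samples lose their correct-class gradient.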
Load-bearing premise
Once the logit difference exceeds the absorption threshold, the correct-class gradient rounds exactly to zero while the incorrect-class gradients stay nonzero, and this imbalance necessarily creates an exponential positive feedback loop with the features.
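The premise can be checked on a single sample. The sketch below emulates float32 by rounding every intermediate result with the standard `struct` module, then computes the softmax cross-entropy gradient p - y. It is an illustration under assumptions: the helper names are hypothetical, the exponential is rounded from a float64 result and need not match a real float32 kernel bit for bit, and frameworks that use fused log-softmax kernels may behave differently.

```python
import math
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def softmax_grad_f32(logits, target):
    """Gradient p - y of softmax cross-entropy, rounded to float32 at each step."""
    m = max(logits)
    exps = [f32(math.exp(f32(z - m))) for z in logits]
    total = 0.0
    for e in exps:
        total = f32(total + e)            # small terms can be absorbed here
    probs = [f32(e / total) for e in exps]
    return [f32(p - (1.0 if i == target else 0.0)) for i, p in enumerate(probs)]

# Ten classes, correct class 0; the gap is 20 in one case and 10 in the other.
wide = softmax_grad_f32([20.0] + [0.0] * 9, target=0)
narrow = softmax_grad_f32([10.0] + [0.0] * 9, target=0)
print("gap 20:", wide[0], wide[1], sum(wide))
print("gap 10:", narrow[0], narrow[1], sum(narrow))
```

At a gap of 20 the denominator rounds to exactly 1, so the correct-class entry is exactly zero while the other entries stay positive and the sum is no longer zero. At a gap of 10 the correct-class entry is negative and the entries cancel up to rounding error.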
What would settle it
Training identical models with higher-precision arithmetic (such as float64) or with enforced logit differences kept below the absorption threshold and observing whether the periodic loss spikes and rapid parameter-norm growth still appear.
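To see why a float64 rerun is a meaningful control, the absorption point can be estimated per format. The sketch below assumes an idealized sample with one dominant logit and K-1 equal competitors, round-to-nearest arithmetic, and a max-subtracted softmax. Under those assumptions the denominator 1 + (K-1)·exp(-gap) rounds to 1 once the gap exceeds ln(K-1) + p·ln 2, where p is the significand width including the implicit bit. The function name is hypothetical, and the thresholds for a real model and dataset would have to be measured.

```python
import math

def absorption_gap(significand_bits, num_classes):
    """Logit gap above which 1 + (K-1)*exp(-gap) rounds to exactly 1."""
    return math.log(num_classes - 1) + significand_bits * math.log(2.0)

formats = [("bfloat16", 8), ("float16", 11), ("float32", 24), ("float64", 53)]
for name, bits in formats:
    print(name, round(absorption_gap(bits, num_classes=10), 2))

# Python floats are float64, so the float64 estimate can be checked directly.
K = 10
survives_at_20 = (1.0 + (K - 1) * math.exp(-20.0)) != 1.0
absorbed_at_40 = (1.0 + (K - 1) * math.exp(-40.0)) == 1.0
print(survives_at_20, absorbed_at_40)
```

With ten classes this estimate gives about 18.8 for float32, in line with the rough figure of 20 in the rebuttal, and a much larger gap for float64. Under this model higher precision postpones absorption rather than ruling it out.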
Original abstract
Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that slingshot loss spikes during unregularized long-term DNN training arise from floating-point precision limits rather than intrinsic optimization dynamics. When logit differences exceed an absorption-error threshold in high-confidence regimes, backpropagation through softmax+cross-entropy rounds the correct-class gradient exactly to zero while incorrect-class gradients remain nonzero; this breaks the zero-sum property, induces a systematic drift in classifier weights, and creates a positive feedback loop (Numerical Feature Inflation, NFI) that drives exponential growth in both global classifier means and global feature means. The work further distinguishes NFI from visible loss spikes and reinterprets slingshot events as numerical artifacts of finite-precision training.
Significance. If the central mechanism holds, the paper supplies a concrete, testable numerical account of late-stage parameter-norm growth and logit divergence that is independent of optimizer choice or regularization. It shifts explanatory focus from continuous dynamics to discrete rounding behavior and offers a route to mitigation via precision control or explicit zero-sum enforcement. The absence of free parameters in the core argument and the provision of a falsifiable prediction (spikes vanish under higher precision) are notable strengths.
major comments (3)
- [§3 (NFI derivation)] The proof that the correct-class gradient rounds exactly to zero while the incorrect-class gradients remain nonzero once the logit gap exceeds the absorption threshold is load-bearing for the entire NFI claim. In IEEE-754 arithmetic the gradient vector is (p-y) scaled by upstream factors; when one logit dominates, all p_i for i≠correct are already near machine epsilon, so the same rounding that zeros the correct term can also zero or denormalize the incorrect terms. The manuscript must supply an explicit floating-point analysis or simulation (with concrete mantissa/exponent values) showing that the imbalance persists across steps rather than being restored by simultaneous rounding of the incorrect-class terms.
- [§4 (positive-feedback analysis)] The exponential-growth derivation for global classifier and feature means assumes the drift compounds without damping from the simultaneous update of the feature extractor. The feedback loop must be written out step-by-step, showing why the feature-mean inflation is not counteracted by the finite dynamic range of activations or by the weight updates themselves. Without this, the claim that NFI produces unbounded exponential growth remains unverified.
- [§5 (experiments)] The experimental section should demonstrate that the reported spikes disappear when training is repeated in float64 or with explicit gradient clipping to enforce zero-sum, and should quantify the logit-difference threshold at which absorption begins for the specific model and dataset used. Current results appear to rely on the default float32 behavior without these controls.
minor comments (2)
- [§2] The notation for NFI is introduced without a compact mathematical definition; a single equation summarizing the drift term would improve readability.
- [Figures 2-4] Figure captions should explicitly state the floating-point format and the presence/absence of any gradient clipping or normalization.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to strengthen the numerical analysis and experimental validation.
Point-by-point responses
- Referee: [§3 (NFI derivation)] The proof that correct-class gradient rounds exactly to zero while incorrect-class gradients remain nonzero once the logit gap exceeds the absorption threshold is load-bearing. In IEEE-754 the same rounding that zeros the correct term can also zero or denormalize the incorrect terms. The manuscript must supply explicit floating-point analysis or simulation with concrete mantissa/exponent values showing the imbalance persists.
  Authors: We thank the referee for this important clarification. Section 3 derives the absorption threshold based on the (p-y) scaling and shows that for logit gaps exceeding ~20 in float32 the correct-class term rounds to zero while incorrect-class terms (scaled by small but nonzero p_i) remain above the denormal threshold in the relevant regime. To make the persistence explicit, we will add a dedicated floating-point error analysis subsection with concrete mantissa/exponent calculations and a minimal simulation demonstrating that the gradient imbalance is not simultaneously restored over successive steps.
  Revision: yes
- Referee: [§4 (positive-feedback analysis)] The exponential-growth derivation assumes the drift compounds without damping from the simultaneous update of the feature extractor. The feedback loop must be written out step-by-step, showing why feature-mean inflation is not counteracted by finite dynamic range of activations or weight updates.
  Authors: We agree that a more granular exposition is needed. In the revision we will expand §4 with an explicit per-step breakdown of one full forward-backward-update cycle, deriving the compounded growth factor while bounding the damping from activation saturation and feature-extractor updates. We show that within the high-confidence regime the logit-gap amplification outpaces these damping effects until the spike threshold is reached.
  Revision: yes
- Referee: [§5 (experiments)] The experimental section should demonstrate that the reported spikes disappear when training is repeated in float64 or with explicit gradient clipping to enforce zero-sum, and should quantify the logit-difference threshold at which absorption begins for the specific model and dataset used.
  Authors: We will add the requested controls: (i) identical training runs in float64, where spikes are expected to be absent or substantially delayed; (ii) runs with explicit zero-sum enforcement via gradient normalization; and (iii) direct measurement of the logit-difference threshold at which absorption occurs for the models and datasets in the paper. These results will be reported with quantitative thresholds.
  Revision: yes
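The rebuttal does not say how zero-sum enforcement via gradient normalization would be implemented. One possible reading, shown here purely as an assumption, is to subtract the across-class mean from the per-class logit gradient before it is propagated further. The function name and the example values are hypothetical.

```python
def enforce_zero_sum(class_grads):
    """Subtract the across-class mean so the entries sum to (nearly) zero."""
    mean = sum(class_grads) / len(class_grads)
    return [g - mean for g in class_grads]

# Hypothetical absorbed gradient: correct class rounded to 0, others tiny but positive.
absorbed = [0.0, 2e-9, 2e-9, 2e-9]
centered = enforce_zero_sum(absorbed)
print(sum(absorbed), sum(centered))
```

Centering restores a negative entry for the correct class, so the net drift term vanishes up to rounding. Whether this matches the control the authors intend is not stated on this page.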
Circularity Check
No significant circularity; derivation follows from floating-point rules and softmax gradient properties
Full rationale
The paper's central derivation of Numerical Feature Inflation starts from standard IEEE-754 absorption thresholds and the zero-sum property of softmax+cross-entropy gradients. Once the logit gap exceeds the absorption threshold, the claimed rounding of the correct-class gradient to exactly zero (while incorrect-class terms remain nonzero) is presented as a direct numerical consequence rather than a fitted or self-defined quantity. The subsequent positive-feedback loop between classifier drift and feature-mean growth is then derived algebraically from this imbalance without invoking self-citations, parameter fits to the target phenomenon, or renaming of prior results. No load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Floating-point numbers have an absorption threshold beyond which small differences round to zero.
- domain assumption: Gradients of the softmax cross-entropy loss across classes sum to zero.
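For reference, the second axiom follows in exact arithmetic from the standard softmax cross-entropy gradient. The notation below (logits z_k, probabilities p_k, one-hot label y_k, K classes, correct class c) is chosen here for illustration and is not taken from the paper.

```latex
\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k,
\qquad
\sum_{k=1}^{K} (p_k - y_k) = \sum_{k=1}^{K} p_k - \sum_{k=1}^{K} y_k = 1 - 1 = 0 .
% If the correct-class term p_c - 1 is rounded to 0 while the others survive:
\sum_{k \neq c} p_k > 0 .
```

The last line is the imbalance that the paper's argument relies on: the surviving terms are all positive, so their sum cannot cancel.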
invented entities (1)
- Numerical Feature Inflation (NFI): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint... positive feedback loop... exponential growth"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Theorem 3.7 (Numerical Feature Inflation)... lim ∥W_G(t)∥ ∝ (1 + ηϵ/√K)^t"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.