
arxiv: 2605.17659 · v2 · pith:EYKHEQCAnew · submitted 2026-05-17 · 💻 cs.LG

Bug or Feature²: Weight Drift, Activation Sparsity and Spikes

Pith reviewed 2026-05-22 09:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords weight drift · activation sparsity · ReLU · neural training dynamics · gradient expectation · transformer sparsity · squared activations

The pith

Under MSE or cross-entropy loss, gradients on positive pre-activations are non-negative in expectation at initialization, driving downstream weights negative.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common loss functions interact with positively biased activations to create a negative drift in weights during early training. The expected gradient with respect to positive pre-activations is non-negative at initialization, which pushes downstream weights downward regardless of the data. Across MLPs, ResNets, and transformers, this drift combined with ReLU leads to high activation sparsity, reaching up to 90% in GPT-nano. The authors map the sparsity-accuracy trade-off across 79 configurations and find a sharp accuracy drop above roughly 70% sparsity. They also test squared activations and report that clipping them suppresses activation spikes while keeping the benefits of squaring.

Core claim

We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures and asymmetric activation functions. Coupled with ReLU, it produces activation sparsity reaching up to 90% in GPT-nano, with a sharp accuracy cliff above ~70% activation sparsity. Clipping squared activations resolves the activation spikes that unclipped ReLU² amplifies while preserving the representational benefits of squaring.
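The two quantities named in the core claim, a clipped squared activation and activation sparsity, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names, the clipping threshold `clip=6.0`, the choice to clamp the output rather than the input, and the definition of sparsity as the fraction of exactly-zero post-activations. The paper's own implementation may differ.

```python
def relu_squared(x):
    """ReLU followed by squaring: max(0, x) ** 2."""
    r = max(0.0, x)
    return r * r


def clipped_relu_squared(x, clip=6.0):
    """Squared ReLU with its output clamped at `clip`.

    `clip` is a hypothetical threshold; the source only says that a
    clipping threshold exists, not what value it takes.
    """
    return min(relu_squared(x), clip)


def activation_sparsity(values, eps=0.0):
    """Fraction of post-activations whose magnitude is at most `eps`."""
    return sum(1 for v in values if abs(v) <= eps) / len(values)
```

For example, `activation_sparsity([relu_squared(v) for v in pre_acts])` gives the share of units that are silent for a list of pre-activations `pre_acts`, and replacing `relu_squared` with `clipped_relu_squared` bounds the largest output without changing which units are zero.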

What carries the argument

The non-negative expected gradient with respect to positive pre-activations under standard losses, which induces negative weight drift.
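One way to read the link between the gradient sign and the weight drift is the standard mean-covariance split of a weight gradient. The sketch below assumes a linear layer p = Wx fed by post-activations x and trained with plain SGD at learning rate η; the indices and the symbols W and η are notation introduced here, while Cov(∂ℓ/∂p, x) follows the term named in the Figure 10 caption. It is an illustration, not a restatement of the paper's proof.

```latex
\frac{\partial \ell}{\partial W_{ij}} = \frac{\partial \ell}{\partial p_i}\, x_j,
\qquad
\mathbb{E}\!\left[\frac{\partial \ell}{\partial W_{ij}}\right]
  = \mathbb{E}\!\left[\frac{\partial \ell}{\partial p_i}\right] \mathbb{E}[x_j]
  + \operatorname{Cov}\!\left(\frac{\partial \ell}{\partial p_i},\, x_j\right),
\qquad
\Delta W_{ij} = -\eta\, \frac{\partial \ell}{\partial W_{ij}}.
```

If the expected gradient is non-negative, the activation mean E[x_j] is positive, and the covariance term is comparatively small (the Figure 5 caption reports it as an order of magnitude smaller than the gradient value), then the expected update ΔW_ij is non-positive, which matches the direction of the reported drift.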

Load-bearing premise

The non-negative expected gradient at initialization continues to dominate the early training dynamics without being overridden by data or later effects.

What would settle it

Measuring the sign of gradients for positive pre-activations in a randomly initialized network trained on a simple dataset and checking if they are non-negative on average.
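A toy version of this check can be sketched in plain Python. The setup is hypothetical and much smaller than anything in the paper: a two-layer ReLU MLP with zero-mean Gaussian weights, i.i.d. N(0, 1) inputs and targets, and a per-unit loss 0.5 · (p − y)², so that ∂ℓ/∂p = p − y. The function name and all sizes are assumptions. It averages that gradient over output pre-activations with p > 0.

```python
import random


def mean_grad_on_positive_preacts(seed=0, n=256, d_in=16, d_hid=32, d_out=8):
    """Average dL/dp over entries with p > 0 at random initialization.

    Assumed setup (not taken from the paper): Gaussian inputs and targets,
    zero-mean Gaussian weights, loss 0.5 * (p - y)^2 so dL/dp = p - y.
    """
    rng = random.Random(seed)
    # Zero-mean, finite-variance initialization for both layers.
    w1 = [[rng.gauss(0.0, (2.0 / d_in) ** 0.5) for _ in range(d_hid)]
          for _ in range(d_in)]
    w2 = [[rng.gauss(0.0, (1.0 / d_hid) ** 0.5) for _ in range(d_out)]
          for _ in range(d_hid)]
    total, count = 0.0, 0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
        # Hidden layer with ReLU, then linear output pre-activations p.
        h = [max(0.0, sum(x[i] * w1[i][j] for i in range(d_in)))
             for j in range(d_hid)]
        p = [sum(h[j] * w2[j][k] for j in range(d_hid)) for k in range(d_out)]
        y = [rng.gauss(0.0, 1.0) for _ in range(d_out)]
        for pk, yk in zip(p, y):
            if pk > 0.0:          # keep only positive pre-activations
                total += pk - yk  # dL/dp for the squared-error loss
                count += 1
    return total / count
```

In this toy setting the average comes out positive because the targets are independent of the network and zero-mean, so conditioning on p > 0 leaves only the positive mean of p. Whether the same holds on structured data, under cross-entropy, or after several updates is exactly what a fuller experiment would need to show.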

Figures

Figures reproduced from arXiv: 2605.17659 by Aleksandr Serkov, Egor Shvetsov, Evgeny Burnaev, Redko Dmitry, Shokorov Viacheslav, Vladislav Goloshchapov.

Figure 1: Weight drift measured as average absolute Z-score per layer over the first 100 training steps for an MLP trained on CIFAR-10. Momentum accelerates initial weight changes and leads to rapid convergence toward asymptotic drift levels, while plain SGD exhibits slower progression. Results are in log scale.
Figure 2: Random inputs with MSE loss.
Figure 3: CIFAR-10 with cross-entropy loss.
Figure 5: Weight drift in MP-SENet. Training dynamics under GELU, ReLU, and SiLU with the AdamW optimizer (lr = 5 × 10⁻⁴), averaged across all model layers. Drift patterns are qualitatively consistent with those observed in the MLP and ResNet settings, with a sharp weight change again emerging at very early iterations. The covariance term is an order of magnitude smaller than the gradient value.
Figure 6: Scaled performance versus post-activation sparsity. The dashed curve denotes the fitted power-law decay model, with statistics confirming that sparsity level is the dominant predictor of accuracy (R² = 0.565), while the choice of mechanism (Top-K vs. PC) contributes only a marginal incremental gain (ΔR² = +0.053).
Figure 7: Input ranges into up- and down-projections across layer indices, aggregated over 21 runs for different activation functions, normalization and sparsification strategies. Runs with unclipped squared functions are excluded.
Figure 8: Weight drift before and after Accumulation Stop. Average weight drift (mean |Z-score|) for GELU, ReLU, and TopKSparseGELU-50 under Standard, PercentileBN, and PercentileLN normalization strategies. The dashed vertical line marks the AS boundary at step 100. Solid lines: without AS. Dashed lines: with frozen EMA statistics. Trajectories remain continuous and stable after the boundary across all configurati…
Figure 9: Effect of Top-K sparsity on weight drift. A three-layer MLP with GELU activations is trained on random {X, Y} pairs sampled from N(0, 1) under four Top-K retention levels (K ∈ {0.10, 0.25, 0.75, 0.90}). Top row: gradient mean per layer. Bottom row: weight mean per layer. Trajectories are averaged across 20 runs (hidden dimension 128, SGD with lr = 0.001). As K increases, gradient bias and weight drift gro…
Figure 10: Weight drift in ResNet-18. Training dynamics under GELU, ReLU, and SiLU with Adam optimizer (lr = 10⁻³). Four metrics are tracked: (1) average gradient mean, (2) gradient norm, (3) weight mean, and (4) covariance term Cov(∂ℓ/∂p, x).
Figure 11: Training dynamics of an MLP without (top) and with (bottom) skip connection (see Appendix G.2 for architecture details), under GELU, NoisyReLU, ReLU, SiLU, and SUGARBSiLU with Adam optimizer (lr = 10⁻³). Four metrics are tracked over 1000 training steps: (1) average gradient magnitude, (2) gradient standard deviation, (3) weight drift (Z-score), and (4) fraction of sparse activations. In the plain MLP, Re…
Figure 14: Weight standard deviation averaged across sublayers per layer index, aggregated over 23 runs for different activation functions, normalization and sparsification strategies. All runs show a consistent monotonic increase. Squared activation (orange) exhibits slightly elevated values in later layers.
Figure 13: GPT-nano performance with different activation functions, sparsification and normalization approaches. Transparent bars reflect train loss and solid lines reflect test loss.
Figure 12: DiT-S/2 generated samples across configurations at 50K, 100K, and 300K training steps. Columns correspond to GELU + LayerNorm, GELU + Percentile LayerNorm, ReLU + LayerNorm, ReLU + Percentile LayerNorm, and Top-K GELU + LayerNorm. All configurations produce coherent compositions by 50K steps. High-frequency differences become apparent at later checkpoints: the baseline ReLU configuration produces blurred …
Original abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU² achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU² outperforms its unclipped version, and GELU² achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that standard losses (MSE or cross-entropy) induce a negative weight drift when combined with positively biased activations (ReLU, GELU, SiLU). It proves that the expected gradient w.r.t. positive pre-activations is non-negative at random zero-mean initialization, which drives downstream weights negative in early training. This effect produces high activation sparsity (up to 90% in GPT-nano with ReLU) and is characterized empirically across MLP, ResNet, ViT, GPT-nano and MP-SENet architectures in 79 configurations. The authors further examine the sparsity-accuracy tradeoff, identify an accuracy cliff above ~70% sparsity, and propose clipped ReLU² and GELU² variants that improve the tradeoff while mitigating activation spikes.

Significance. If the central claim holds, the work supplies a clean theoretical account of an intrinsic optimization dynamic that explains a widespread empirical observation in ReLU-based networks. The parameter-free derivation from standard initialization and loss assumptions, the multi-architecture empirical support, and the public code release are strengths. The sparsity-accuracy characterization and the practical recommendation for clipped squared activations are directly useful to practitioners training transformers.

major comments (1)
  1. The proof establishes non-negative E[∂L/∂a] for positive pre-activations a at t=0 under zero-mean finite-variance initialization. The subsequent claim that this drives downstream weights negative 'during early training' requires that the initialization-time expectation is not immediately reversed by data-dependent terms or accumulated updates. No explicit bound or post-initialization simulation is provided showing the sign remains positive after even one epoch on real data; this step is load-bearing for the 'early training' part of the central claim.
minor comments (1)
  1. The description of the 79 configurations and the exact data-exclusion rules used in the sparsity-accuracy plots could be stated more explicitly to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, the recognition of the work's strengths, and the recommendation for minor revision. We address the major comment below.

Point-by-point responses
  1. Referee: The proof establishes non-negative E[∂L/∂a] for positive pre-activations a at t=0 under zero-mean finite-variance initialization. The subsequent claim that this drives downstream weights negative 'during early training' requires that the initialization-time expectation is not immediately reversed by data-dependent terms or accumulated updates. No explicit bound or post-initialization simulation is provided showing the sign remains positive after even one epoch on real data; this step is load-bearing for the 'early training' part of the central claim.

    Authors: We agree that the formal result is stated at initialization (t=0). The manuscript's 'early training' language is grounded in the rapid emergence of high activation sparsity (up to 90%) within the first few epochs across all 79 configurations and five architectures, as documented in Section 4 and the associated figures. To directly strengthen this link, the revised manuscript will add a new supplementary figure that tracks the sign and magnitude of the expected gradient w.r.t. positive pre-activations over the first 10 epochs on CIFAR-10 (ResNet) and WikiText (GPT-nano). These plots confirm that the non-negative bias persists in practice during early training before data-dependent terms dominate. revision: yes

Circularity Check

0 steps flagged

No circularity: gradient non-negativity derived from standard initialization and loss definitions

full rationale

The paper's central derivation establishes that E[∂L/∂a] ≥ 0 for positive pre-activations a under MSE or cross-entropy at random zero-mean finite-variance initialization. This follows directly from the definitions of the loss functions and the symmetry properties of the initial weight distribution without any fitted parameters, data-dependent terms, or self-referential predictions. No load-bearing step reduces to a self-citation chain, an ansatz smuggled from prior work, or a renaming of an empirical pattern. The claim that this drives negative weight drift in early training is presented as a consequence rather than a fitted or self-defined quantity. The derivation is therefore self-contained against external mathematical benchmarks and receives the default non-circularity assessment.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard mathematical assumptions about random initialization and loss functions, plus a small number of experimental hyperparameters such as clipping thresholds; no new entities are postulated.

free parameters (1)
  • clipping threshold
    Hyperparameter selected to suppress activation spikes while retaining sparsity benefits of squared activations.
axioms (2)
  • standard math Weights initialized from zero-mean distribution with finite variance
    Invoked to compute the expected non-negative gradient at initialization.
  • domain assumption Activation functions are positively biased (output non-negative for positive inputs)
    Required for the interaction between loss gradients and pre-activations to produce the drift.
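
The two axioms interact through the mean of the activation output: with zero-mean inputs, an asymmetric activation still produces a positive average. The sketch below reads "positively biased" in that sense, which is an interpretation rather than the ledger's wording, and estimates E[f(z)] for z ~ N(0, 1) with a midpoint rule. The helper name `gaussian_mean` and the integration bounds are assumptions. For reference, the closed forms are 1/√(2π) ≈ 0.3989 for ReLU and 1/(2√π) ≈ 0.2821 for GELU.

```python
import math


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z):
    # SiLU (swish): z * sigmoid(z).
    return z / (1.0 + math.exp(-z))


def gaussian_mean(f, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule estimate of E[f(z)] for z ~ N(0, 1)."""
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        total += f(z) * math.exp(-0.5 * z * z)
    return total * dz / math.sqrt(2.0 * math.pi)
```

Calling `gaussian_mean` on `relu`, `gelu`, and `silu` gives a positive value in each case, whereas an odd function such as the identity averages to zero under the same input distribution.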

pith-pipeline@v0.9.0 · 5801 in / 1502 out tokens · 58240 ms · 2026-05-22T09:11:52.912307+00:00 · methodology


