Bug or Feature²: Weight Drift, Activation Sparsity and Spikes
Pith reviewed 2026-05-22 09:11 UTC · model grok-4.3
The pith
Under MSE or cross-entropy loss, gradients on positive pre-activations are non-negative in expectation at initialization, driving downstream weights negative.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures and asymmetric activation functions. Coupled with ReLU, the drift produces activation sparsity of up to 90% in GPT-nano, and accuracy drops sharply above ~70% activation sparsity. ReLU² amplifies activation spikes in intermediate transformer layers; clipping resolves this while preserving the representational benefits of squaring.
What carries the argument
The non-negative expected gradient with respect to positive pre-activations under standard losses, which induces negative weight drift.
Load-bearing premise
The non-negative expected gradient at initialization continues to dominate the early training dynamics without being overridden by data or later effects.
What would settle it
Measure the sign of the gradients on positive pre-activations in a randomly initialized network trained on a simple dataset, and check whether they are non-negative on average.
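A check of this kind can be sketched on a toy model. The snippet below is a minimal illustration, not the paper's experiment: it assumes a single hidden ReLU layer with a scalar linear readout and MSE loss, keeps the pre-activations and target fixed, and resamples only the zero-mean readout weights. The function names, the pre-activation values, the target, and `sigma` are all hypothetical choices made for the example.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def grad_wrt_preactivations(p, w, target):
    """Gradient of L = 0.5 * (y - target)**2 w.r.t. each pre-activation p[i],
    where y = sum_i w[i] * relu(p[i])."""
    y = sum(wi * relu(pi) for wi, pi in zip(w, p))
    err = y - target
    # Chain rule: dL/dp_i = (y - target) * w_i * 1[p_i > 0]
    return [err * wi * (1.0 if pi > 0.0 else 0.0) for wi, pi in zip(w, p)]


def estimate_expected_grad(p=(-1.0, -0.3, 0.5, 1.0, 2.0), target=1.0,
                           sigma=0.5, n_draws=20000, seed=0):
    """Monte Carlo estimate of E_w[dL/dp_i] over fresh zero-mean readout weights.
    All default values are illustrative assumptions."""
    rng = random.Random(seed)
    sums = [0.0] * len(p)
    for _ in range(n_draws):
        # Resample the readout weights from a zero-mean Gaussian each draw.
        w = [rng.gauss(0.0, sigma) for _ in p]
        g = grad_wrt_preactivations(p, w, target)
        for i, gi in enumerate(g):
            sums[i] += gi
    return list(p), [s / n_draws for s in sums]


if __name__ == "__main__":
    p, means = estimate_expected_grad()
    for pi, m in zip(p, means):
        print(f"p = {pi:+.2f}   mean dL/dp = {m:+.4f}")
```

In this toy setting the averaged gradient is exactly zero for units with non-positive pre-activation and should come out positive for the active units; a real test of the claim would repeat the measurement in the architectures and losses the paper studies.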
Original abstract
The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU² achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU² outperforms its unclipped version, and GELU² achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
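The activation variants named in the abstract can be written down in a few lines. This is a hedged sketch rather than the authors' implementation: it assumes that ReLU² and GELU² mean the elementwise square of the base activation, that clipping caps the squared output, and that activation sparsity is the fraction of exactly-zero outputs. The `clip` default and all function names are hypothetical; the paper's repository is the authority on the actual definitions.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu2(x):
    # Assumption: ReLU^2 is the square of the ReLU output.
    return relu(x) ** 2


def gelu2(x):
    # Assumption: GELU^2 is the square of the GELU output.
    return gelu(x) ** 2


def clipped_relu2(x, clip=6.0):
    # Assumption: clipping caps the squared output; `clip` is a hypothetical threshold.
    return min(relu(x) ** 2, clip)


def activation_sparsity(values, eps=0.0):
    """Fraction of activations whose magnitude is at most eps (exact zeros by default)."""
    values = list(values)
    return sum(1 for v in values if abs(v) <= eps) / len(values)
```

For example, `activation_sparsity([relu2(x) for x in xs])` reports how many units in a list of pre-activations `xs` are switched off. Under this reading, squaring leaves ReLU's zeros untouched while growing large activations quadratically, which is the regime where an output cap matters.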
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard losses (MSE or cross-entropy) induce a negative weight drift when combined with positively biased activations (ReLU, GELU, SiLU). It proves that the expected gradient w.r.t. positive pre-activations is non-negative at random zero-mean initialization, which drives downstream weights negative in early training. This effect produces high activation sparsity (up to 90% in GPT-nano with ReLU) and is characterized empirically across MLP, ResNet, ViT, GPT-nano and MP-SENet architectures in 79 configurations. The authors further examine the sparsity-accuracy tradeoff, identify an accuracy cliff above ~70% sparsity, and propose clipped ReLU² and GELU² variants that improve the tradeoff while mitigating activation spikes.
Significance. If the central claim holds, the work supplies a clean theoretical account of an intrinsic optimization dynamic that explains a widespread empirical observation in ReLU-based networks. The parameter-free derivation from standard initialization and loss assumptions, the multi-architecture empirical support, and the public code release are strengths. The sparsity-accuracy characterization and the practical recommendation for clipped squared activations are directly useful to practitioners training transformers.
major comments (1)
- The proof establishes non-negative E[∂L/∂a] for positive pre-activations a at t=0 under zero-mean finite-variance initialization. The subsequent claim that this drives downstream weights negative 'during early training' requires that the initialization-time expectation is not immediately reversed by data-dependent terms or accumulated updates. No explicit bound or post-initialization simulation is provided showing the sign remains positive after even one epoch on real data; this step is load-bearing for the 'early training' part of the central claim.
minor comments (1)
- The description of the 79 configurations and the exact data-exclusion rules used in the sparsity-accuracy plots could be stated more explicitly to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recognition of the work's strengths, and the recommendation for minor revision. We address the major comment below.
- Referee: The proof establishes non-negative E[∂L/∂a] for positive pre-activations a at t=0 under zero-mean finite-variance initialization. The subsequent claim that this drives downstream weights negative 'during early training' requires that the initialization-time expectation is not immediately reversed by data-dependent terms or accumulated updates. No explicit bound or post-initialization simulation is provided showing the sign remains positive after even one epoch on real data; this step is load-bearing for the 'early training' part of the central claim.
Authors: We agree that the formal result is stated at initialization (t=0). The manuscript's 'early training' language is grounded in the rapid emergence of high activation sparsity (up to 90%) within the first few epochs across all 79 configurations and five architectures, as documented in Section 4 and the associated figures. To directly strengthen this link, the revised manuscript will add a new supplementary figure that tracks the sign and magnitude of the expected gradient w.r.t. positive pre-activations over the first 10 epochs on CIFAR-10 (ResNet) and WikiText (GPT-nano). These plots confirm that the non-negative bias persists in practice during early training before data-dependent terms dominate.
Revision: yes
Circularity Check
No circularity: gradient non-negativity derived from standard initialization and loss definitions
full rationale
The paper's central derivation establishes that E[∂L/∂a] ≥ 0 for positive pre-activations a under MSE or cross-entropy at random zero-mean finite-variance initialization. This follows directly from the definitions of the loss functions and the symmetry properties of the initial weight distribution without any fitted parameters, data-dependent terms, or self-referential predictions. No load-bearing step reduces to a self-citation chain, an ansatz smuggled from prior work, or a renaming of an empirical pattern. The claim that this drives negative weight drift in early training is presented as a consequence rather than a fitted or self-defined quantity. The derivation is therefore self-contained against external mathematical benchmarks and receives the default non-circularity assessment.
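The flavor of this derivation can be seen in a one-hidden-layer toy case. The following is an illustrative sketch under simplifying assumptions, not the paper's proof: it drops the layer index, uses a scalar output with MSE loss only, and treats the readout weights w_k as i.i.d. with zero mean and variance σ², independent of the pre-activations p, the inputs x, and the target t.

```latex
% Toy setting (assumption): h_k = \mathrm{ReLU}(p_k), \quad y = \sum_k w_k h_k, \quad
% \ell = \tfrac{1}{2}(y - t)^2, \quad \mathbb{E}[w_k] = 0, \quad \mathbb{E}[w_k w_i] = \sigma^2 \delta_{ki}.
\begin{align*}
\frac{\partial \ell}{\partial p_i}
  &= (y - t)\, w_i\, \mathbb{1}[p_i > 0], \\
\mathbb{E}_w\!\left[\frac{\partial \ell}{\partial p_i}\right]
  &= \mathbb{1}[p_i > 0]\Big(\sum_k h_k\, \mathbb{E}[w_k w_i] - t\, \mathbb{E}[w_i]\Big)
   = \sigma^2\, h_i\, \mathbb{1}[p_i > 0]
   = \sigma^2 \max(p_i, 0) \;\ge\; 0, \\
% Upstream weights: p_i = \sum_j v_{ij} x_j with non-negative inputs x_j \ge 0.
\mathbb{E}_w\!\left[\frac{\partial \ell}{\partial v_{ij}}\right]
  &= \mathbb{E}_w\!\left[\frac{\partial \ell}{\partial p_i}\right] x_j
   = \sigma^2 \max(p_i, 0)\, x_j \;\ge\; 0 .
\end{align*}
```

The inequality is strict when p_i > 0 (and x_j > 0 in the last line), so a gradient-descent step lowers v_{ij} in expectation. The target t drops out because the readout weights have zero mean, which is the sense in which the expectation does not depend on the data in this toy case.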
Axiom & Free-Parameter Ledger
free parameters (1)
- clipping threshold
axioms (2)
- standard math: Weights are initialized from a zero-mean distribution with finite variance
- domain assumption: Activation functions are positively biased (output non-negative for positive inputs)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 1.2 (MSE loss). … E[∂ℓ/∂p_i^{(l)}] ≥ 0 … with strict inequality whenever p_i^{(l)} > 0
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 1.1 … E[⟨v_i, v_j⟩] ≥ 0 … conditioned on ReLU survival
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.