Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

Aleksandr Serkov; Egor Shvetsov; Evgeny Burnaev; Redko Dmitry; Shokorov Viacheslav; Vladislav Goloshchapov

arxiv: 2605.17659 · v2 · pith:EYKHEQCAnew · submitted 2026-05-17 · 💻 cs.LG

Pith reviewed 2026-05-22 09:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords weight drift · activation sparsity · ReLU · neural training dynamics · gradient expectation · transformer sparsity · squared activations

The pith

Under MSE or cross-entropy loss, gradients on positive pre-activations are non-negative in expectation at initialization, driving downstream weights negative.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common loss functions interact with positively biased activations to create a negative drift in weights during early training. The expected gradient with respect to positive pre-activations is non-negative at initialization, so the connected weights are pushed downward regardless of the data. Across architectures such as transformers and ResNets, this drift combined with ReLU produces high activation sparsity, reaching up to 90 percent in GPT-nano. The authors map the sparsity-accuracy trade-off across 79 configurations, find a sharp drop in accuracy above roughly 70 percent sparsity, and test squared activations, where clipping removes the activation spikes that unclipped ReLU² amplifies.
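How a negative shift in the weight mean turns into ReLU sparsity can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experiment: the function `relu_sparsity`, the layer sizes, and the choice of |N(0, 1)| inputs (standing in for non-negative outputs of a previous activation) are all assumptions made for illustration.

```python
import random


def relu_sparsity(weight_mean, n_inputs=64, n_units=200, n_samples=50, seed=0):
    """Return the fraction of ReLU outputs that are zero for one random layer.

    Inputs are |N(0, 1)| draws, a stand-in for non-negative outputs of a
    previous activation. Weights are N(weight_mean, 1 / n_inputs). All sizes
    are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    std = (1.0 / n_inputs) ** 0.5
    weights = [[rng.gauss(weight_mean, std) for _ in range(n_inputs)]
               for _ in range(n_units)]
    zeros = 0
    for _ in range(n_samples):
        x = [abs(rng.gauss(0.0, 1.0)) for _ in range(n_inputs)]
        for w in weights:
            pre_activation = sum(wi * xi for wi, xi in zip(w, x))
            if pre_activation <= 0.0:  # ReLU outputs exactly zero here
                zeros += 1
    return zeros / (n_units * n_samples)


if __name__ == "__main__":
    for mean in (0.0, -0.02, -0.05):
        print(f"weight mean {mean:+.2f}: sparsity {relu_sparsity(mean):.3f}")
```

With zero-mean weights the sparsity sits near one half by symmetry. Because the inputs are non-negative, even a small negative weight mean shifts every pre-activation downward, so the zero fraction rises quickly.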

Core claim

We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures and asymmetric activation functions. Coupled with ReLU, it produces activation sparsity reaching up to 90% in GPT-nano, with a sharp accuracy cliff above ~70% activation sparsity. Clipping squared activations resolves the activation spikes that ReLU² amplifies while preserving the representational benefits of squaring.

What carries the argument

The non-negative expected gradient with respect to positive pre-activations under standard losses, which induces negative weight drift.
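A one-layer special case shows why the sign comes out non-negative. The setup below is an illustrative assumption, not the paper's proof: activations $a \in \mathbb{R}^n$ feed a linear readout $W \in \mathbb{R}^{m \times n}$ whose entries are i.i.d. with zero mean and variance $\sigma^2$, drawn independently of $a$ and of the target $y$, and the loss is MSE. The symbols $a$, $W$, $y$, $m$, $n$ and $\sigma$ are introduced here for the sketch.

```latex
\ell = \tfrac{1}{2}\,\lVert W a - y \rVert^2,
\qquad
\frac{\partial \ell}{\partial a_i}
  = \sum_{j=1}^{m} W_{ji}\Big(\sum_{k=1}^{n} W_{jk}\, a_k - y_j\Big)

% Zero mean and independence give E[W_{ji}] = 0 and
% E[W_{ji} W_{jk}] = \sigma^2 if k = i, and 0 otherwise.
\mathbb{E}_W\!\left[\frac{\partial \ell}{\partial a_i}\right]
  = \sum_{j=1}^{m}\sum_{k=1}^{n} \mathbb{E}[W_{ji} W_{jk}]\, a_k
    - \sum_{j=1}^{m} \mathbb{E}[W_{ji}]\, y_j
  = m\,\sigma^2 a_i \;\ge\; 0
  \quad \text{whenever } a_i \ge 0
```

With ReLU, $a_i = z_i$ wherever the pre-activation $z_i$ is positive, so in this toy case the same expectation applies to the gradient with respect to positive pre-activations and is strictly positive when $z_i > 0$. The target term drops out entirely, which is consistent with the claim that the effect does not depend on the data.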

Load-bearing premise

The non-negative expected gradient at initialization continues to dominate the early training dynamics without being overridden by data or later effects.

What would settle it

Measuring the sign of gradients for positive pre-activations in a randomly initialized network trained on a simple dataset and checking if they are non-negative on average.
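A check of this kind can be sketched at toy scale. The code below is a hedged illustration under stated assumptions, not the paper's protocol: it fixes one set of pre-activations and targets, resamples a zero-mean Gaussian readout matrix many times, and averages the MSE gradient with respect to each positive pre-activation. The function name and all sizes are hypothetical.

```python
import random


def mean_grad_wrt_preact(n_hidden=8, n_out=4, n_draws=4000, seed=0):
    """Monte Carlo estimate of E_W[dL/dz_i] for a ReLU layer under MSE.

    Only the readout weights W are resampled; pre-activations z and targets y
    stay fixed. For this toy setup the exact expectation is
    n_out * sigma^2 * z_i for z_i > 0, with sigma^2 = 1 / n_hidden.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    a = [max(0.0, zi) for zi in z]  # ReLU activations
    std = (1.0 / n_hidden) ** 0.5
    grad_sum = [0.0] * n_hidden
    for _ in range(n_draws):
        W = [[rng.gauss(0.0, std) for _ in range(n_hidden)]
             for _ in range(n_out)]
        resid = [sum(W[j][k] * a[k] for k in range(n_hidden)) - y[j]
                 for j in range(n_out)]
        for i in range(n_hidden):
            if z[i] > 0.0:  # ReLU passes gradient only where z_i > 0
                grad_sum[i] += sum(W[j][i] * resid[j] for j in range(n_out))
    return z, [g / n_draws for g in grad_sum]


if __name__ == "__main__":
    z, g = mean_grad_wrt_preact()
    for zi, gi in zip(z, g):
        if zi > 0.0:
            print(f"z = {zi:+.3f}   mean dL/dz = {gi:+.3f}")
```

Units with very small positive pre-activation have an expectation close to zero, so a finite number of draws can still show a slightly negative average there. The claim under test concerns the expectation, not every single draw.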

Figures

Figures reproduced from arXiv: 2605.17659 by Aleksandr Serkov, Egor Shvetsov, Evgeny Burnaev, Redko Dmitry, Shokorov Viacheslav, Vladislav Goloshchapov.

**Figure 1.** Weight drift measured as average absolute Z-score per layer over the first 100 training steps for an MLP trained on CIFAR-10. Momentum accelerates initial weight changes and leads to rapid convergence toward asymptotic drift levels, while plain SGD exhibits slower progression. Results are in log scale.

**Figure 2.** Random inputs with MSE loss.

**Figure 3.** CIFAR-10 with cross-entropy loss.

**Figure 5.** Weight drift in MP-SENet. Training dynamics under GELU, ReLU, and SiLU with the AdamW optimizer (lr = 5 × 10⁻⁴), averaged across all model layers. Drift patterns are qualitatively consistent with those observed in the MLP and ResNet settings, with a sharp weight change again emerging at very early iterations. The covariance term is an order of magnitude smaller than the gradient value.

**Figure 6.** Scaled performance versus post-activation sparsity. The dashed curve denotes the fitted power-law decay model, with statistics confirming that sparsity level is the dominant predictor of accuracy (R² = 0.565), while the choice of mechanism (TOP-K vs. PC) contributes only a marginal incremental gain (ΔR² = +0.053).

**Figure 7.** Input ranges into up- and down-projections across layer indices, aggregated over 21 runs for different activation functions, normalization and sparsification strategies. Runs with non-clipped squared functions are excluded.

**Figure 8.** Weight drift before and after Accumulation Stop. Average weight drift (mean |Z-score|) for GELU, ReLU, and TopKSparseGELU-50 under Standard, PercentileBN, and PercentileLN normalization strategies. The dashed vertical line marks the AS boundary at step 100. Solid lines: without AS. Dashed lines: with frozen EMA statistics. Trajectories remain continuous and stable after the boundary across all configurati…

**Figure 9.** Effect of Top-K sparsity on weight drift. A three-layer MLP with GELU activations is trained on random {X, Y} pairs sampled from N(0, 1) under four Top-K retention levels (K ∈ {0.10, 0.25, 0.75, 0.90}). Top row: gradient mean per layer. Bottom row: weight mean per layer. Trajectories are averaged across 20 runs (hidden dimension 128, SGD with lr = 0.001). As K increases, gradient bias and weight drift gro…

**Figure 10.** Weight drift in ResNet-18. Training dynamics under GELU, ReLU, and SiLU with Adam optimizer (lr = 10⁻³). Four metrics are tracked: (1) average gradient mean, (2) gradient norm, (3) weight mean, and (4) covariance term Cov(∂ℓ/∂p, x).

**Figure 11.** Training dynamics of an MLP without (top) and with (bottom) skip connection (see Appendix G.2 for architecture details), under GELU, NoisyReLU, ReLU, SiLU, and SUGARBSiLU with Adam optimizer (lr = 10⁻³). Four metrics are tracked over 1000 training steps: (1) average gradient magnitude, (2) gradient standard deviation, (3) weight drift (Z-score), and (4) fraction of sparse activations. In the plain MLP, Re…

**Figure 14.** Weight standard deviation averaged across sublayers per layer index, aggregated over 23 runs for different activation functions, normalization and sparsification strategies. All runs show a consistent monotonic increase. Squared activation (orange) exhibits slightly elevated values in later layers.

**Figure 13.** GPT-nano performance with different activation functions, sparsification and normalization approaches. Transparent bars reflect train loss and solid lines reflect test loss.

**Figure 12.** DiT-S/2 generated samples across configurations at 50K, 100K, and 300K training steps. Columns correspond to GELU + LayerNorm, GELU + Percentile LayerNorm, ReLU + LayerNorm, ReLU + Percentile LayerNorm, and Top-K GELU + LayerNorm. All configurations produce coherent compositions by 50K steps. High-frequency differences become apparent at later checkpoints: the baseline ReLU configuration produces blurred …

read the original abstract

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU² achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU² outperforms its unclipped version, and GELU² achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Negative weight drift from init-time non-negative gradients drives ReLU sparsity, with clipped squared activations fixing spikes and improving small transformer loss.

read the letter

The main point is that under MSE or cross-entropy the expected gradient for positive pre-activations is non-negative at standard zero-mean initialization, which pushes downstream weights negative and produces high activation sparsity. The paper proves this expectation and shows it leads to up to 90% sparsity in GPT-nano with ReLU, while mapping a clear accuracy cliff above roughly 70% sparsity across 79 configurations. The authors also test squared variants and find that clipping ReLU² removes pathological spikes in transformer layers, while GELU² reaches the lowest validation loss on their nano model. The derivation uses only the loss definitions and init assumptions, and the multi-architecture tests (MLP through ViT and GPT) make the intrinsic nature of the drift plausible. The code release lets others check the exact setups. The experiments give a practical picture of the sparsity-accuracy tradeoff that was not previously quantified this way. The soft spot is the step from the t=0 proof to early-training drift: the paper does not include a direct simulation or bound showing that the expectation stays non-negative after the first epoch, once data statistics enter. That link is asserted rather than measured, so the persistence claim rests on an untested assumption. Readers working on training dynamics or activation choices will find the mechanistic account and the new variants useful. The work is grounded enough and the experiments broad enough that it deserves a serious referee rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The paper claims that standard losses (MSE or cross-entropy) induce a negative weight drift when combined with positively biased activations (ReLU, GELU, SiLU). It proves that the expected gradient w.r.t. positive pre-activations is non-negative at random zero-mean initialization, which drives downstream weights negative in early training. This effect produces high activation sparsity (up to 90% in GPT-nano with ReLU) and is characterized empirically across MLP, ResNet, ViT, GPT-nano and MP-SENet architectures in 79 configurations. The authors further examine the sparsity-accuracy tradeoff, identify an accuracy cliff above ~70% sparsity, and propose clipped ReLU² and GELU² variants that improve the tradeoff while mitigating activation spikes.

Significance. If the central claim holds, the work supplies a clean theoretical account of an intrinsic optimization dynamic that explains a widespread empirical observation in ReLU-based networks. The parameter-free derivation from standard initialization and loss assumptions, the multi-architecture empirical support, and the public code release are strengths. The sparsity-accuracy characterization and the practical recommendation for clipped squared activations are directly useful to practitioners training transformers.

major comments (1)

The proof establishes non-negative E[∂L/∂a] for positive pre-activations a at t=0 under zero-mean finite-variance initialization. The subsequent claim that this drives downstream weights negative 'during early training' requires that the initialization-time expectation is not immediately reversed by data-dependent terms or accumulated updates. No explicit bound or post-initialization simulation is provided showing the sign remains positive after even one epoch on real data; this step is load-bearing for the 'early training' part of the central claim.

minor comments (1)

The description of the 79 configurations and the exact data-exclusion rules used in the sparsity-accuracy plots could be stated more explicitly to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, the recognition of the work's strengths, and the recommendation for minor revision. We address the major comment below.

read point-by-point responses

Referee: The proof establishes non-negative E[∂L/∂a] for positive pre-activations a at t=0 under zero-mean finite-variance initialization. The subsequent claim that this drives downstream weights negative 'during early training' requires that the initialization-time expectation is not immediately reversed by data-dependent terms or accumulated updates. No explicit bound or post-initialization simulation is provided showing the sign remains positive after even one epoch on real data; this step is load-bearing for the 'early training' part of the central claim.

Authors: We agree that the formal result is stated at initialization (t=0). The manuscript's 'early training' language is grounded in the rapid emergence of high activation sparsity (up to 90%) within the first few epochs across all 79 configurations and five architectures, as documented in Section 4 and the associated figures. To directly strengthen this link, the revised manuscript will add a new supplementary figure that tracks the sign and magnitude of the expected gradient w.r.t. positive pre-activations over the first 10 epochs on CIFAR-10 (ResNet) and WikiText (GPT-nano). These plots confirm that the non-negative bias persists in practice during early training before data-dependent terms dominate. revision: yes
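The measurement described in this response can be prototyped on a toy problem. The sketch below is not the promised supplementary figure and uses neither CIFAR-10 nor WikiText: it assumes a two-layer ReLU MLP, random Gaussian inputs and targets, MSE loss, and full-batch gradient descent, and it records at every step the mean gradient with respect to positive pre-activations. The function name, sizes, and learning rate are hypothetical.

```python
import random


def track_gradient_sign(n_steps=20, n_in=8, n_hidden=16, n_out=4,
                        batch=16, lr=0.01, seed=0):
    """Full-batch gradient descent on a toy ReLU MLP with MSE loss.

    Returns (signal, losses): per step, the mean of dL/dz over hidden units
    with positive pre-activation z, and the mean loss before the update.
    """
    rng = random.Random(seed)

    def init(rows, cols):
        std = (1.0 / cols) ** 0.5  # zero-mean Gaussian initialization
        return [[rng.gauss(0.0, std) for _ in range(cols)]
                for _ in range(rows)]

    W1, W2 = init(n_hidden, n_in), init(n_out, n_hidden)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(batch)]
    Y = [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(batch)]

    signal, losses = [], []
    for _ in range(n_steps):
        g1 = [[0.0] * n_in for _ in range(n_hidden)]
        g2 = [[0.0] * n_hidden for _ in range(n_out)]
        grad_sum, grad_count, loss = 0.0, 0, 0.0
        for x, y in zip(X, Y):
            z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
            a = [max(0.0, zi) for zi in z]
            r = [sum(w * ai for w, ai in zip(row, a)) - yj
                 for row, yj in zip(W2, y)]
            loss += 0.5 * sum(rj * rj for rj in r) / batch
            for i in range(n_hidden):
                if z[i] > 0.0:
                    # For ReLU with z_i > 0, dL/dz_i equals dL/da_i.
                    gz = sum(W2[j][i] * r[j] for j in range(n_out))
                    grad_sum += gz
                    grad_count += 1
                    for k in range(n_in):
                        g1[i][k] += gz * x[k] / batch
            for j in range(n_out):
                for i in range(n_hidden):
                    g2[j][i] += r[j] * a[i] / batch
        signal.append(grad_sum / max(grad_count, 1))
        losses.append(loss)
        # Plain gradient-descent update, applied after all gradients are known.
        for i in range(n_hidden):
            for k in range(n_in):
                W1[i][k] -= lr * g1[i][k]
        for j in range(n_out):
            for i in range(n_hidden):
                W2[j][i] -= lr * g2[j][i]
    return signal, losses
```

A single seed gives one noisy trajectory, so any statement about the expected sign would need the signal averaged over many seeds, and the toy data says nothing about the real-data regime the referee asks about.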

Circularity Check

0 steps flagged

No circularity: gradient non-negativity derived from standard initialization and loss definitions

full rationale

The paper's central derivation establishes that E[∂L/∂a] ≥ 0 for positive pre-activations a under MSE or cross-entropy at random zero-mean finite-variance initialization. This follows directly from the definitions of the loss functions and the symmetry properties of the initial weight distribution without any fitted parameters, data-dependent terms, or self-referential predictions. No load-bearing step reduces to a self-citation chain, an ansatz smuggled from prior work, or a renaming of an empirical pattern. The claim that this drives negative weight drift in early training is presented as a consequence rather than a fitted or self-defined quantity. The derivation is therefore self-contained against external mathematical benchmarks and receives the default non-circularity assessment.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard mathematical assumptions about random initialization and loss functions, plus a small number of experimental hyperparameters such as clipping thresholds; no new entities are postulated.

free parameters (1)

clipping threshold
Hyperparameter selected to suppress activation spikes while retaining sparsity benefits of squared activations.

axioms (2)

standard math: Weights initialized from a zero-mean distribution with finite variance
Invoked to compute the expected non-negative gradient at initialization.
domain assumption: Activation functions are positively biased (output non-negative for positive inputs)
Required for the interaction between loss gradients and pre-activations to produce the drift.
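The clipping threshold listed as the free parameter can be made concrete with a small sketch. The form below is an assumption: it caps the output after squaring, and the default `clip=4.0` is a hypothetical value chosen only for illustration. The paper's actual clipping rule and threshold are not stated on this page.

```python
def relu2(x):
    """Squared ReLU: max(0, x) ** 2."""
    r = max(0.0, x)
    return r * r


def clipped_relu2(x, clip=4.0):
    """Squared ReLU with the output capped at `clip`.

    Clipping after squaring and the default threshold are illustrative
    assumptions, not the paper's stated definition.
    """
    return min(relu2(x), clip)


if __name__ == "__main__":
    for x in (-1.0, 1.5, 10.0):
        print(x, relu2(x), clipped_relu2(x))
```

Squaring turns an input of 10 into 100, which is how an existing spike gets amplified. The cap bounds the output while leaving small and moderate inputs unchanged.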

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Theorem 1.2 (MSE loss). … E[∂ℓ/∂p^(l)_i] ≥ 0 … with strict inequality whenever p^(l)_i > 0

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: Theorem 1.1 … E[⟨v_i, v_j⟩] ≥ 0 … conditioned on ReLU survival

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
