Relu strikes back: Exploiting activation sparsity in large language models

Relu strikes back: Exploiting activation sparsity in large language models · 2023 · arXiv 2310.04564

6 Pith papers cite this work.

representative citing papers

Second-Order Path Kernel Interpolation Formulas in Machine Learning

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

Derives second-order path-kernel interpolation formulas for gradient descent, SGD, and momentum training, adding curvature terms and a concentration estimate around the expected prediction.

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

RT-Lynx shifts DiT sparsity from weights to activations and reports up to a 1.55x linear-layer speedup while preserving generation quality across multiple diffusion models.

SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks

cs.NE · 2025-06-04 · unverdicted · novelty 6.0

SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.

Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity

cs.LG · 2025-12-14 · unverdicted · novelty 5.0 · 2 refs

SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.

Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches

cs.LG · 2025-09-26 · unverdicted · novelty 5.0

Post-training N:M activation pruning preserves generative performance in LLMs better than equivalent weight pruning, with the 8:16 pattern emerging as a practical hardware-friendly choice.

PowLU: An Activation Function for Stable Pre-Training of LLMs

cs.CL · 2026-05-25 · unverdicted · novelty 4.0

PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.

citing papers explorer

Second-Order Path Kernel Interpolation Formulas in Machine Learning

cs.LG · 2026-06-05 · unverdicted · none · ref 162

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

cs.LG · 2026-05-26 · unverdicted · none · ref 44

SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks

cs.NE · 2025-06-04 · unverdicted · none · ref 3

Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity

cs.LG · 2025-12-14 · unverdicted · none · ref 15 · 2 links

Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches

cs.LG · 2025-09-26 · unverdicted · none · ref 17

PowLU: An Activation Function for Stable Pre-Training of LLMs

cs.CL · 2026-05-25 · unverdicted · none · ref 14
