Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches

Alexey Dontsov; Alina Kostromina; Egor Shvetsov; Ekaterina Galaeva; Evgeny Burnaev; Kristina Kazistova; Maxim Zhelnin; Redko Dmitry; Shirin Alanova; Vladimir Smirnov

arxiv: 2509.22166 · v4 · submitted 2025-09-26 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-18 13:32 UTC · model grok-4.3

keywords activation pruning · N:M sparsity · LLM inference · post-training sparsification · semi-structured pruning · generative capabilities · hardware acceleration · sparsity patterns

The pith

Pruning activations in LLMs preserves generative capabilities better than pruning weights at equivalent sparsity levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines post-training N:M sparsity applied to activations rather than weights in large language models. It finds that activation pruning keeps generative task performance higher than weight pruning when sparsity levels match. The authors test lightweight, plug-and-play pruning criteria and error mitigation steps that need only minimal calibration data. They compare sparsity patterns and identify the 8:16 pattern as a strong practical choice that balances performance, flexibility, and hardware feasibility while the 16:32 pattern approaches unstructured sparsity quality. The results supply ready-to-use methods and argue that future accelerators should support more flexible sparsity to exploit activation pruning.

Core claim

Across multiple LLMs, pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. Lightweight post-training methods with minimal calibration data deliver effective N:M activation pruning. The 16:32 pattern performs nearly as well as unstructured sparsity, yet the 8:16 pattern is recommended after weighing flexibility against hardware implementation complexity.

What carries the argument

Post-training N:M activation pruning with lightweight plug-and-play error mitigation techniques and pruning criteria that adapt to input while using little calibration data.
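As a rough illustration of what an N:M constraint means for activations, the sketch below keeps the N largest-magnitude entries in every group of M consecutive values and zeroes the rest. Magnitude ranking is assumed here only as a simple stand-in criterion; the function name `prune_n_m` and the sample vector are hypothetical and are not taken from the paper or its code release.

```python
from typing import List, Sequence


def prune_n_m(values: Sequence[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude entries in every group of m; zero the rest."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(values) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = [0.0] * len(values)
    for start in range(0, len(values), m):
        group = range(start, start + m)
        # Assumed criterion: rank positions in this group by absolute value.
        keep = sorted(group, key=lambda i: abs(values[i]), reverse=True)[:n]
        for i in keep:
            pruned[i] = float(values[i])
    return pruned


activations = [0.9, -0.1, 0.05, -1.2, 0.3, 0.0, -0.7, 0.2]
print(prune_n_m(activations, n=2, m=4))
# prints [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
```

Because the mask is recomputed from each input vector, the pruning is input-adaptive, which is the property that separates it from a weight mask fixed once after training.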

If this is right

Activation pruning can be applied after training with only minimal calibration data and still outperform weight pruning on generative tasks.
The 16:32 sparsity pattern achieves performance nearly on par with unstructured sparsity.
The 8:16 sparsity pattern offers a favorable trade-off between flexibility and hardware complexity.
Hardware designs should incorporate support for flexible sparsity patterns beyond the standard 2:4 to enable efficient activation pruning.
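The flexibility versus hardware-complexity trade-off in the points above can be made concrete with a small counting exercise, which is an illustration added here and not a calculation from the paper. For a span of 32 elements at 50% sparsity, the sketch counts how many distinct masks each pattern permits and gives an information-theoretic lower bound on the mask metadata per element. The helper names are hypothetical.

```python
from math import comb, log2


def masks_per_span(n: int, m: int, span: int = 32) -> int:
    """Number of distinct keep/drop masks an N:M pattern allows over `span` elements."""
    if span % m != 0:
        raise ValueError("span must be a multiple of m")
    return comb(m, n) ** (span // m)


def min_index_bits_per_element(n: int, m: int) -> float:
    """Lower bound on mask metadata, in bits per element, for one N:M group."""
    return log2(comb(m, n)) / m


for n, m in [(2, 4), (4, 8), (8, 16), (16, 32)]:
    print(f"{n}:{m}", masks_per_span(n, m), round(min_index_bits_per_element(n, m), 3))
```

Over 32 elements, 2:4 allows 1,679,616 masks, 8:16 allows 165,636,900, and 16:32 allows 601,080,390. Larger groups give the pruning criterion more freedom, while the selection logic and metadata per group grow with M.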

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Accelerator architects could add native support for 8:16 or similar patterns to reduce I/O and compute costs during inference.
The same lightweight methods might combine with weight pruning to reach higher overall sparsity without further quality loss.
Testing these activation pruning steps on non-generative tasks such as classification or retrieval could reveal broader applicability.

Load-bearing premise

Lightweight error mitigation techniques and pruning criteria need only minimal calibration data yet still maintain their performance advantage on held-out generative tasks.

What would settle it

A test on additional LLMs or harder generative benchmarks where activation pruning at a given sparsity level degrades performance as much as or more than weight pruning at the same sparsity.

Figures

Figures reproduced from arXiv: 2509.22166 by Alexey Dontsov, Alina Kostromina, Egor Shvetsov, Ekaterina Galaeva, Evgeny Burnaev, Kristina Kazistova, Maxim Zhelnin, Redko Dmitry, Shirin Alanova, Vladimir Smirnov.

**Figure 1.** Comparison of unstructured sparsity in activations (ACT) and weights (WT), averaged across four datasets at varying sparsity ratios. Higher is better. More detailed results are presented in the appendix.

Original abstract

The demand for efficient large language model (LLM) inference has intensified the focus on sparsification techniques. While semi-structured (N:M) pruning is well-established for weights, its application to activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead. This work presents a comprehensive analysis of methods for post-training N:M activation pruning in LLMs. Across multiple LLMs, we demonstrate that pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria, establishing strong hardware-friendly baselines that require minimal calibration. Furthermore, we explore sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured sparsity. However, considering the trade-off between flexibility and hardware implementation complexity, we focus on the 8:16 pattern as a superior candidate. Our findings provide both effective practical methods for activation pruning and a motivation for future hardware to support more flexible sparsity patterns. Our code is available at https://anonymous.4open.science/r/Structured-Sparse-Activations-Inference-EC3C/README.md.
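One way to build intuition for why larger groups approach unstructured sparsity is to measure how much total magnitude survives under each pattern at the same 50% ratio. The sketch below does this on synthetic Gaussian values, which are an assumption standing in for real activations; it says nothing about the accuracy numbers reported in the paper. Every 2:4 mask is also a valid 4:8 mask, and so on up the chain, so the retained fraction cannot decrease as the group grows.

```python
import random


def kept_magnitude_n_m(values, n, m):
    """Sum of |x| kept when the n largest-magnitude entries survive in each group of m."""
    total = 0.0
    for start in range(0, len(values), m):
        group = sorted((abs(v) for v in values[start:start + m]), reverse=True)
        total += sum(group[:n])
    return total


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in for activations
full = sum(abs(v) for v in x)

retained = {f"{m // 2}:{m}": kept_magnitude_n_m(x, m // 2, m) / full for m in (4, 8, 16, 32)}
# Unstructured 50% sparsity is the limiting case: one group spanning the whole vector.
retained["unstructured"] = kept_magnitude_n_m(x, len(x) // 2, len(x)) / full

for name, frac in retained.items():
    print(f"{name:>12}: {frac:.3f}")
```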

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript benchmarks lightweight post-training methods for semi-structured (N:M) activation pruning in LLMs. Its central claim is that activation pruning, using plug-and-play error mitigation and pruning criteria with minimal calibration, preserves generative capabilities better than weight pruning at equivalent sparsity levels across multiple models. It further shows that the 16:32 pattern nearly matches unstructured sparsity performance, recommends the 8:16 pattern for its flexibility-complexity trade-off, and argues for hardware support of more flexible sparsity patterns beyond 2:4.

Significance. If the empirical comparisons hold under stricter validation, the work supplies practical baselines for activation sparsity and concrete motivation for next-generation accelerators to implement flexible (N:M) patterns. The public code release supports reproducibility.

major comments (1)

[Experimental protocol and results sections] The central claim that activation pruning with minimal-calibration criteria outperforms weight pruning on held-out generative tasks rests on the unverified assumption that the chosen calibration data is small yet distributionally representative. No quantitative bound on calibration-set size or explicit verification that evaluation prompts are disjoint in content and style from calibration data is provided; this directly affects whether the reported advantage is robust or an artifact of overlap.

minor comments (2)

[Abstract] The abstract states 'minimal calibration' without defining the term quantitatively (e.g., number of tokens or examples); adding a precise figure would improve clarity.
[Figures and tables] Ensure all figures reporting perplexity or generation metrics include error bars or multiple random seeds to allow assessment of variability.
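The variability request in the last minor comment could be met with a summary like the sketch below, which reports the mean and sample standard deviation over per-seed scores. The configuration names and numbers are placeholders for illustration only; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for one configuration's per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)


# Placeholder numbers for illustration only; they are not results from the paper.
runs = {"config_a": [0.70, 0.72, 0.71], "config_b": [0.64, 0.69, 0.65]}
for name, scores in runs.items():
    mu, sd = summarize(scores)
    print(f"{name}: {mu:.3f} ± {sd:.3f}")
```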

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully reviewed the major comment and provide a point-by-point response below. We agree that additional details on the experimental protocol will strengthen the presentation and have revised the manuscript accordingly.

Referee: [Experimental protocol and results sections] The central claim that activation pruning with minimal-calibration criteria outperforms weight pruning on held-out generative tasks rests on the unverified assumption that the chosen calibration data is small yet distributionally representative. No quantitative bound on calibration-set size or explicit verification that evaluation prompts are disjoint in content and style from calibration data is provided; this directly affects whether the reported advantage is robust or an artifact of overlap.

Authors: We appreciate the referee's emphasis on the need for explicit verification of calibration data properties to support the robustness of our central claim. In the revised manuscript, we have added a quantitative description of the calibration-set size in the Experimental Setup section, along with an explicit statement confirming that the evaluation prompts (drawn from standard generative benchmarks) are disjoint in content and style from the calibration data. These clarifications address the concern directly and confirm that the observed advantages of activation pruning are not artifacts of data overlap.

Revision: yes
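A disjointness claim of this kind can be checked mechanically. The sketch below is one hypothetical way to do it: it reports the fraction of evaluation texts that share at least one word-level n-gram with the calibration set. The function names, the whitespace tokenization, and the default n-gram length are all assumptions, and the check covers verbatim content overlap only, not similarity of style.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(calibration_texts, eval_texts, n=8):
    """Fraction of evaluation texts sharing at least one n-gram with the calibration set."""
    calib = set()
    for text in calibration_texts:
        calib |= ngrams(text, n)
    if not eval_texts:
        return 0.0
    hits = sum(1 for text in eval_texts if ngrams(text, n) & calib)
    return hits / len(eval_texts)


# Toy strings for illustration; real use would pass the actual calibration and evaluation sets.
calibration = ["the cat sat on the mat today"]
evaluation = ["yesterday the cat sat on a chair", "completely different words appear here now"]
print(overlap_fraction(calibration, evaluation, n=3))  # prints 0.5
```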

Circularity Check

0 steps flagged

No circularity: empirical benchmarking of activation pruning methods

The paper conducts an empirical benchmarking study of post-training N:M activation sparsification techniques across LLMs, comparing activation pruning to weight pruning at equivalent sparsity levels. Central claims rest on experimental results using lightweight plug-and-play error mitigation and pruning criteria evaluated on held-out generative tasks. No equations, derivations, or fitted parameters are presented that reduce reported performance metrics to quantities defined or computed inside the same study by construction. The work is self-contained against external benchmarks and standard evaluation protocols, with no load-bearing self-citations or self-definitional steps in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard definitions of N:M sparsity and existing LLM evaluation protocols; no new free parameters, axioms, or invented entities are introduced.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria... focus on the 8:16 pattern as a superior candidate."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "semi-structured (N:M) pruning... 2:4, 4:8, 8:16, 16:32"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 8 internal anchors

[1] Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, and Gongyi Wang. Amber Pruner: Leveraging N:M activation sparsity for efficient prefill in large language models. arXiv preprint arXiv:2508.02128, 2025a. Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, and Gongyi Wang. Amber ...
[2] Vui Seng Chua, Yujie Pan, and Nilesh Jain. Post-training statistical calibration for higher activation sparsity. arXiv preprint arXiv:2412.07174.
[3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.
[4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[5] Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118.
[6] Ege Erdil. Inference economics of language models. arXiv preprint arXiv:2506.04645.
[7] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
[8] URL https://zenodo.org/records/10256836. Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. URL https://arxiv.org/abs/1506.02626.
[9] Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, and Jesse Cai. Accelerating transformer inference and training with 2:4 activation sparsity. arXiv preprint arXiv:2503.16672.
[10] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
[11] Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, and Jun Zhu. Accelerating transformer pre-training with 2:4 sparsity. arXiv preprint arXiv:2404.01847.
[12] URL https://arxiv.org/abs/2504.18959. Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, and Evgeny Burnaev. Investigating the impact of quantization methods on the safety and reliability of large language models. arXiv preprint arXiv:2502.15799.
[13] Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, and Azalia Mirhoseini. CATS: Contextually-aware thresholding for sparsity in large language models, 2024. URL https://arxiv.org/abs/2404.08763.
[14] James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, and Ben Athiwaratkun. Training-free activation sparsity in large language models. arXiv preprint arXiv:2408.14690.
[15] Egor Maximov, Yulia Kuzkina, Azamat Kanametov, Alexander Prutko, Aleksei Goncharov, Maxim Zhelnin, and Egor Shvetsov. From 2:4 to 8:16 sparsity patterns in LLMs for outliers and weights with variance correction. arXiv preprint arXiv:2507.03052.
[16] Zhendong Mi, Zhenglun Kong, Geng Yuan, and Shaoyi Huang. ACE: Exploring activation cosine similarity and variance for accurate and calibration-efficient LLM pruning. arXiv preprint arXiv:2505.21987.
[17] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. ReLU strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.
[18] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
[19] Mansheej Paul, Feng Chen, Brett W Larsen, Jonathan Frankle, Surya Ganguli, and Gintare Karolina Dziugaite. Unmasking the lottery ticket hypothesis: What's encoded in a winning ticket's mask? arXiv preprint arXiv:2210.03044.
[20] Susav Shrestha, Brad Settlemyer, Nikoli Dryden, and Narasimha Reddy. Polar sparsity: High throughput batched LLM inferencing with scalable contextual sparsity. arXiv preprint arXiv:2505.14884.
[21] doi: 10.1109/ACCESS.2024.3446039. Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. PowerInfer: Fast large language model serving with a consumer-grade GPU. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 590–606, 2024a. Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, and Haibo Chen. Turbo sparse: Achi...
[22] Hongyu Wang, Shuming Ma, Ruiping Wang, and Furu Wei. Q-Sparse: All large language models can be fully sparsely-activated. arXiv preprint arXiv:2407.10969.
[23] Maxim Zhelnin, Viktor Moskvoretskii, Egor Shvetsov, Egor Venediktov, Mariya Krylova, Aleksandr Zuev, and Evgeny Burnaev. GIFT-SW: Gaussian noise injected fine-tuning of salient weights for LLMs. arXiv preprint arXiv:2408.15300.
[24] URL https://arxiv.org/abs/2311.07911 (Instruction-Following Evaluation for Large Language Models). Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064.
[25] Appendix A: R-Sparse details. Finally, we include R-Sparse (Kamirul et al., 2025), which combines activation sparsity with a low-rank approximation of the weight matrix. Instead of pruning solely by magnitude, R-Sparse decomposes the computation into two parts: (i) sparse channels with high-magnitude activations, and (ii) a low-rank component...
[26] Average Drop (↓) by method and model:

| Method | Llama2-7B | Qwen2.5-7B | Gemma3-4B | Llama3-8B | Average Drop (↓) |
|---|---|---|---|---|---|
| CLACT + PTS | 5.63% | −5.06% | 0.50% | 8.55% | 2.40% |
| CLACT + VAR | 5.07% | −2.90% | 0.54% | 8.59% | 2.82% |
| Amber-Pruner + PTS | 6.16% | −3.47% | 0.17% | 7.42% | 2.57% |
| Amber-Pruner + VAR | 4.74% | −3.63% | −0.16% | 8.39% | 2.34% |
| L-PTS + VAR | 6.87% | 2.86% | 3.41% | 7.15% | 5.07% |

Appendix C: Datasets. Table 8: Datasets used to evaluate hypotheses. Pro... Column headers: Dataset, Description, Metric. First row: WikiText-2 (Merity et al.,
[27] Open-book question answering dataset requiring retrieval of elementary science facts. Contains 5957 4-way multiple-choice questions. Metric: Accuracy. RTE (Dagan et al., 2005; Bar-Haim et al.,
[28] Benchmark with 541 prompts containing verifiable instructions to measure instruction-following fidelity. Metrics: Accuracy (Prompt-level), Accuracy (Instruct-level). Appendix D: Weights versus activations. Table 9: The performance of models with applied unstructured activation pruning. We show that even with severe sparsity (70-90%) models were able to perform...
[29] Table 10: Semi-structured 2:4 sparsification, performance metrics. For calibration, when it is required, and for perplexity we use WikiText2. Average Drop is computed without accounting for perplexity. Columns: Pruning, WikiText2 (↓), ARC Easy, BoolQ, PIQA, WinoGrande, Average Drop %. Llama2-7B: 6.94, 0.74, 0.80, 0.76, 0.66, -. ACT: 10.23, 0.66, 0.71, 0.71, 0.60, 9.43%. WT: 42.400...
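The Average Drop figure in the Table 10 fragment can be reproduced from the visible accuracy values under one assumption: that it is the mean of the per-task relative drops, with perplexity excluded as the caption says. The fragment does not state the formula, so the function below is a guess that happens to match the ACT row, not the authors' documented definition.

```python
def average_relative_drop(baseline, pruned):
    """Mean of per-task relative drops, as a percentage (assumed definition)."""
    drops = [(b - p) / b for b, p in zip(baseline, pruned)]
    return 100.0 * sum(drops) / len(drops)


# Accuracy values (ARC Easy, BoolQ, PIQA, WinoGrande) from the Table 10 fragment.
dense = [0.74, 0.80, 0.76, 0.66]  # Llama2-7B, unpruned
act = [0.66, 0.71, 0.71, 0.60]    # 2:4 activation pruning (ACT)
print(f"{average_relative_drop(dense, act):.2f}%")  # prints 9.43%
```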
