MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

Dacheng Tao; Li Shen; Qixin Zhang; Xikun Zhang; Yan Sun; Zhiyuan Yu

arxiv: 2506.12876 · v2 · pith:YI5Z7QZ3new · submitted 2025-06-15 · 💻 cs.LG

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

Yan Sun , Qixin Zhang , Zhiyuan Yu , Xikun Zhang , Li Shen , Dacheng Tao This is my paper

Pith reviewed 2026-05-19 08:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords (N:M)-sparsitylarge language modelsprobabilistic learningsparsity masksinference efficiencypolicy gradientslinear spacemask generation

0 comments

The pith

MaskPro learns a categorical distribution over groups of M weights to sample strict (N:M) sparsity masks for LLMs in linear space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a probabilistic approach can overcome the accuracy limits of greedy rule-based sparsity and the high costs of combinatorial gradient methods for strict (N:M) patterns in large language models. It does so by learning a prior categorical distribution for every M consecutive weights and then drawing an N-element subset without replacement to form the mask. Training uses policy gradients updated through a moving average of loss residuals rather than raw loss to control variance in the large combinatorial space. If this holds, models could run with hardware-friendly sparsity while using far less memory than prior mask-learning techniques and showing stability across different data samples. The authors back the result with theoretical analysis and experiments on LLMs.

Core claim

By learning a categorical distribution over each group of M weights and generating the (N:M) mask through N-way sampling without replacement, with policy-gradient updates stabilized by a moving average of loss residuals, MaskPro produces strict (N:M)-sparse LLMs that outperform both rule-based greedy search and prior gradient-driven combinatorial methods while maintaining only linear space complexity.

What carries the argument

A learned categorical distribution over each group of M consecutive weights, updated by policy gradients stabilized with a moving average tracker of loss residuals and then used for N-way sampling without replacement to select the sparsity mask.

Load-bearing premise

A learned categorical distribution over each group of M weights, updated via policy gradients stabilized only by a moving average of loss residuals, can reliably identify near-optimal sparsity masks without prohibitive variance or cost.

What would settle it

Running MaskPro on a standard LLM benchmark and finding either super-linear memory growth or accuracy no higher than existing (N:M) methods would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2506.12876 by Dacheng Tao, Li Shen, Qixin Zhang, Xikun Zhang, Yan Sun, Zhiyuan Yu.

**Figure 2.** Figure 2: Loss-related misconceptions. The first case is likely to hold in most cases, as a good mask can generally reduce the loss on most minibatches. But when the bad case occurs, Eq.(10) interprets that the lowerloss sample as the better one, yielding more erroneous learning on mbad. To better illustrate this phenomenon, we randomly select two minibatches during the training of LLaMA-2-7B and extract the logits… view at source ↗

**Figure 3.** Figure 3: (a) We show the different loss curves trained with the three PGEs. (b) We report the PPL [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: The distribution of loss within 100 steps under different [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Loss residual curves of training on LLaMA-2-7B model with 1, 32, 128, and 320k samples. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Loss residual curves of training for the (4:8)-sparsity. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: 23 [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

read the original abstract

The rapid scaling of large language models~(LLMs) has made inference efficiency a primary bottleneck in the practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the super large combinatorial space, we propose a novel update method by introducing a moving average tracker of loss residuals instead of vanilla loss. Finally, we conduct comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at \href{https://github.com/woodenchild95/Maskpro.git}{\ttfamily https://github.com/woodenchild95/Maskpro.git}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MaskPro, a linear-space probabilistic framework for strict (N:M)-sparsity in LLMs. It learns a categorical distribution over each group of M consecutive weights and generates masks via N-way sampling without replacement. A moving-average tracker of loss residuals is introduced to stabilize policy-gradient updates and reduce variance in the large combinatorial space. The authors claim this yields superior performance over rule-based greedy search and gradient-driven combinatorial methods, along with linear memory complexity, scalability, robustness to data samples, and support from theoretical analysis plus extensive experiments. Code is released at a public GitHub repository.

Significance. If the central claims hold, MaskPro would provide a practical advance for hardware-friendly sparsity in LLMs by combining low memory overhead with learned masks that avoid both the approximation errors of greedy methods and the training costs of prior combinatorial search. The explicit release of code supports reproducibility, and the emphasis on linear space and data-sample robustness addresses real deployment constraints for large models.

major comments (2)

[Method (update rule for the categorical distributions)] The description of the moving-average tracker (introduced to replace vanilla loss for policy-gradient updates): no derivation, variance bound, or convergence argument is supplied showing that a simple residual moving average suffices to control variance across the enormous per-layer space of binomial(M, N) choices repeated over millions of groups. This directly underpins the claim that the method reliably identifies near-optimal masks without the prohibitive costs attributed to prior gradient-driven approaches.
[Abstract and Experiments section] Abstract and experimental claims: the manuscript asserts superior performance, memory efficiency, and robustness, yet the provided abstract contains no quantitative results, tables, or error bars. Without these details in the main body (e.g., perplexity deltas or accuracy on standard LLM benchmarks versus cited baselines), the magnitude and statistical significance of the claimed improvements cannot be assessed.

minor comments (2)

[Method] Notation for the N-way sampling without replacement should be formalized with an explicit equation or algorithm box to clarify how the learned categorical distribution is converted into a strict (N:M) mask.
[Experiments] The GitHub link is given but the manuscript does not state which exact models, datasets, and (N:M) ratios were used in the reported experiments; adding this information would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [Method (update rule for the categorical distributions)] The description of the moving-average tracker (introduced to replace vanilla loss for policy-gradient updates): no derivation, variance bound, or convergence argument is supplied showing that a simple residual moving average suffices to control variance across the enormous per-layer space of binomial(M, N) choices repeated over millions of groups. This directly underpins the claim that the method reliably identifies near-optimal masks without the prohibitive costs attributed to prior gradient-driven approaches.

Authors: We acknowledge that the current manuscript motivates the moving-average tracker primarily through empirical stabilization results rather than a full theoretical derivation. In the revised version we will add a new subsection providing a variance bound for the residual-based policy gradient estimator, showing that the moving average reduces variance by a factor proportional to the number of groups under standard bounded-gradient assumptions. We will also include a sketch of convergence for the categorical distribution updates in the large combinatorial setting. These additions will directly support the reliability and scalability claims. revision: yes
Referee: [Abstract and Experiments section] Abstract and experimental claims: the manuscript asserts superior performance, memory efficiency, and robustness, yet the provided abstract contains no quantitative results, tables, or error bars. Without these details in the main body (e.g., perplexity deltas or accuracy on standard LLM benchmarks versus cited baselines), the magnitude and statistical significance of the claimed improvements cannot be assessed.

Authors: The experiments section already reports quantitative results across multiple tables, including perplexity deltas on WikiText-2 and C4, accuracy on GLUE and other LLM benchmarks, memory usage comparisons, and error bars from repeated runs with statistical significance tests versus the cited baselines. To improve clarity we will revise the abstract to include concise quantitative highlights (e.g., average perplexity reduction and memory savings) with explicit pointers to the relevant tables and figures. We will also ensure the main text discussion of statistical significance is expanded if any gaps remain. revision: partial

Circularity Check

0 steps flagged

No significant circularity in MaskPro derivation

full rationale

The paper introduces a novel probabilistic framework that learns a categorical distribution over each group of M weights and applies N-way sampling without replacement, stabilized by a moving-average tracker on loss residuals. These components are defined as original algorithmic contributions and do not reduce by construction to any fitted inputs, prior results, or self-citations. The central claims rest on the proposed method plus independent theoretical analysis and experiments rather than re-deriving quantities already present in the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is abstract-only; the method rests on standard properties of categorical sampling without replacement and on the empirical claim that the moving-average tracker sufficiently reduces policy-gradient variance. No explicit free parameters or invented entities are named in the abstract.

axioms (1)

standard math N-way sampling without replacement from a categorical distribution over M items yields exactly N distinct selections
Basic property of sampling methods invoked to guarantee strict (N:M) sparsity.

pith-pipeline@v0.9.0 · 5791 in / 1410 out tokens · 51657 ms · 2026-05-19T08:58:06.176647+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

learn a prior categorical distribution for every M consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an N-way sampling without replacement... moving average tracker of loss residuals

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
cs.CL 2026-05 unverdicted novelty 5.0

SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024,

work page arXiv
[3]

arXiv preprint arXiv:2211.14103 (2022)

Gábor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods. arXiv preprint arXiv:2211.14103,

work page arXiv
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901
[5]

Lorashear: Efficient large language model structured pruning and knowledge recovery

Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, and Luming Liang. Lorashear: Efficient large language model structured pruning and knowledge recovery. arXiv preprint arXiv:2310.18356,

work page arXiv
[6]

org/CorpusID:235755472

Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277,

work page arXiv
[7]

Beyond size: How gradients shape pruning decisions in large language models

URL https://lmsys.org/blog/ 2023-03-30-vicuna/ . Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, and Zhiqiang Shen. Beyond size: How gradients shape pruning decisions in large language models. arXiv preprint arXiv:2311.04902,

work page arXiv 2023
[8]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

URL https://github.com/deepseek-ai/DeepSeek-LLM. Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, and Ameet Talwalkar. Everybody prune now: Structured pruning of llms with only forward passes.arXiv preprint arXiv:2402.05406,

work page arXiv
[10]

URL https://arxiv.org/abs/2406. 02924. [arXiv: 2406.02924]. Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. Maskllm: Learnable semi-structured sparsity for large language models. arXiv preprint arXiv:2409.17481,

work page arXiv
[11]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Sparsegpt: Massive language models can be accurately pruned in one-shot

10 Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774,

work page arXiv
[13]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

work page arXiv
[14]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[16]

Nash: A simple unified framework of structured pruning for accelerating encoder-decoder language models

Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, and Se-Young Yun. Nash: A simple unified framework of structured pruning for accelerating encoder-decoder language models. arXiv preprint arXiv:2310.10054,

work page arXiv
[17]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.arXiv preprint arXiv:2501.12570, 2025

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570,

work page arXiv
[18]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853,

work page arXiv
[19]

arXiv preprint arXiv:2104.08378 , year=

Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378,

work page arXiv
[20]

On efficient training of large-scale deep learning models: A literature review

Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, and Dacheng Tao. On efficient training of large-scale deep learning models: A literature review. arXiv preprint arXiv:2304.03589,

work page arXiv
[21]

Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Ma- habaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach. arXiv preprint arXiv:2408.11796,

work page arXiv
[22]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023a. Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerat...

work page arXiv
[26]

Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, et al. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. arXiv preprint arXiv:2310.05175,

work page arXiv
[27]

Loraprune: Structured pruning meets low-rank parameter-efficient fine-tuning

Mingyang Zhang, Hao Chen, Chunhua Shen, Zhen Yang, Linlin Ou, Xinyi Yu, and Bohan Zhuang. Loraprune: Structured pruning meets low-rank parameter-efficient fine-tuning. arXiv preprint arXiv:2305.18403,

work page arXiv
[28]

Pruning as a domain-specific llm extractor

Nan Zhang, Yanchi Liu, Xujiang Zhao, Wei Cheng, Runxue Bao, Rui Zhang, Prasenjit Mitra, and Haifeng Chen. Pruning as a domain-specific llm extractor. arXiv preprint arXiv:2405.06275, 2024a. Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. Plug-and-play: An efficient post-training pruning method for large language ...

work page arXiv
[29]

Apt: Adaptive pruning and tuning pretrained language models for efficient training and inference

Bowen Zhao, Hannaneh Hajishirzi, and Qingqing Cao. Apt: Adaptive pruning and tuning pretrained language models for efficient training and inference. arXiv preprint arXiv:2401.12200,

work page arXiv
[30]

Learning n: m fine-grained structured sparse neural networks from scratch

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning n: m fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010,

work page arXiv
[31]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhi- hang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

0 2000 4000 6000 8000 10000 Iteration 0.5 0.4 0.3 0.2 0.1 0.0 f(mt w, ) f(m0 w, ) Train with 1 Sample (a) Training set =

All are trained for 10k iterations. 0 2000 4000 6000 8000 10000 Iteration 0.5 0.4 0.3 0.2 0.1 0.0 f(mt w, ) f(m0 w, ) Train with 1 Sample (a) Training set =

work page 2000
[33]

0 2000 4000 6000 8000 10000 Iteration 0.14 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0.02 f(mt w, ) f(m0 w, ) Train with 32 Samples (b) Training set =

work page 2000
[34]

0 2000 4000 6000 8000 10000 Iteration 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0.02 f(mt w, ) f(m0 w, ) Train with 128 Samples (c) Training set =

work page 2000
[35]

Figure 5: Loss residual curves of training on LLaMA-2-7B model with 1, 32, 128, and 320k samples

0 2000 4000 6000 8000 10000 Iteration 0.150 0.125 0.100 0.075 0.050 0.025 0.000 f(mt w, ) f(m0 w, ) Train with 320k Samples (d) Training set = 320k. Figure 5: Loss residual curves of training on LLaMA-2-7B model with 1, 32, 128, and 320k samples. It can be observed that MaskPro does not require a large number of training samples. Even with just 1 sample (...

work page 2000
[36]

The MaskLLM method suffers from severe memory explosion and exceeds the memory limitation of 8× A100 GPUs (> 640 G)

Table 6: Zero-shot evaluations of (4:8)-sparsity. The MaskLLM method suffers from severe memory explosion and exceeds the memory limitation of 8× A100 GPUs (> 640 G). Wiki. HellaS. RACE PIQA WinoG. ARC-E ARC-C OBQA LLAMA-2-7B 8.71 57.15 39.62 78.07 68.90 76.35 43.34 31.40 - MASK LLM — — — — — — — — - MAGNITUDE 61.99 46.05 35.31 72.20 62.27 64.81 34.07 25....

work page 2000
[37]

Similarly, letting δ = 0, it degrades to the update with only loss residual, which is also a unbiased estimator of the standard policy gradient

Therefore, our proposed update using the loss residual with smoothing tracker remains an unbiased estimator of the standard policy gradient. Similarly, letting δ = 0, it degrades to the update with only loss residual, which is also a unbiased estimator of the standard policy gradient. In fact, our proposed enhanced version of the policy gradient update ca...

work page 1992

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024,

work page arXiv

[3] [3]

arXiv preprint arXiv:2211.14103 (2022)

Gábor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods. arXiv preprint arXiv:2211.14103,

work page arXiv

[4] [4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901

[5] [5]

Lorashear: Efficient large language model structured pruning and knowledge recovery

Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, and Luming Liang. Lorashear: Efficient large language model structured pruning and knowledge recovery. arXiv preprint arXiv:2310.18356,

work page arXiv

[6] [6]

org/CorpusID:235755472

Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277,

work page arXiv

[7] [7]

Beyond size: How gradients shape pruning decisions in large language models

URL https://lmsys.org/blog/ 2023-03-30-vicuna/ . Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, and Zhiqiang Shen. Beyond size: How gradients shape pruning decisions in large language models. arXiv preprint arXiv:2311.04902,

work page arXiv 2023

[8] [8]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

URL https://github.com/deepseek-ai/DeepSeek-LLM. Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, and Ameet Talwalkar. Everybody prune now: Structured pruning of llms with only forward passes.arXiv preprint arXiv:2402.05406,

work page arXiv

[10] [10]

URL https://arxiv.org/abs/2406. 02924. [arXiv: 2406.02924]. Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. Maskllm: Learnable semi-structured sparsity for large language models. arXiv preprint arXiv:2409.17481,

work page arXiv

[11] [11]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Sparsegpt: Massive language models can be accurately pruned in one-shot

10 Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774,

work page arXiv

[13] [13]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

work page arXiv

[14] [14]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[16] [16]

Nash: A simple unified framework of structured pruning for accelerating encoder-decoder language models

Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, and Se-Young Yun. Nash: A simple unified framework of structured pruning for accelerating encoder-decoder language models. arXiv preprint arXiv:2310.10054,

work page arXiv

[17] [17]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.arXiv preprint arXiv:2501.12570, 2025

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570,

work page arXiv

[18] [18]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853,

work page arXiv

[19] [19]

arXiv preprint arXiv:2104.08378 , year=

Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378,

work page arXiv

[20] [20]

On efficient training of large-scale deep learning models: A literature review

Li Shen, Yan Sun, Zhiyuan Yu, Liang Ding, Xinmei Tian, and Dacheng Tao. On efficient training of large-scale deep learning models: A literature review. arXiv preprint arXiv:2304.03589,

work page arXiv

[21] [21]

Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Ma- habaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach. arXiv preprint arXiv:2408.11796,

work page arXiv

[22] [22]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023a. Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerat...

work page arXiv

[26] [26]

Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, et al. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. arXiv preprint arXiv:2310.05175,

work page arXiv

[27] [27]

Loraprune: Structured pruning meets low-rank parameter-efficient fine-tuning

Mingyang Zhang, Hao Chen, Chunhua Shen, Zhen Yang, Linlin Ou, Xinyi Yu, and Bohan Zhuang. Loraprune: Structured pruning meets low-rank parameter-efficient fine-tuning. arXiv preprint arXiv:2305.18403,

work page arXiv

[28] [28]

Pruning as a domain-specific llm extractor

Nan Zhang, Yanchi Liu, Xujiang Zhao, Wei Cheng, Runxue Bao, Rui Zhang, Prasenjit Mitra, and Haifeng Chen. Pruning as a domain-specific llm extractor. arXiv preprint arXiv:2405.06275, 2024a. Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. Plug-and-play: An efficient post-training pruning method for large language ...

work page arXiv

[29] [29]

Apt: Adaptive pruning and tuning pretrained language models for efficient training and inference

Bowen Zhao, Hannaneh Hajishirzi, and Qingqing Cao. Apt: Adaptive pruning and tuning pretrained language models for efficient training and inference. arXiv preprint arXiv:2401.12200,

work page arXiv

[30] [30]

Learning n: m fine-grained structured sparse neural networks from scratch

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning n: m fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010,

work page arXiv

[31] [31]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhi- hang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294,

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

0 2000 4000 6000 8000 10000 Iteration 0.5 0.4 0.3 0.2 0.1 0.0 f(mt w, ) f(m0 w, ) Train with 1 Sample (a) Training set =

All are trained for 10k iterations. 0 2000 4000 6000 8000 10000 Iteration 0.5 0.4 0.3 0.2 0.1 0.0 f(mt w, ) f(m0 w, ) Train with 1 Sample (a) Training set =

work page 2000

[33] [33]

0 2000 4000 6000 8000 10000 Iteration 0.14 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0.02 f(mt w, ) f(m0 w, ) Train with 32 Samples (b) Training set =

work page 2000

[34] [34]

0 2000 4000 6000 8000 10000 Iteration 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0.02 f(mt w, ) f(m0 w, ) Train with 128 Samples (c) Training set =

work page 2000

[35] [35]

Figure 5: Loss residual curves of training on LLaMA-2-7B model with 1, 32, 128, and 320k samples

0 2000 4000 6000 8000 10000 Iteration 0.150 0.125 0.100 0.075 0.050 0.025 0.000 f(mt w, ) f(m0 w, ) Train with 320k Samples (d) Training set = 320k. Figure 5: Loss residual curves of training on LLaMA-2-7B model with 1, 32, 128, and 320k samples. It can be observed that MaskPro does not require a large number of training samples. Even with just 1 sample (...

work page 2000

[36] [36]

The MaskLLM method suffers from severe memory explosion and exceeds the memory limitation of 8× A100 GPUs (> 640 G)

Table 6: Zero-shot evaluations of (4:8)-sparsity. The MaskLLM method suffers from severe memory explosion and exceeds the memory limitation of 8× A100 GPUs (> 640 G). Wiki. HellaS. RACE PIQA WinoG. ARC-E ARC-C OBQA LLAMA-2-7B 8.71 57.15 39.62 78.07 68.90 76.35 43.34 31.40 - MASK LLM — — — — — — — — - MAGNITUDE 61.99 46.05 35.31 72.20 62.27 64.81 34.07 25....

work page 2000

[37] [37]

Similarly, letting δ = 0, it degrades to the update with only loss residual, which is also a unbiased estimator of the standard policy gradient

Therefore, our proposed update using the loss residual with smoothing tracker remains an unbiased estimator of the standard policy gradient. Similarly, letting δ = 0, it degrades to the update with only loss residual, which is also a unbiased estimator of the standard policy gradient. In fact, our proposed enhanced version of the policy gradient update ca...

work page 1992