LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

Mahyar Najibi; Mohammad Mozaffari; Mohammad Rastegari; Younes Hourri

arxiv: 2605.17289 · v1 · pith:OXAK6PTBnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi

Pith reviewed 2026-05-20 13:31 UTC · model grok-4.3

keywords LLM pruning · unstructured sparsity · end-to-end learning · Gumbel-sigmoid · model compression · zero-shot accuracy · learnable masks

The pith

LEAP makes end-to-end unstructured pruning practical for LLMs by learning per-weight masks with a Gumbel-sigmoid relaxation, which improves zero-shot accuracy over layer-wise baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that layer-wise pruning approaches lose accuracy at high sparsity because they optimize masks independently per layer and ignore cross-layer dependencies. LEAP replaces earlier end-to-end mask schemes with a simple per-weight Bernoulli distribution relaxed through the Gumbel-sigmoid trick, turning mask selection into a continuous, differentiable process that can be trained jointly across the whole model. Experiments on five LLM families between 0.5B and 8B parameters at 50% and 60% sparsity show an average gain of 2.59 points on a six-task zero-shot suite compared with the strongest layer-wise competitor. A reader would care because recent hardware now runs unstructured sparse models efficiently, so a better pruning algorithm directly reduces memory and latency while preserving capability.

Core claim

By replacing the intractable categorical-over-patterns parameterization of prior end-to-end methods with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation, LEAP renders end-to-end learning of unstructured sparsity masks tractable for large language models and produces higher retained accuracy than layer-wise surrogates derived from the Optimal Brain Surgeon principle.

What carries the argument

Per-weight Bernoulli-via-Gumbel-sigmoid relaxation that turns mask selection into a differentiable optimization variable.
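
As a rough illustration of that relaxation, the sketch below draws a soft mask from per-weight logits using the standard binary Gumbel-sigmoid construction. The function names, the nested-list layout, and the fixed seed are assumptions made for this example; the paper's own implementation, initialization, and budget handling are not reproduced here.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid(logit: float, tau: float, rng: random.Random) -> float:
    """Draw one relaxed Bernoulli sample between 0 and 1.

    The difference of two independent Gumbel(0, 1) draws is logistic noise,
    sampled here as log(u) - log(1 - u) with u ~ Uniform(0, 1).
    """
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + noise) / tau)


def relaxed_mask(logits, tau, seed=0):
    """Sample a soft mask with one independent entry per weight (hypothetical helper)."""
    rng = random.Random(seed)
    return [[gumbel_sigmoid(l, tau, rng) for l in row] for row in logits]
```

Thresholding a sample at 0.5 keeps the weight with probability `sigmoid(logit)`, so each logit parameterizes an independent Bernoulli keep-probability, while the temperature `tau` only controls how close the soft value sits to 0 or 1.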

If this is right

Higher zero-shot accuracy is retained at 50% and 60% unstructured sparsity.
The gains appear consistently across five LLM families from 0.5B to 8B parameters.
End-to-end mask learning closes the accuracy gap that layer-wise methods exhibit under aggressive pruning.
The same relaxation removes the scaling barrier that prevented earlier end-to-end methods from handling unstructured sparsity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The technique may extend naturally to joint optimization of sparsity and quantization because both become differentiable under similar relaxations.
For models larger than 8B the value of cross-layer mask coordination is likely to grow because layer-wise decisions compound more severely.
If the continuous-to-discrete gap remains small, post-pruning fine-tuning steps could be shortened or eliminated in some deployment pipelines.

Load-bearing premise

The continuous relaxation must produce masks whose behavior after rounding or sampling stays close enough to the optimized continuous trajectory that the final discrete model retains the accuracy gains.

What would settle it

Train masks with the continuous relaxation, then evaluate zero-shot accuracy using the rounded discrete masks; if the accuracy lift over ADMM, the strongest layer-wise baseline in the paper's sweep, vanishes or reverses, the central claim does not hold.
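
One way such a check could be set up is sketched below. The rounding rule (keep the highest logits until a global budget is met) and the gap metric are assumptions chosen for illustration; the source does not state how the paper discretizes its masks.

```python
def hard_mask_from_logits(logits, sparsity):
    """Keep the highest-logit weights so that a `sparsity` fraction is pruned globally.

    Assumed rounding rule for illustration only.
    """
    flat = [(l, i, j) for i, row in enumerate(logits) for j, l in enumerate(row)]
    n_keep = len(flat) - int(round(sparsity * len(flat)))
    flat.sort(key=lambda t: t[0], reverse=True)
    mask = [[0] * len(row) for row in logits]
    for _, i, j in flat[:n_keep]:
        mask[i][j] = 1
    return mask


def relaxation_gap(soft_mask, hard_mask):
    """Mean absolute difference between a soft mask and its discrete counterpart."""
    diffs = [
        abs(s - h)
        for soft_row, hard_row in zip(soft_mask, hard_mask)
        for s, h in zip(soft_row, hard_row)
    ]
    return sum(diffs) / len(diffs)
```

A gap near zero means the optimized soft mask is already almost binary; the decisive measurement, though, remains zero-shot accuracy evaluated with the hard mask.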

Figures

Figures reproduced from arXiv: 2605.17289 by Mahyar Najibi, Mohammad Mozaffari, Mohammad Rastegari, Younes Hourri.

**Figure 1.** Learned per-block density at a global unstructured sparsity budget of 50% and 60% for Qwen-2.5 0.5B, Gemma-3 1B, LLaMA-3.2 1B, and LLaMA-3.2 3B. Masks are initialized from Wanda and trained for 2,000 steps. The grey dashed line marks the global budget.
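
To make the plotted quantity concrete, here is a minimal sketch of how per-block density relates to a global budget. The block names and the nested-list mask layout are hypothetical; the paper's data structures are not shown in the source.

```python
def block_density(mask):
    """Fraction of weights kept (mask value 1) in one block."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


def density_report(masks_by_block):
    """Return per-block densities and the global density over all blocks."""
    per_block = {name: block_density(m) for name, m in masks_by_block.items()}
    total = sum(len(row) for m in masks_by_block.values() for row in m)
    kept = sum(sum(row) for m in masks_by_block.values() for row in m)
    return per_block, kept / total
```

With a global budget, individual blocks are free to sit above or below the budget line as long as the overall density matches it, which is the kind of non-uniform allocation the figure reports.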

Original abstract

Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
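
The scaling argument in the abstract can be checked with a short count. The sketch below assumes that a categorical parameterization needs one choice per valid mask pattern, and it uses a row width of 4096 purely as an illustrative size; both function names are made up for this example.

```python
import math


def n_patterns_nm(n: int, m: int) -> int:
    """Number of valid masks that keep n weights in a group of m (N:M sparsity)."""
    return math.comb(m, n)


def n_patterns_unstructured_row(width: int, sparsity: float) -> int:
    """Number of valid masks for one row when any subset of the right size is allowed."""
    kept = width - int(round(width * sparsity))
    return math.comb(width, kept)


# 2:4 sparsity: a handful of candidate patterns per group of four weights.
print(n_patterns_nm(2, 4))  # 6

# Unstructured 50% sparsity: the count explodes with row width.
print(n_patterns_unstructured_row(16, 0.5))  # 12870
print(len(str(n_patterns_unstructured_row(4096, 0.5))), "decimal digits")
```

By contrast, a per-weight Bernoulli parameterization needs only one logit per weight, so its parameter count grows linearly with the row width rather than combinatorially.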

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LEAP swaps in a per-weight Gumbel-sigmoid relaxation to make end-to-end unstructured pruning tractable and reports a 2.59-point accuracy lift over ADMM, but the continuous-to-discrete alignment still needs checking.

The main takeaway is that LEAP makes end-to-end mask learning practical for unstructured sparsity in LLMs by using a per-weight Bernoulli distribution relaxed through Gumbel-sigmoid, which sidesteps the scaling issues of prior categorical methods.

What is new is the shift to independent per-weight decisions with this specific relaxation. MaskLLM and PATCH learn over mask patterns, which works for structured sparsity but not for the unstructured case the paper targets. By going per-weight, the authors can optimize the entire model jointly while keeping computation feasible. The results suggest this pays off: across models from 0.5B to 8B parameters at 50% and 60% sparsity, they report a 2.59-point average improvement on six zero-shot tasks over ADMM.

The work does a decent job of positioning itself against layer-wise surrogates derived from Optimal Brain Surgeon and of showing why end-to-end learning can help under aggressive sparsity. The experimental setup sweeps multiple model families, which adds some credibility to the claim.

Where it feels light is in the details around the relaxation itself. The abstract gives the accuracy delta but reports neither variance across runs nor any check on how sensitive the outcome is to the Gumbel temperature. The assumption that the continuous optimization produces masks whose rounded version matches the trained trajectory is central; if that link is loose, the advantage over cheaper layer-wise methods could disappear. The stress-test concern that the continuous and discrete masks decouple at these sparsity levels is reasonable to probe.

This paper will interest readers who focus on model compression and efficient inference for LLMs. Someone looking for new pruning algorithms that can be trained jointly will find the parameterization worth examining. I would recommend sending it for peer review. The core technical change is clear and the empirical comparison is there, so a referee can ask for the missing controls and assess whether the gains are robust.

Referee Report

2 major / 2 minor

Summary. The paper introduces LEAP, which parameterizes unstructured sparsity masks for LLMs via a per-weight Bernoulli distribution relaxed through Gumbel-sigmoid. This enables end-to-end gradient-based optimization of the masks for zero-shot task performance, in contrast to layer-wise surrogate methods such as ADMM. The central empirical claim is an average +2.59 point gain in six-task zero-shot accuracy over ADMM across five model families (0.5B–8B) at 50% and 60% sparsity.

Significance. If the reported gains prove robust, the work would be significant because it demonstrates that a tractable end-to-end formulation can outperform established layer-wise pruning baselines precisely in the unstructured regime now supported by GPU kernels. The per-weight Gumbel relaxation removes the combinatorial scaling that limited prior end-to-end methods (MaskLLM, PATCH) to structured patterns, thereby opening a path to direct task-aware unstructured pruning.

major comments (2)

§3.2 (Gumbel-sigmoid relaxation): The manuscript presents the continuous surrogate but provides no quantitative analysis or ablation showing that the optimized continuous masks, after rounding or sampling, yield discrete performance that correlates with the surrogate loss. This correlation is load-bearing for the claim that the observed +2.59 point gain arises from end-to-end optimization rather than from a more expensive layer-wise surrogate.
§4.2 and Table 2: The +2.59 average improvement is reported without per-run standard deviations, statistical significance tests, or an ablation on the Gumbel temperature annealing schedule. Given that the central claim rests on these empirical deltas, the absence of these controls leaves open the possibility that the gains are sensitive to random seeds or hyper-parameter choices.

minor comments (2)

Figure 3: The legend and axis labels do not explicitly state whether the plotted curves correspond to continuous surrogate loss or to discrete zero-shot accuracy after mask discretization.
Related-work section: The discussion of ADMM could include a brief equation-level comparison showing how LEAP's per-weight relaxation differs from ADMM's layer-wise quadratic approximation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of our results.

Referee: §3.2 (Gumbel-sigmoid relaxation): The manuscript presents the continuous surrogate but provides no quantitative analysis or ablation showing that the optimized continuous masks, after rounding or sampling, yield discrete performance that correlates with the surrogate loss. This correlation is load-bearing for the claim that the observed +2.59 point gain arises from end-to-end optimization rather than from a more expensive layer-wise surrogate.

Authors: We agree that an explicit analysis correlating the continuous surrogate loss with final discrete performance after rounding would strengthen the justification for the end-to-end formulation. In the revised manuscript, we will add a new ablation subsection that reports the correlation between surrogate objective values and the resulting zero-shot accuracy on the discrete masks across the evaluated models and sparsity levels. This will include quantitative metrics (e.g., Pearson correlation) and visualizations to show that improvements in the relaxed objective translate to discrete gains, thereby supporting that the reported benefits derive from the end-to-end optimization rather than surrogate artifacts.

revision: yes

Referee: §4.2 and Table 2: The +2.59 average improvement is reported without per-run standard deviations, statistical significance tests, or an ablation on the Gumbel temperature annealing schedule. Given that the central claim rests on these empirical deltas, the absence of these controls leaves open the possibility that the gains are sensitive to random seeds or hyper-parameter choices.

Authors: We concur that additional statistical controls and robustness checks are warranted given the centrality of the empirical gains. In the revision, we will rerun the main experiments over multiple random seeds (at least 3 per configuration) and report mean accuracies with standard deviations in Table 2. We will also add statistical significance tests (e.g., paired t-tests with p-values) comparing LEAP against the ADMM baseline. Additionally, we will include an ablation study on the Gumbel temperature annealing schedule, testing variations in initial temperature and decay rate, to confirm that the observed improvements are not overly sensitive to these choices. These elements will be incorporated into the updated §4.2 and supplementary material.

revision: yes
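
The controls promised in these responses are simple to compute. The sketch below shows a per-method mean and standard deviation across seeds and a paired t-statistic on per-seed differences; the helper names are hypothetical, and any numbers passed in would be the reader's own measurements, since the source reports no per-seed results.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of accuracies across seeds."""
    return statistics.fmean(scores), statistics.stdev(scores)


def paired_t_statistic(method_a, method_b):
    """Paired t-statistic and degrees of freedom for per-seed score pairs.

    Each position in the two lists must come from the same seed/configuration.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1
```

Turning the statistic into a p-value requires the Student t distribution, which the Python standard library does not provide; with only three seeds per configuration the test has two degrees of freedom, so even sizeable mean differences may not reach conventional significance thresholds.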

Circularity Check

0 steps flagged

LEAP introduces independent Gumbel-sigmoid parameterization with empirical gains; no derivation reduces to fitted inputs or self-citation

The paper's core step replaces categorical mask parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation to enable end-to-end unstructured pruning. Reported gains (+2.59 zero-shot accuracy over ADMM) are measured on held-out tasks across model families rather than derived from any quantity fitted inside the optimization or from prior self-citations. No equation equates the final discrete mask performance to the continuous surrogate by construction, and the method does not invoke uniqueness theorems or ansatzes from the authors' own prior work as load-bearing justification. This yields a minor score reflecting normal self-citation patterns without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the Gumbel-sigmoid continuous relaxation as a faithful proxy for discrete mask selection and on the assumption that end-to-end gradient flow through this proxy yields masks superior to layer-wise surrogates.

free parameters (1)

Gumbel temperature
Controls the sharpness of the sigmoid relaxation; must be chosen or annealed and directly affects the quality of the learned masks.
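
For intuition about this parameter, the sketch below pairs a geometric annealing schedule with the noise-free relaxed mask value it produces. The schedule shape and the default start and end temperatures are assumptions for illustration; the source does not state which schedule or values the paper uses.

```python
import math


def exponential_tau(step, total_steps, tau_start=2.0, tau_end=0.1):
    """Interpolate the temperature geometrically from tau_start to tau_end.

    The default endpoints are placeholder values, not the paper's settings.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac


def soft_value(logit, tau):
    """Noise-free relaxed mask value sigmoid(logit / tau), computed stably."""
    x = logit / tau
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)
```

As the temperature falls, `soft_value` for a fixed positive logit moves toward 1 and for a fixed negative logit toward 0, so late in training the soft mask approaches a binary one; this is the sense in which the temperature controls sharpness.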

axioms (1)

domain assumption: The straight-through estimator or reparameterization gradient through the Gumbel-sigmoid yields unbiased or low-bias updates for the mask probabilities.
Standard assumption in Gumbel-softmax literature but remains an approximation whose error grows with temperature mismatch.
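
For reference, the reparameterization this assumption concerns can be written out. The notation below (logit θ, temperature τ, logistic noise ε, relaxed mask value m̃) is introduced here for illustration, and this is the standard binary Gumbel-sigmoid form, not necessarily the exact estimator used in the paper.

```latex
\begin{aligned}
\epsilon &= \log u - \log(1 - u), \qquad u \sim \mathrm{Uniform}(0, 1) \\
\tilde{m} &= \sigma\!\left(\frac{\theta + \epsilon}{\tau}\right) \\
\frac{\partial \tilde{m}}{\partial \theta} &= \frac{1}{\tau}\,\tilde{m}\,(1 - \tilde{m})
\end{aligned}
```

A straight-through variant would apply the hard mask $\mathbb{1}[\tilde{m} > 1/2]$ in the forward pass while backpropagating this derivative. For small τ the derivative is large near the decision boundary and close to zero once m̃ saturates, which is one plausible reading of why the approximation error depends on the temperature.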

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 5 internal anchors

[1] Boža, V. Fast and effective weight update for pruned large language models. arXiv preprint arXiv:2401.02938,

[2] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457,

[3] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,

[4] Hourri, Y., Mozaffari, M., and Dehnavi, M. M. PATCH: Learnable tile-level hybrid sparsity for large language models. arXiv preprint arXiv:2509.23410,

[6] URL https://arxiv.org/abs/2504.05346. Lie, S. Harnessing the power of sparsity for large GPT AI models. Technical report, Cerebras Systems,

[7] Macko, V. and Boža, V. MACKO: Sparse matrix-vector multiplication for low sparsity. arXiv preprint arXiv:2511.13061,

[8] Mishra, A., Latorre, J. A., Pool, J., Stosic, D., Stosic, D., Venkatesh, G., Yu, C., and Micikevicius, P. Accelerating sparse deep neural networks. In arXiv preprint arXiv:2104.08378,

[9] Mozaffari, M., Kushnir, S., Dehnavi, M. M., and Yazdanbakhsh, A. Optima: Optimal one-shot pruning for llms via quadratic programming reconstruction. arXiv preprint arXiv:2512.13886, 2025a. Mozaffari, M., Yazdanbakhsh, A., and Mehri Dehnavi, M. SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs. In Forty-second International Conference ...

[10] Accessed: 2026-04-24. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67,

[11] Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786,

[12] Qwen2.5 Technical Report. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y. ...

[13] Each table covers one model at 50% and 60% unstructured sparsity across the six zero-shot tasks (MMLU, PIQA, ARC-E, ARC-C, Winogrande, OBQA) plus WikiText2 perplexity at sequence length 4096. The MaskLLM† rows are 2:4 semi-structured at 50% density and are reproduced from (Hourri et al., 2025). Table 5. Per-task results on Qwen-2.5 0.5B. PPL is on WikiText2...
