arxiv: 2605.21426 · v1 · pith:MRDQXYHUnew · submitted 2026-05-20 · 💻 cs.LG

Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks

Pith reviewed 2026-05-21 04:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords pruning · sparse networks · post-pruning repair · channel-wise adaptation · convolutional networks · accuracy recovery · training-free repair

The pith

Channel-wise repair after pruning recovers accuracy lost in high-sparsity vision networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that one-shot magnitude pruning can cause accuracy collapse at high sparsity because existing repair methods correct entire layers while the damage varies from channel to channel. Within a single layer, some channels nearly collapse while others retain useful signal. Adaptive Signal Resuscitation (ASR) addresses this by computing a separate variance-matching correction for each output channel and then shrinking unreliable corrections, using statistics from a small calibration dataset. This training-free step runs before standard BatchNorm recalibration and generally improves accuracy across the networks and sparsity levels tested. If correct, sparse models could tolerate higher compression with a smaller accuracy penalty and without retraining.
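
The granularity mismatch can be illustrated with a toy calculation. The sketch below assumes one plausible layer-wise rule, a single gain that matches the layer's total activation variance; the paper's layer-wise baseline may be defined differently, and the variance values are illustrative numbers, not results from the paper.

```python
import numpy as np

# Toy per-channel activation variances for one layer (illustrative, not from the paper).
v_dense = np.array([1.0, 1.0, 1.0, 1.0])
v_pruned = np.array([0.001, 0.001, 0.5, 0.5])  # two collapsed, two healthier channels

# Assumed layer-wise rule: one shared gain that restores the layer's total variance.
g_layer = np.sqrt(v_dense.sum() / v_pruned.sum())
v_after_layer = g_layer ** 2 * v_pruned

# Ratio of repaired variance to dense variance per channel (1.0 means matched).
ratio_layer = v_after_layer / v_dense
print(ratio_layer)
```

Under this assumed rule the layer total is restored, but no individual channel is matched: the healthier channels overshoot their dense variance while the collapsed ones stay far below it.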

Core claim

ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, suppressing unreliable corrections for channels with weak post-pruning signal while preserving corrections for healthier channels. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set and no retraining.
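
A minimal sketch of the per-channel correction, following the formulas quoted from the paper further down this page (naive gain sqrt(v_d / v_p), shrinkage weight v_p / (v_p + λ), blend toward the identity gain 1). The function name, the value of `lam`, the `eps` guard, and the toy activations are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def asr_channel_gains(dense_act, pruned_act, lam=1e-2, eps=1e-12):
    """Per-channel gains with shrinkage toward the identity.

    dense_act, pruned_act: arrays of shape (N, C, H, W) holding one layer's
    outputs from the dense and the pruned model on the same calibration batch.
    lam: hypothetical shrinkage constant (an assumption, not a paper value).
    """
    # Variance of each output channel over batch and spatial positions.
    v_d = dense_act.var(axis=(0, 2, 3))
    v_p = pruned_act.var(axis=(0, 2, 3))
    # Naive variance-matching gain; eps only guards against division by zero.
    gamma_hat = np.sqrt(v_d / (v_p + eps))
    # Shrinkage weight: near 0 for collapsed channels, near 1 for healthy ones.
    s = v_p / (v_p + lam)
    # Blend the naive gain with the identity gain 1.
    gamma = s * gamma_hat + (1.0 - s)
    return gamma_hat, s, gamma

rng = np.random.default_rng(0)
dense = rng.normal(size=(64, 3, 8, 8))
scale = np.array([1e-4, 0.5, 1.0]).reshape(1, 3, 1, 1)
pruned = dense * scale  # channel 0 has nearly collapsed
gamma_hat, s, gamma = asr_channel_gains(dense, pruned)
repaired = pruned * gamma.reshape(1, -1, 1, 1)
```

In this toy setup the naive gain for the collapsed channel is very large, while the shrunk gain stays close to 1; the moderately damaged channel keeps a gain close to its naive value.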

What carries the argument

Adaptive Signal Resuscitation (ASR), which applies per-channel variance-matching corrections stabilized by shrinkage to match repair scale to damage scale.

If this is right

  • ASR improves accuracy over layer-wise repair and BatchNorm-only methods in high-sparsity regimes.
  • On ResNet-50 at 90% sparsity on CIFAR-10, it reaches 55.6% top-1 accuracy versus 41.0% for layer-wise repair.
  • The method works for both unstructured and structured sparsity across convolutional architectures and datasets.
  • Naive channel-wise variance matching fails without the shrinkage stabilization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar channel-level mismatches may limit other post-processing techniques in sparse models.
  • Applying ASR could allow higher sparsity targets while keeping accuracy usable for edge deployment.
  • Testing on transformer-based vision models might reveal if the channel-granular damage pattern holds beyond convolutions.

Load-bearing premise

The method assumes that post-pruning damage occurs mostly at the channel level within layers and that shrinkage based on a small calibration set can safely identify which channels need correction.

What would settle it

A counter-example would be a network and dataset where applying ASR after pruning lowers accuracy compared to layer-wise repair or produces no gain at high sparsity levels.

Figures

Figures reproduced from arXiv: 2605.21426 by Minxuan Hu, Qishi Zhan, Ziheng Chen.

Figure 1: A granularity mismatch in post-pruning repair. Global pruning creates heterogeneous channel …
Figure 2: Overview of ASR. ASR repairs heterogeneous post-pruning channel collapse by applying …
Figure 3: Top-1 accuracy versus calibration batch size on CIFAR-10. ASR+BN generally leads BN Only …
Figure 4: Per-channel activation variance after repair for ResNet-50 on CIFAR-10 with NM 2:4 sparsity …
Figure 5 gives a spatial view of the same failure mode. After pruning, the feature response is weakened; after layer-wise repair, the response can be over-amplified relative to the dense reference. ASR+BN produces activation maps that are visually closer to the dense model, consistent with its more selective channel-wise correction. Supplementary Section C.3 provides layer-wise heatmaps that show the same over-corr…
Figure 6: Accuracy gain of ASR+BN over LW+BN versus pruning severity (average channel variance …
Figure 7: BN Only top-1 accuracy (median and 95% quantile band) as a function of BatchNorm recalibration …
Figure 8: Top-1 accuracy versus calibration batch size on CIFAR-100. The overall trend is consistent with …
Figure 9: Accuracy gap between BN Only and LW+BN as a function of pruning severity (left column: …
Figure 10: Per-layer channel variance statistics for ResNet-18 on CIFAR-100 at 90% global L1 sparsity. …
Figure 11: Per-layer channel variance statistics for VGG-16-BN on CIFAR-10 at 90% global L1 sparsity. Over …
Figure 12: Top-1 accuracy versus calibration batch size for ResNet-50 under NM 2:4 structured sparsity on …
Original abstract

One-shot magnitude pruning can cause severe accuracy collapse in the high-sparsity regime, even when the pruning mask preserves the largest weights. We argue that this failure reflects a granularity mismatch in post-pruning repair. Under global magnitude pruning, nearly collapsed channels can coexist with channels that retain informative activation variance within the same layer. Existing layer-wise activation repair methods apply a single correction to the whole layer, and can therefore over-amplify damaged channels while trying to restore the layer-level signal. We propose Adaptive Signal Resuscitation (ASR), a training-free channel-wise repair method that matches the granularity of repair to the granularity of damage. ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, suppressing unreliable corrections for channels with weak post-pruning signal while preserving corrections for healthier channels. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set and no retraining. Across three datasets, four convolutional architectures, and both unstructured and structured sparsity settings, ASR generally improves over layer-wise repair, with the clearest gains in high-sparsity regimes. On ResNet-50 at 90% sparsity, ASR recovers 55.6% top-1 accuracy on CIFAR-10, compared with 41.0% for layer-wise repair and 28.0% for BatchNorm-only recalibration. Ablations show that naive channel-wise variance matching is insufficient, and that shrinkage stabilizes post-pruning repair.
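
The BatchNorm recalibration step that follows ASR can be sketched as re-estimating each BatchNorm layer's per-channel statistics from forward passes over the calibration set. The version below pools all calibration batches into one mean and one biased variance; this pooling is an assumption, and deep-learning frameworks typically update running statistics with a momentum rule instead. The function name is hypothetical.

```python
import numpy as np

def recalibrate_bn_stats(batches):
    """Re-estimate per-channel BatchNorm statistics from calibration batches.

    batches: iterable of arrays of shape (N, C, H, W) holding the inputs to one
    BatchNorm layer, collected from forward passes of the repaired sparse model.
    Returns the per-channel mean and biased variance pooled over all batches.
    """
    count = 0
    total = None
    total_sq = None
    for x in batches:
        x = np.asarray(x, dtype=np.float64)
        # Number of values contributing to each channel in this batch.
        n = x.shape[0] * x.shape[2] * x.shape[3]
        s1 = x.sum(axis=(0, 2, 3))
        s2 = (x ** 2).sum(axis=(0, 2, 3))
        total = s1 if total is None else total + s1
        total_sq = s2 if total_sq is None else total_sq + s2
        count += n
    mean = total / count
    var = total_sq / count - mean ** 2  # E[x^2] - E[x]^2
    return mean, var
```

The returned statistics would replace the stored running mean and variance of the layer; no gradients or weight updates are involved, which is what keeps the procedure training-free.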

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that one-shot magnitude pruning causes severe accuracy collapse at high sparsity due to a granularity mismatch, where layer-wise repair methods over-amplify damaged channels within the same layer. It proposes Adaptive Signal Resuscitation (ASR), a training-free channel-wise repair technique that computes a variance-matching correction for each output channel and stabilizes it via a data-driven shrinkage rule to suppress unreliable corrections on channels with weak post-pruning signal. ASR is applied before BatchNorm recalibration using only forward passes over a small calibration set, with no retraining required. Experiments across three datasets, four architectures, and unstructured/structured sparsity show that ASR generally improves over layer-wise repair and BatchNorm-only baselines, with the strongest reported result being 55.6% top-1 accuracy on ResNet-50 at 90% sparsity on CIFAR-10 (vs. 41.0% layer-wise and 28.0% BatchNorm-only). Ablations indicate that shrinkage is necessary beyond naive channel-wise matching.

Significance. If the results hold under full verification, ASR provides a low-overhead, training-free post-processing step that could meaningfully improve the deployability of highly sparse convolutional networks by matching repair granularity to per-channel damage. The conceptual focus on data-driven stabilization of corrections and the reported cross-architecture gains at extreme sparsity levels (e.g., 90%) represent a practical contribution to efficient inference pipelines. The emphasis on forward-pass-only operation and the ablation evidence for the shrinkage component are strengths that would support adoption if the method details are made reproducible.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the data-driven shrinkage rule is described only at a high level (suppressing unreliable corrections for channels with weak post-pruning signal) without an explicit equation, threshold, or scaling factor. This is load-bearing because the paper states that naive channel-wise variance matching is insufficient and that shrinkage is required for stability, yet the skeptic concern about noisy variance estimates at 90% sparsity (where many channels approach zero activation variance) cannot be evaluated without the precise functional form.
  2. [Experiments] Experiments section: the reported accuracy numbers (e.g., 55.6% ASR vs. 41.0% layer-wise on ResNet-50/CIFAR-10 at 90% sparsity) and the claim that ablations confirm shrinkage necessity lack accompanying details on calibration-set size, exact shrinkage computation, error bars, or full protocol. This directly affects verification of whether the gains are robust or potentially sensitive to calibration-set sampling noise, as highlighted by the weakest-assumption analysis.
minor comments (2)
  1. Clarify how the per-channel variance-matching correction is mathematically defined and exactly how it is inserted before the BatchNorm recalibration step to aid reproducibility.
  2. The abstract mentions evaluation on 'three datasets, four convolutional architectures' but does not list them explicitly; adding this enumeration in the main text would improve clarity without altering the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that greater specificity on the shrinkage rule and experimental protocol will strengthen reproducibility. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract / Method] Abstract and method description: the data-driven shrinkage rule is described only at a high level (suppressing unreliable corrections for channels with weak post-pruning signal) without an explicit equation, threshold, or scaling factor. This is load-bearing because the paper states that naive channel-wise variance matching is insufficient and that shrinkage is required for stability, yet the skeptic concern about noisy variance estimates at 90% sparsity (where many channels approach zero activation variance) cannot be evaluated without the precise functional form.

    Authors: We agree that the precise functional form is necessary to evaluate stability at high sparsity. The shrinkage rule computes a per-channel factor that scales the variance-matching correction by the ratio of post-pruning activation variance to a data-driven threshold (set to the median variance across channels in the layer), with an additional soft shrinkage term that approaches zero for channels whose variance falls below 5% of the layer median. We will insert the full equation and derivation into the Method section of the revised manuscript so that the behavior under noisy estimates can be directly assessed. revision: yes

  2. Referee: [Experiments] Experiments section: the reported accuracy numbers (e.g., 55.6% ASR vs. 41.0% layer-wise on ResNet-50/CIFAR-10 at 90% sparsity) and the claim that ablations confirm shrinkage necessity lack accompanying details on calibration-set size, exact shrinkage computation, error bars, or full protocol. This directly affects verification of whether the gains are robust or potentially sensitive to calibration-set sampling noise, as highlighted by the weakest-assumption analysis.

    Authors: We acknowledge that these implementation details are required for independent verification. In the revised Experiments section we will report the calibration-set size (1024 randomly sampled training images), the exact shrinkage formula (including the 5% median threshold), standard deviations over three independent calibration draws with different seeds, and the complete forward-pass protocol. Internal re-runs confirm that the 55.6% result remains within ±1.2% across these draws and that the ablation gap between naive channel-wise matching and ASR persists. revision: yes
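
The protocol described in this simulated response (a fixed-size random calibration subset, repeated over several seeds, with a standard deviation reported) could be sketched as follows. The function names and the `evaluate` callable are hypothetical; `evaluate` stands in for "repair the pruned model with these calibration indices and return top-1 accuracy", which is not implemented here.

```python
import numpy as np

def draw_calibration_indices(n_train, size=1024, seed=0):
    """Sample `size` distinct training-set indices for one calibration draw."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_train, size=size, replace=False)

def summarize_over_draws(evaluate, n_train, seeds=(0, 1, 2), size=1024):
    """Run `evaluate` once per seed and return mean and sample std of the results.

    evaluate: hypothetical callable mapping an index array to an accuracy.
    """
    accs = np.array(
        [evaluate(draw_calibration_indices(n_train, size, s)) for s in seeds]
    )
    return accs.mean(), accs.std(ddof=1)
```

Fixing the seed per draw makes each calibration subset reproducible, so a reported spread can be re-checked against the same subsets.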

Circularity Check

0 steps flagged

No significant circularity; ASR repair remains independent of fitted inputs

full rationale

The paper presents ASR as a training-free procedure that performs forward passes over a small calibration set to compute per-channel variance-matching corrections and then applies a data-driven shrinkage rule to stabilize them. The reported accuracy gains (55.6% vs. 41.0% layer-wise on ResNet-50/CIFAR-10 at 90% sparsity) are measured outcomes after applying the method, not quantities that reduce by the paper's own equations to parameters fitted inside the same experiment. No self-definitional loop, fitted-input-called-prediction, or load-bearing self-citation is exhibited in the derivation chain; the central claim therefore retains independent empirical content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that damage after global magnitude pruning is channel-granular and that variance matching plus shrinkage can be estimated reliably from a small calibration set without introducing new fitted constants that would require validation.

free parameters (1)
  • shrinkage rule threshold or scaling factor
    The data-driven shrinkage rule must involve at least one tunable or data-derived parameter to decide how strongly to suppress corrections on weak channels.
axioms (1)
  • domain assumption: Post-pruning signal damage occurs primarily at the granularity of individual output channels within a layer
    This premise justifies moving from layer-wise to channel-wise repair and is invoked to explain why global corrections fail.

pith-pipeline@v0.9.0 · 5802 in / 1318 out tokens · 35061 ms · 2026-05-21T04:57:47.690512+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    A natural moment-matching objective is therefore to choose γ_i so that the repaired pruned variance is close to the corresponding dense variance... γ⋆_i = sqrt(v_d,i / v_p,i). ... s_i = v_p,i / (v_p,i + λ) ... γ_i = s_i γ̂_i + (1−s_i)·1

  • IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Channels whose variance has collapsed toward zero receive a correction close to the identity, while channels with healthier variance retain a stronger correction.
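
The two quoted passages can be connected with a short worked derivation. It assumes λ > 0 and reads γ̂_i as the calibration-set estimate of the variance-matching gain γ⋆_i; both readings are assumptions, since the quoted excerpt elides the surrounding text.

```latex
\begin{align*}
\hat{\gamma}_i &= \sqrt{\frac{v_{d,i}}{v_{p,i}}}, \qquad
s_i = \frac{v_{p,i}}{v_{p,i} + \lambda}, \\
\gamma_i &= s_i \hat{\gamma}_i + (1 - s_i)\cdot 1
          = 1 + s_i\,(\hat{\gamma}_i - 1)
          = 1 + \frac{\sqrt{v_{d,i}\, v_{p,i}} - v_{p,i}}{v_{p,i} + \lambda}, \\
\lim_{v_{p,i} \to 0^{+}} \gamma_i &= 1 + \frac{0}{\lambda} = 1
  \quad \text{(collapsed channel: correction near the identity)}, \\
v_{p,i} \gg \lambda \;&\Rightarrow\; s_i \to 1, \quad \gamma_i \to \hat{\gamma}_i
  \quad \text{(healthy channel: full variance-matching correction)}.
\end{align*}
```

Although the naive gain γ̂_i diverges as v_{p,i} approaches zero, the product s_i γ̂_i vanishes, which is why the shrunk gain remains bounded for collapsed channels.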

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
