Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity

Junhyuk Jo; Sunwoo Lee; Vincent-Daniel Yun

arxiv: 2511.14282 · v2 · pith:LZL23LLCnew · submitted 2025-11-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 19:13 UTC · model grok-4.3

keywords: weight pruning, regularization, model compression, neural network sparsity, high sparsity, pruning robustness, one-shot pruning

The pith

A weight concentration regularizer during training lets magnitude pruning remove mostly negligible parameters even at high sparsity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-time regularizer that pushes most weights toward zero while boosting the magnitude of a small subset, creating a distribution suited for simple magnitude-based pruning. This addresses the problem that standard training and existing regularizers like L1 or DeepHoyer leave too many medium-sized weights whose removal hurts accuracy under aggressive one-shot pruning. If the regularizer succeeds, large models can be compressed for resource-limited devices with smaller accuracy drops across tasks like image classification, medical segmentation, and LLM fine-tuning. The work also shows the method pairs with existing pruning-robust optimizers and includes a convergence analysis.

Core claim

The central claim is that the Weight Concentration Regularizer (WCR) amplifies the magnitude of a small subset of parameters while driving the remainder toward zero during training. This produces a weight distribution in which magnitude pruning predominantly removes parameters with negligible functional contribution, yielding consistent improvements in pruning robustness on LLM fine-tuning, image classification, and medical segmentation tasks across multiple architectures.

What carries the argument

The Weight Concentration Regularizer (WCR), a training-time loss term that amplifies magnitudes of a few parameters and shrinks the rest toward zero to concentrate functional contribution.

If this is right

Models trained with WCR retain higher accuracy after high-sparsity magnitude pruning than those trained with standard loss or L1 regularization.
The approach applies to LLM fine-tuning, vision classification, and medical image segmentation across different network architectures.
WCR combines with pruning-robust optimizers such as SAM without conflict.
Convergence analysis supports stable training under the added regularizer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If concentration works as described, pruned models may require less or no post-pruning fine-tuning to recover performance.
The same concentration principle might improve robustness for pruning criteria other than magnitude, such as gradient-based or Hessian-based selection.
Extending WCR-style terms to other compression stages like quantization could create more separable weight distributions.

Load-bearing premise

The method assumes that after WCR training a weight's magnitude directly reflects its functional importance, with little compensation or interaction among the surviving weights that would make removed parameters matter more than their size suggests.

What would settle it

Train a model with WCR, apply magnitude pruning at 90 percent sparsity or higher, and measure whether the pruned model's accuracy or loss remains close to the dense baseline; a large drop relative to non-WCR training would indicate the concentration does not isolate negligible parameters.
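
The pruning step of that experiment can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the helper names `magnitude_prune` and `retained_energy`, the two synthetic "layers", and the use of retained squared norm as a stand-in for accuracy are all assumptions made here for illustration.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = round(len(weights) * sparsity)
    # Indices sorted by ascending |w|; the first n_prune entries are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def retained_energy(weights, pruned):
    """Fraction of the squared L2 norm that survives pruning."""
    total = sum(w * w for w in weights)
    kept = sum(w * w for w in pruned)
    return kept / total if total > 0 else 1.0


rng = random.Random(0)
# Hypothetical "diffuse" layer: many medium-sized weights.
diffuse = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# Hypothetical "concentrated" layer: 10% large weights, 90% near zero.
concentrated = [
    rng.gauss(0.0, 3.0) if i < 100 else rng.gauss(0.0, 0.01)
    for i in range(1000)
]

for name, layer in [("diffuse", diffuse), ("concentrated", concentrated)]:
    pruned = magnitude_prune(layer, sparsity=0.9)
    print(name, round(retained_energy(layer, pruned), 3))
```

Retained squared norm is only a proxy. The test described above needs the pruned model's accuracy or loss, since weight magnitude and functional importance are exactly the two quantities whose agreement is in question.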

Figures

Figures reproduced from arXiv: 2511.14282 by Junhyuk Jo, Sunwoo Lee, Vincent-Daniel Yun.

Figure 2: Qualitative segmentation results on the LGG MRI dataset using the ResNet-50-UNet architecture under …

Figure 3: Comparison of SGD and SAM with and without the proposed Variance Amplifying Regularizer (VAR) on …

Figure 4: Qualitative segmentation results on the LGG MRI dataset using the ResNet-50–UNet architecture under 85% …

Figure 5: Qualitative segmentation results on the LGG MRI dataset using the ResNet-50–UNet architecture under 85% …

Figure 6: PyTorch implementation of the proposed Variance Amplifying Regularizer (VAR). For each layer, the …

Original abstract

Deep neural networks achieve outstanding performance across vision and language tasks, yet their large parameter counts limit deployment in resource-constrained settings. One-shot pruning reduces model size without retraining, but models trained with standard objectives often suffer substantial accuracy drops under aggressive sparsity. Prior work mitigates this drop along two directions: regularizers such as $\ell_1$ and DeepHoyer that shape the weight distribution during training, and pruning-robust optimizers such as SAM, CrAM, and S$^2$SAM that flatten the loss landscape. However, existing regularizers either shrink all weights uniformly ($\ell_1$) or induce scale-invariant sparsity (DeepHoyer), without concentrating weight energy onto a small set of informative parameters. We propose a Weight Concentration Regularizer (WCR), a training-time regularizer that amplifies the magnitude of a small subset of parameters while driving the remainder toward zero, so that magnitude pruning predominantly removes parameters with negligible functional contribution. We provide a convergence analysis and evaluate WCR on LLM fine-tuning, image classification, and medical segmentation, demonstrating consistent improvements in pruning robustness across architectures and compatibility with existing pruning-robust optimizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WCR is a clear incremental regularizer that concentrates magnitudes for better one-shot pruning, but the evidence tying post-training magnitude to actual functional importance remains thin.

The main point is that this paper adds a Weight Concentration Regularizer (WCR) that pushes a small subset of weights to larger magnitudes while driving the rest toward zero, with the goal of making magnitude pruning remove less important parameters at high sparsity. It sits between uniform shrinkage like ℓ1 and scale-invariant approaches like DeepHoyer by actively amplifying select parameters during training. They also sketch a convergence result and test across LLM fine-tuning, image classification, and medical segmentation, plus note compatibility with SAM-style optimizers. That multi-domain coverage and the theoretical note are the parts that stand out as useful additions to the pruning-robustness literature. The formulation itself is straightforward and the motivation from current accuracy drops in one-shot pruning is direct.

The soft spot is the central assumption that magnitude after WCR training reliably signals functional contribution. The convergence sketch establishes concentration of the distribution but does not bound the loss change from zeroing the small-magnitude weights or rule out compensatory interactions among the surviving large ones. Without visible ablations on the concentration strength or checks for whether the gains survive when those interactions are present, the pruning robustness improvements could partly reflect the regularizer's side effects rather than a true improvement in importance ranking. The abstract claims consistent gains but gives no numbers, error bars, or detailed breakdowns, so the practical effect size is hard to judge from the summary.

This paper is aimed at people working on model compression for edge deployment or efficient LLM serving. A reader already experimenting with regularizers or robust optimizers would pick up the concrete formulation and the cross-task results. It deserves peer review because the idea is distinct enough from the cited baselines and the experiments touch relevant application areas, even though the functional-importance link will need tighter validation.

Referee Report

3 major / 2 minor

Summary. The paper proposes a Weight Concentration Regularizer (WCR) as a training-time objective term that amplifies magnitudes of a small subset of parameters while driving the remainder toward zero, with the goal of improving robustness of one-shot magnitude pruning at high sparsity. It claims this makes pruned weights have negligible functional contribution. The manuscript provides a convergence analysis of the regularizer and reports empirical results showing consistent gains on LLM fine-tuning, image classification, and medical image segmentation tasks, along with compatibility with pruning-robust optimizers such as SAM.

Significance. If the central claim holds, the work would offer a practical addition to the pruning literature by providing a regularizer that shapes the weight distribution more selectively than uniform-shrinkage methods like ℓ1 or scale-invariant approaches like DeepHoyer. The convergence analysis constitutes a positive theoretical contribution, and the multi-domain evaluation (LLMs, vision, medical) is a strength that supports broad applicability. The approach could complement existing optimizers without requiring architectural changes.

major comments (3)

[Convergence analysis] Convergence analysis section: The analysis establishes concentration of the weight distribution under WCR but does not derive or bound the loss difference incurred when low-magnitude weights are zeroed versus retained, nor does it address whether remaining large weights can functionally compensate for the pruned set. This gap directly affects support for the claim that post-WCR magnitude approximates functional importance.
[Experiments] Experimental evaluation (LLM fine-tuning and image classification results): Reported gains lack error bars, statistical significance tests, or ablations that isolate the WCR strength coefficient from other factors; without these, it is difficult to confirm that improvements stem from the concentration mechanism rather than incidental effects of the regularizer.
[Method] Definition of WCR (Eq. for the regularizer term): The regularizer is introduced as an independent objective term whose effect on functional contribution is not connected to any fitted quantity or prior result; this leaves the premise that magnitude pruning after WCR removes only negligible parameters as an untested assumption rather than a derived property.
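
The gap named in the first major comment can be made concrete with a standard inequality. The following is a sketch under an assumption introduced here for illustration, namely that the loss $L$ is $G$-Lipschitz in the weights; the symbols $G$ and the binary pruning mask $m \in \{0,1\}^d$ are not taken from the paper.

```latex
\[
\bigl| L(w \odot m) - L(w) \bigr|
\;\le\; G \,\lVert w \odot m - w \rVert_2
\;=\; G \Bigl( \sum_{i \,:\, m_i = 0} w_i^2 \Bigr)^{1/2}
\]
```

Under that assumption the loss change is controlled by the total squared magnitude of the pruned weights, which is the quantity a concentrated distribution keeps small. The inequality can be loose, and it does not by itself show that magnitude ranks functional importance correctly.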

minor comments (2)

[Abstract] Abstract: The phrase 'consistent improvements across architectures' would be clearer if it specified the sparsity ratios tested and the exact metrics (e.g., accuracy drop at 90% sparsity).
[Method] Notation: The strength coefficient of WCR is listed as a free parameter; its sensitivity should be shown in a dedicated plot or table for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your constructive review and for recognizing the potential contributions of our work on Weight Concentration Regularization. We appreciate the detailed feedback and will incorporate revisions to address the concerns raised. Below we respond point by point to each major comment.

Referee: [Convergence analysis] Convergence analysis section: The analysis establishes concentration of the weight distribution under WCR but does not derive or bound the loss difference incurred when low-magnitude weights are zeroed versus retained, nor does it address whether remaining large weights can functionally compensate for the pruned set. This gap directly affects support for the claim that post-WCR magnitude approximates functional importance.

Authors: We agree that a direct bound on the loss difference between the pruned and full models would strengthen the link between weight concentration and functional importance. The existing analysis proves convergence to a concentrated distribution, which implies that low-magnitude weights contribute negligibly; however, we will add a new subsection deriving an upper bound on the pruning-induced loss gap using the concentration property and standard Lipschitz assumptions on the loss. This revision will explicitly address compensation by the remaining large weights.
Revision: yes

Referee: [Experiments] Experimental evaluation (LLM fine-tuning and image classification results): Reported gains lack error bars, statistical significance tests, or ablations that isolate the WCR strength coefficient from other factors; without these, it is difficult to confirm that improvements stem from the concentration mechanism rather than incidental effects of the regularizer.

Authors: We acknowledge the value of statistical validation. In the revised manuscript we will report error bars over at least five random seeds for all main results and include paired t-tests or Wilcoxon tests to establish significance. We will also expand the ablation studies to vary only the WCR coefficient while fixing other hyperparameters, thereby isolating its contribution to the observed robustness gains.
Revision: yes

Referee: [Method] Definition of WCR (Eq. for the regularizer term): The regularizer is introduced as an independent objective term whose effect on functional contribution is not connected to any fitted quantity or prior result; this leaves the premise that magnitude pruning after WCR removes only negligible parameters as an untested assumption rather than a derived property.

Authors: The premise is supported by the multi-domain empirical results showing that post-WCR magnitude pruning preserves accuracy far better than baselines. To make the connection more explicit, we will revise the method section to relate WCR to sensitivity-based pruning criteria from prior work, showing how concentration aligns magnitude with functional impact. The core formulation remains unchanged.
Revision: partial
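
The per-seed comparison promised in the second response could take the following shape. This is a hedged sketch: the function name `paired_t_statistic` is hypothetical, and the two score lists are made-up placeholders that only exercise the arithmetic; they are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder accuracies for five seeds (illustrative values only).
baseline_scores = [70.0, 71.0, 69.0, 70.0, 72.0]
treated_scores = [72.0, 74.0, 70.0, 73.0, 73.0]

t_stat, dof = paired_t_statistic(baseline_scores, treated_scores)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The statistic is then compared against a t distribution with the returned degrees of freedom. Pairing by seed matters because it removes seed-to-seed variance that both methods share.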

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces a novel Weight Concentration Regularizer (WCR) as an explicit training objective term whose functional form and convergence analysis are presented independently of any pruning-specific fitted quantities or prior self-citations. The central claim—that WCR produces a weight distribution amenable to magnitude pruning—is evaluated through empirical results on multiple tasks rather than derived by construction from its own inputs. No load-bearing steps reduce to self-definition, renamed empirical patterns, or uniqueness theorems imported from the authors' prior work; the provided abstract and context contain no equations or claims that equate a prediction to a fitted parameter by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on a new regularizer term whose strength is controlled by at least one hyperparameter and on standard assumptions from optimization theory for the convergence guarantee; no new physical entities are postulated.

free parameters (1)

WCR strength coefficient
Controls how strongly the regularizer concentrates magnitude; must be chosen or tuned for each architecture and task.

axioms (1)

[standard math] Standard assumptions on loss smoothness and bounded gradients suffice for convergence of the regularized objective.
Invoked to support the stated convergence analysis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The variance amplifying penalty is then defined as $\psi(w) = \sum_{\ell} \frac{1}{\mathrm{Var}(\tilde{w}^{(\ell)}) + \epsilon}$ ... $L_{\mathrm{total}}(w) = L(w) + \lambda\,\psi(w)$"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We provide a convergence analysis ... $\beta_2$-smoothness of $\psi(w)$"
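
The penalty quoted in the first passage can be written out as a small function. This is a sketch under stated assumptions, not the authors' implementation: it reads ε as a stability constant inside the denominator, uses the population variance, and treats each layer's raw weights as $\tilde{w}^{(\ell)}$, since the quoted passage does not say how $\tilde{w}$ is obtained from $w$. The names `variance_penalty` and `total_loss` are hypothetical.

```python
import statistics


def variance_penalty(layers, eps=1e-8):
    """psi(w): sum over layers of 1 / (Var(layer weights) + eps)."""
    return sum(1.0 / (statistics.pvariance(layer) + eps) for layer in layers)


def total_loss(task_loss, layers, lam, eps=1e-8):
    """L_total(w) = L(w) + lam * psi(w), with lam the strength coefficient."""
    return task_loss + lam * variance_penalty(layers, eps)


# Two toy single-layer "models": tightly clustered weights vs. widely spread ones.
narrow = [[0.1, -0.1, 0.1, -0.1]]
wide = [[1.0, -1.0, 1.0, -1.0]]

print("narrow penalty:", variance_penalty(narrow))
print("wide penalty:  ", variance_penalty(wide))
print("total loss (wide, lam=0.5):", total_loss(2.0, wide, lam=0.5))
```

Under this reading the penalty shrinks as a layer's weights spread out, so minimizing it rewards larger per-layer variance. Whether that spread takes the form of a few large weights and many near-zero ones depends on the task loss and on the normalization behind $\tilde{w}$, neither of which is modeled here.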

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. In The Journal of Machine Learning Research, 2021

[2] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019

[3] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019

[4] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision, 2017

[5] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021

[6] Liang Chen, Xiaoling Li, and Xiaolong Hu. Cram: Sharpness-aware minimization for efficient model compression. In International Conference on Learning Representations, 2023

[7] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), volume 2, pages 598–605. Morgan Kaufmann, 1990

[8] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems (NeurIPS), 5:164–171, 1993

[9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015

[10] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017

[11] Zheyuan Gao, Yan Zhang, Wei Lu, Mingming Ma, Xiaolin Hu, and Ming-Ming Cheng. Bilevelpruning: Unified dynamic and static channel pruning for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18873–18883, 2024

[12] Miao Lu, Xiaolong Luo, Tianlong Chen, Wuyang Chen, Dong Liu, and Zhangyang Wang. Learning pruning-friendly networks via frank-wolfe: One-shot, any-sparsity, and no retraining. In International Conference on Learning Representations, 2022

[13] Clara Na, Sanket Vaibhav Mehta, and Emma Strubell. Train flat, then compress: Sharpness-aware minimization learns more compressible models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4909–4936, December 2022

[14] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019

[15] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020

[16] Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems, 2020

[17] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

[18] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep cnns? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018

[19] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005

[20] Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. In Arxiv Preprint, 2023

[21] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization, 23(4):2341–2368, 2013

[22] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184–7193. PMLR, 2019

[23] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

[24] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011

[25] Stanford CS231N. Tiny imagenet visual recognition challenge, 2015

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[27] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016

[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015

[30] Mateusz Buda, Ashirbani Saha, and Maciej A. Mazurowski. Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. Computers in Biology and Medicine, 2019

[31] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016
