STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

Adi Shamir; Odelia Melamed; Shir Maon

arXiv: 2606.01126 · v1 · pith:GAZURJBZnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-06-28 17:23 UTC · model grok-4.3

keywords: pruning, accuracy recovery, internal state alignment, vision transformers, model compression, unlabeled calibration, neural network healing, DeiT

The pith

Pruned neural networks recover most accuracy by aligning internal states with the original model on a tiny unlabeled set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pruning removes weights from neural networks to speed inference but typically reduces accuracy, so a healing step is needed afterward. STARFISH performs this healing by optimizing the pruned model so its internal activations match those of the dense original network, using only a small collection of unlabeled examples. The paper shows this produces higher recovered accuracy than prior methods, with gains up to 22 percent at 50 percent pruning on vision transformers and 82 percent recovery of original accuracy at 75 percent pruning on a DeiT-B model for ImageNet. A sympathetic reader would care because the approach avoids labeled data and full retraining, which lowers the cost of deploying large models after compression.
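To make the pruning step concrete, here is a minimal sketch of unstructured magnitude pruning on a single weight matrix. Magnitude is only one common pruning criterion and is an assumption here; the summary above does not say which criterion each experiment uses, and the function name `magnitude_prune` is hypothetical.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> tuple[np.ndarray, np.ndarray]:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude.

    Returns (pruned_weights, mask), where mask is 1.0 for kept weights.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=weights.dtype)
    if n_prune > 0:
        # Indices of the n_prune entries with the smallest absolute value.
        prune_idx = np.argsort(np.abs(weights).ravel(), kind="stable")[:n_prune]
        mask[prune_idx] = 0.0
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

# Toy weight matrix standing in for one layer of a network.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W_pruned, mask = magnitude_prune(W, sparsity=0.75)
print(f"kept {int(mask.sum())} of {mask.size} weights")
```

At 75 percent sparsity this toy layer keeps 32 of its 128 weights. The returned mask is what a healing step would hold fixed while it adjusts the surviving weights.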

Core claim

The paper claims that optimizing the pruned network to align its internal state representations with those of the original network using a tiny unlabeled calibration set recovers substantially more accuracy than existing healing techniques. On ViT-based networks this yields up to 22 percent better recovered accuracy after 50 percent weight removal; after 75 percent removal in a DeiT-B network for ImageNet it reaches 82 percent of the dense model's accuracy with only 0.4 percent of the training images as calibration data while competing methods reach only 40 percent.

What carries the argument

Internal state alignment optimization that minimizes differences in activations between the pruned and original networks on unlabeled examples.
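The idea can be illustrated on a toy problem. The sketch below heals one pruned linear layer by gradient descent so that its activations on unlabeled calibration inputs match those of the dense layer. The squared-error distance, the single-layer setup, the step-size rule, and the names `alignment_loss` and `heal` are all assumptions made for illustration; the summary does not specify the loss or which layers the method aligns.

```python
import numpy as np

def alignment_loss(W, mask, X, H_dense):
    """Mean squared distance between pruned and dense activations."""
    H = X @ (W * mask).T
    return float(np.sum((H - H_dense) ** 2) / X.shape[0])

def heal(W_dense, mask, X, steps=200):
    """Gradient descent on the surviving weights only; no labels are used."""
    n = X.shape[0]
    H_dense = X @ W_dense.T          # target activations from the dense layer
    W = W_dense * mask               # start from the pruned weights
    # Step size 1/L, where L bounds the curvature of this quadratic loss.
    lr = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(steps):
        residual = X @ W.T - H_dense
        grad = (2.0 / n) * residual.T @ X
        W -= lr * grad * mask        # pruned positions stay exactly zero
    return W

rng = np.random.default_rng(0)
X_cal = rng.normal(size=(64, 32))    # unlabeled calibration inputs (toy data)
W_dense = rng.normal(size=(16, 32))
# Keep the larger-magnitude half of the weights (50% sparsity).
mask = (np.abs(W_dense) >= np.median(np.abs(W_dense))).astype(float)

H_target = X_cal @ W_dense.T
loss_before = alignment_loss(W_dense, mask, X_cal, H_target)
W_healed = heal(W_dense, mask, X_cal)
loss_after = alignment_loss(W_healed, mask, X_cal, H_target)
print(f"alignment loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The sparsity pattern is preserved because every update is multiplied by the mask. A real network would need the activations of the chosen internal layers captured from both models, which this toy version sidesteps.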

If this is right

At 50 percent pruning the method improves recovered accuracy by up to 22 percent over state-of-the-art healing on ViT networks.
At 75 percent pruning on DeiT-B it recovers 82 percent of original accuracy using only 0.4 percent of training images as calibration.
Healing succeeds without labeled data or complete retraining of the model.
The advantage grows with more aggressive pruning ratios.
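The two headline quantities are simple ratios, and a short worked example shows their scale. The sketch assumes the calibration fraction is taken over the ImageNet-1k (ILSVRC-2012) training split of 1,281,167 images, and it reads "recovers 82 percent of original accuracy" as healed accuracy divided by dense accuracy. The dense accuracy of 80.0 is a hypothetical placeholder, not a number from the paper.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training split (assumed dataset)

def calibration_set_size(fraction: float, n_train: int = IMAGENET_1K_TRAIN_IMAGES) -> int:
    """Number of images in a calibration set that is `fraction` of the training set."""
    return round(fraction * n_train)

def recovered_fraction(healed_acc: float, dense_acc: float) -> float:
    """Share of the dense model's accuracy that the healed model retains."""
    return healed_acc / dense_acc

n_cal = calibration_set_size(0.004)
print(f"0.4% of the training set is about {n_cal} images")

# Hypothetical dense accuracy, chosen only to show what the ratios mean.
dense_acc = 80.0
for label, share in [("82% recovery", 0.82), ("40% recovery", 0.40)]:
    print(f"{label}: top-1 of {share * dense_acc:.1f} against a dense {dense_acc:.1f}")
```

Under these assumptions the calibration set holds about 5,125 images, and the gap between 82 and 40 percent recovery corresponds to roughly a factor of two in absolute top-1 accuracy.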

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The alignment approach might extend to compression methods other than unstructured pruning.
Internal representations appear to encode enough task information that matching them substitutes for label-driven fine-tuning.
The small calibration requirement suggests possible on-device adaptation using private user data without sharing labels.

Load-bearing premise

That optimizing the pruned network to align its internal state representations with those of the original network using only a tiny unlabeled calibration set will reliably recover accuracy without requiring labeled data or full retraining.

What would settle it

If STARFISH applied to the 75-percent-pruned DeiT-B model on ImageNet with 0.4 percent unlabeled calibration data fails to exceed the 40 percent recovery level of competing methods, the superiority claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.01126 by Adi Shamir, Odelia Melamed, Shir Maon.

**Figure 2.** Visualization of the representation computation and alignment of STARFISH, where …

**Figure 3.** Empirical visualization of the representation-based bound for …

**Figure 4.** Comparison of top-1 accuracy results of unstructured pruning and recovery methods on MobileNetV1. STARFISH recovery exceeds other methods, with a larger margin at high sparsity levels. In this section, we present STARFISH in severe pruning regimes, where recovery is most challenging and the differences between recovery methods are most pronounced. We compare STARFISH across sparsity levels on both Mobi…

**Figure 5.** Although this bound still upper-bounds the empirical KL divergence, it is much looser than …

**Figure 6.** Visualization of the data-dependent constants in the local KL bound compared to their …

**Figure 7.** Comparison of the top-1 accuracy results after the STARFISH and fine-tuning healing …

**Figure 4.** ∆ denotes the change in top-1 accuracy relative to the corresponding dense baseline.

| Method | Dense Acc. | Acc. (0.5) | ∆ (0.5) | Acc. (0.6) | ∆ (0.6) | Acc. (0.7) | ∆ (0.7) | Acc. (0.8) | ∆ (0.8) |
|---|---|---|---|---|---|---|---|---|---|
| MP | 71.95 | 63.83 | −8.12 | 47.15 | −24.80 | 11.68 | −60.27 | 0.43 | −71.52 |
| WF | 71.95 | 68.91 | −3.04 | 60.90 | −11.05 | 29.36 | −42.59 | 0.24 | −71.71 |
| CBS | 71.95 | 70.21 | −1.74 | 66.37 | −5.58 | 55.11 | −16.84 | 16.38 | −55.57 |
| CHITA | 71.95 | 70.42 | −1.53 | 67.30 | −4.65 | 59.40 | −12.55 | … | … |
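The ∆ values in this excerpt can be checked directly from the accuracies: each ∆ is the pruned accuracy minus the dense accuracy of 71.95. The sketch below recomputes them for the three rows that are complete in the excerpt; the CHITA row is left out because it is truncated in the source. The helper name `deltas` is hypothetical.

```python
DENSE_ACC = 71.95
SPARSITIES = [0.5, 0.6, 0.7, 0.8]

# Top-1 accuracies copied from the complete rows of the excerpt above.
ROWS = {
    "MP":  [63.83, 47.15, 11.68, 0.43],
    "WF":  [68.91, 60.90, 29.36, 0.24],
    "CBS": [70.21, 66.37, 55.11, 16.38],
}

def deltas(accuracies, dense=DENSE_ACC):
    """Change in top-1 accuracy relative to the dense baseline, to two decimals."""
    return [round(acc - dense, 2) for acc in accuracies]

for method, accuracies in ROWS.items():
    per_sparsity = dict(zip(SPARSITIES, deltas(accuracies)))
    print(method, per_sparsity)
```

The recomputed values agree with the ∆ columns in the excerpt, which confirms that ∆ is a difference in percentage points rather than a relative change.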

Original abstract

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STARFISH claims large accuracy recovery gains on pruned ViTs by aligning internal states to the dense model on a 0.4% unlabeled calibration set, but the abstract alone leaves the optimization details and generalization unverified.

The core claim is that STARFISH recovers accuracy after pruning by optimizing the pruned network to match the original network's internal activations on a tiny unlabeled calibration set. On ViT models at 50% pruning it improves recovered accuracy by up to 22% over prior methods, and at 75% pruning on DeiT-B it reaches 82% of dense accuracy while competitors reach only 40%, all with 0.4% of ImageNet images.

What stands out is the focus on internal-state alignment rather than output matching or labeled fine-tuning. This is a clean idea for post-pruning recovery and targets a real deployment need for large transformers under aggressive compression.

The main limitation is that everything rests on the abstract. There are no equations for the alignment loss, no description of which layers are matched, no information on calibration-set selection, and no ablations or error bars. The stress-test concern about the tiny set being unrepresentative or the optimization drifting without label supervision is therefore still open; the abstract gives no evidence that the reported gap survives distribution shift or different random subsets.

The work is aimed at people doing model compression and efficient inference on vision transformers. Anyone running pruning experiments on DeiT or similar models would want to see the full experimental section. The quantitative claims are specific enough that a serious referee should check the setup, the calibration procedure, and whether the gains hold under standard controls.

I would send it to peer review so the methods and results can be examined directly.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes STARFISH, a pruning healing technique that recovers accuracy in pruned networks by aligning their internal state representations to those of the original dense model using a tiny unlabeled calibration set. It claims superior performance over state-of-the-art methods, with up to 22% better accuracy recovery for 50% pruning on ViT networks and, notably, 82% recovery of dense accuracy after 75% pruning on DeiT-B for ImageNet using only 0.4% of training images, compared to 40% for competitors.

Significance. Should the empirical results be reproducible and generalizable, the method offers a promising direction for efficient post-pruning recovery in large vision transformers, potentially lowering the barrier for deploying compressed models by minimizing data requirements and eliminating the need for labeled data in the healing phase. The internal state healing approach, if shown to work without overfitting to the calibration set, represents a practical advance in the field of model compression.

major comments (2)

[Abstract] The central claims regarding accuracy recovery percentages (e.g., 82% vs. 40% after 75% pruning) are stated without any accompanying description of the optimization objective, the specific internal states being aligned, the size and selection of the calibration set, or experimental protocols including number of runs and variance, which are necessary to evaluate if the results support the claims.

[Abstract] The assumption that matching internal activations on an unlabeled 0.4% subset of ImageNet will ensure accurate outputs on the full test distribution is load-bearing but unsupported in the provided text; no evidence or analysis is given to address potential issues with unrepresentative calibration data or lack of output-level supervision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on our manuscript. We address each major comment point by point below and will revise the abstract to improve clarity where appropriate.

Referee: [Abstract] The central claims regarding accuracy recovery percentages (e.g., 82% vs. 40% after 75% pruning) are stated without any accompanying description of the optimization objective, the specific internal states being aligned, the size and selection of the calibration set, or experimental protocols including number of runs and variance, which are necessary to evaluate if the results support the claims.

Authors: We agree that the abstract is highly condensed and could better contextualize the claims for readers. The optimization objective (minimizing distance between internal activations), the specific states aligned, calibration set details, and experimental protocols (including runs and variance) are fully described in Sections 3 and 4. In the revised version we will add a concise clause to the abstract referencing these elements without exceeding length limits. revision: yes

Referee: [Abstract] The assumption that matching internal activations on an unlabeled 0.4% subset of ImageNet will ensure accurate outputs on the full test distribution is load-bearing but unsupported in the provided text; no evidence or analysis is given to address potential issues with unrepresentative calibration data or lack of output-level supervision.

Authors: The full manuscript provides empirical support for this assumption through generalization results across multiple architectures and pruning ratios (Section 4), plus robustness analysis to calibration set choice (Section 4.2). The lack of output-level supervision is intentional and validated by the observed correlation between internal alignment and test accuracy. We will add a brief supporting sentence to the abstract or introduction in revision. revision: partial

Circularity Check

0 steps flagged

No derivation chain; purely empirical method

The paper proposes STARFISH as an optimization-based healing procedure that aligns internal activations of a pruned network to those of the dense model on a small unlabeled calibration set. No equations, fitted parameters, uniqueness theorems, or self-citations are presented as load-bearing steps in any derivation. All reported gains (e.g., 82% vs. 40% recovery) are experimental outcomes on ImageNet/ViT models, not quantities that reduce to the method's own inputs by construction. The approach is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

40 extracted references · 14 canonical work pages · 9 internal anchors

[1] Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, and Rahul Mazumder. Fast as chita: Neural network pruning with combinatorial optimization. In International Conference on Machine Learning, pages 2031–2049. PMLR, 2023.

[2] Shaowu Chen, Wei Ma, Binhua Huang, Qingyuan Wang, Guoxin Wang, Weize Sun, Lei Huang, and Deepu John. Optimal brain connection: Towards efficient structural pruning. arXiv preprint arXiv:2508.05521, 2025.

[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[6] Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. Self-supervised representation learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine, 39(3):42–62, 2022.

[7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.

[8] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.

[9] Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023.

[10] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

[11] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 2015.

[12] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, volume 5, 1992.

[13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[14] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[15] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.

[16] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International conference on machine learning, pages 5544–5555. PMLR, 2020.

[17] Denis Kuznedelev, Eldar Kurtić, Elias Frantar, and Dan Alistarh. Cap: Correlation-aware pruning for highly-accurate sparse vision models. Advances in Neural Information Processing Systems, 36:28805–28831, 2023.

[18] Woosuk Kwon, Sehoon Kim, Michael W Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35:24101–24116, 2022.

[19] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in Neural Information Processing Systems, 2:598–605, 1989.

[20] Ryan Lucas and Rahul Mazumder. Preserving deep representations in one-shot pruning: A hessian-free second-order optimization framework. arXiv preprint arXiv:2411.18376, 2024.

[21] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.

[22] Xiang Meng, Wenyu Chen, Riade Benbaki, and Rahul Mazumder. Falcon: Flop-aware combinatorial optimization for neural network pruning. In International Conference on Artificial Intelligence and Statistics, pages 4384–4392. PMLR, 2024.

[23] Pravin Nair. Softmax is 1/2-lipschitz: A tight bound across all ℓp norms. arXiv preprint arXiv:2510.23012, 2025.

[24] Keiron O’Shea and Ryan Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.

[25] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11893–11902, 2020.

[26] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.

[27] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets, 2015. URL https://arxiv.org/abs/1412.6550

[28] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[29] Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression. Advances in Neural Information Processing Systems, 33:18098–18109, 2020.

[30] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020.

[31] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[32] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In European Conference on Computer Vision, pages 516–533. Springer, 2022.

[33] Xin Yu, Thiago Serra, Srikumar Ramalingam, and Shandian Zhe. The combinatorial brain surgeon: Pruning weights that cancel one another in neural networks. In International Conference on Machine Learning, pages 25668–25683. PMLR, 2022.

[34] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.

[35] Boxiang Zhang and Baijian Yang. Corp: Closed-form one-shot representation-preserving structured pruning for vision transformers. arXiv preprint arXiv:2602.05243, 2026.

[36] Chuanyang Zheng, Kai Zhang, Zhi Yang, Wenming Tan, Jun Xiao, Ye Ren, Shiliang Pu, et al. Savit: Structure-aware vision transformer pruning via collaborative optimization. Advances in Neural Information Processing Systems, 35:9010–9023, 2022.

[37] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

A Representation alignment

A.1 Representation bound for KL divergence

We state the definitions and the theorem from Section 4. For an input $x_i \in S_{\mathrm{cal}}$, we denote by $h_i, \tilde{h}_i \in \mathbb{R}^{d_{\mathrm{out}}}$ the last hidden representat...
[38]

But its Jacobian is the Hessian of the log-sum-exp function, thus for every $t \in [0,1]$,

$$\Delta z_i^\top \nabla^2 \psi(z_i + t\,\Delta z_i)\,\Delta z_i \le \tfrac{1}{2}\,\|\Delta z_i\|_2^2 .$$

Hence $\zeta_i \le \tfrac{1}{2}$, and consequently

$$\int_0^1 (1-t)\,\Delta z_i^\top \nabla^2 \psi(z_i + t\,\Delta z_i)\,\Delta z_i \, dt \le \tfrac{1}{4}\,\|\Delta z_i\|_2^2 .$$

From (4) we get

$$\mathrm{KL}(p_i \,\|\, \tilde{p}_i) \le \tfrac{1}{2}\,\zeta_i\,\|\Delta z_i\|_2^2 \le \tfrac{1}{4}\,\|\Delta z_i\|_2^2 . \tag{6}$$

It remains to relate the logit error to representation and head recovery. Since...
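The bound in (6) can be sanity-checked numerically. The sketch below draws random logits and perturbations, computes the KL divergence between the two softmax distributions, and reports the largest observed ratio of the KL divergence to a quarter of the squared perturbation norm. Reading $p_i$ and $\tilde{p}_i$ as the softmax of the original and perturbed logits is an assumption based on the log-sum-exp reference in the excerpt; the sampling scales and the function names are illustrative choices.

```python
import math
import random

def log_sum_exp(z):
    """Numerically stable log of the sum of exponentials."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def softmax(z):
    lse = log_sum_exp(z)
    return [math.exp(v - lse) for v in z]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def worst_ratio(num_classes=10, trials=1000, seed=0):
    """Largest observed KL / (0.25 * ||dz||^2) over random logit perturbations."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        z = [rng.gauss(0.0, 3.0) for _ in range(num_classes)]
        dz = [rng.gauss(0.0, 1.0) for _ in range(num_classes)]
        p = softmax(z)
        p_tilde = softmax([a + b for a, b in zip(z, dz)])
        bound = 0.25 * sum(d * d for d in dz)
        worst = max(worst, kl(p, p_tilde) / bound)
    return worst

ratio = worst_ratio()
print(f"largest KL / bound ratio observed: {ratio:.4f}")
```

A ratio that stays at or below 1 is consistent with the inequality. A random search like this cannot prove the bound, but a value above 1 would indicate an error in the reconstruction or in the reading of the symbols.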
[39]

Substituting these worst-case constants yields the global bound shown in Figure 5. Although this bound still upper-bounds the empirical KL divergence, it is much looser than the local, batch-dependent bound. Figure 6 visualizes the empirical distributions of $M_i$ and $\zeta_i$ over the calibration examples and compares them to their corresponding global upper boun...

[40]

In both cases, the realized data-dependent quantities are substantially smaller than their worst-case bounds, explaining why the local batch-dependent KL bound in Figure 5 is much tighter than the corresponding global bound.

A.2 Recovery vs. fine-tuning: further experiment and details

In Section 4.1, we compared STARFISH to standard output-level recovery ...

[1] [1]

Fast as chita: Neural network pruning with combinatorial optimization

Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, and Rahul Mazumder. Fast as chita: Neural network pruning with combinatorial optimization. InInternational Conference on Machine Learning, pages 2031–2049. PMLR, 2023

2031

[2] [2]

Optimal brain connection: Towards efficient structural pruning.arXiv preprint arXiv:2508.05521, 2025

Shaowu Chen, Wei Ma, Binhua Huang, Qingyuan Wang, Guoxin Wang, Weize Sun, Lei Huang, and Deepu John. Optimal brain connection: Towards efficient structural pruning.arXiv preprint arXiv:2508.05521, 2025

work page arXiv 2025

[3] [3]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020

2020

[4] [4]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

2009

[5] [5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[6] [6]

Self-supervised representation learning: Introduction, advances, and challenges.IEEE Signal Processing Magazine, 39(3):42–62, 2022

Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. Self-supervised representation learning: Introduction, advances, and challenges.IEEE Signal Processing Magazine, 39(3):42–62, 2022

2022

[7] [7]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks.arXiv preprint arXiv:1803.03635, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Stabilizing the lottery ticket hypothesis,

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis.arXiv preprint arXiv:1903.01611, 2019

work page arXiv 1903

[9] [9]

Sparsegpt: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. InInternational Conference on Machine Learning, pages 10323–10337. PMLR, 2023

2023

[10] [10]

Bootstrap your own latent-a new approach to self-supervised learning.Advances in Neural Information Processing Systems, 33:21271–21284, 2020

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in Neural Information Processing Systems, 33:21271–21284, 2020

2020

[11] [11]

Learning both weights and connections for efficient neural network.Advances in Neural Information Processing Systems, 28, 2015

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network.Advances in Neural Information Processing Systems, 28, 2015

2015

[12] [12]

Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. InAdvances in Neural Information Processing Systems, volume 5, 1992

1992

[13] [13]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[14] [14]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InInternational Conference on Machine Learning, pages 3519–3529. PMLR, 2019

2019

[16] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, pages 5544–5555. PMLR, 2020.

[17] Denis Kuznedelev, Eldar Kurtić, Elias Frantar, and Dan Alistarh. Cap: Correlation-aware pruning for highly-accurate sparse vision models. Advances in Neural Information Processing Systems, 36:28805–28831, 2023.

[18] Woosuk Kwon, Sehoon Kim, Michael W Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35:24101–24116, 2022.

[19] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in Neural Information Processing Systems, 2:598–605, 1989.

[20] Ryan Lucas and Rahul Mazumder. Preserving deep representations in one-shot pruning: A hessian-free second-order optimization framework. arXiv preprint arXiv:2411.18376, 2024.

[21] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.

[22] Xiang Meng, Wenyu Chen, Riade Benbaki, and Rahul Mazumder. Falcon: Flop-aware combinatorial optimization for neural network pruning. In International Conference on Artificial Intelligence and Statistics, pages 4384–4392. PMLR, 2024.

[23] Pravin Nair. Softmax is 1/2-lipschitz: A tight bound across all ℓp norms. arXiv preprint arXiv:2510.23012, 2025.

[24] Keiron O’Shea and Ryan Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.

[25] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020.

[26] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.

[27] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets, 2015. URL https://arxiv.org/abs/1412.6550

[28] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[29] Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression. Advances in Neural Information Processing Systems, 33:18098–18109, 2020.

[30] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020.

[31] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[32] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In European Conference on Computer Vision, pages 516–533. Springer, 2022.

[33] Xin Yu, Thiago Serra, Srikumar Ramalingam, and Shandian Zhe. The combinatorial brain surgeon: Pruning weights that cancel one another in neural networks. In International Conference on Machine Learning, pages 25668–25683. PMLR, 2022.

[34] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.

[35] Boxiang Zhang and Baijian Yang. Corp: Closed-form one-shot representation-preserving structured pruning for vision transformers. arXiv preprint arXiv:2602.05243, 2026.

[36] Chuanyang Zheng, Kai Zhang, Zhi Yang, Wenming Tan, Jun Xiao, Ye Ren, Shiliang Pu, et al. Savit: Structure-aware vision transformer pruning via collaborative optimization. Advances in Neural Information Processing Systems, 35:9010–9023, 2022.

[37] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

A Representation alignment

A.1 Representation bound for KL divergence

We state the definitions and the theorem from Section 4. For an input $x_i \in S_{\mathrm{cal}}$, we denote by $h_i, \tilde{h}_i \in \mathbb{R}^{d_{\mathrm{out}}}$ the last hidden representat...

But its Jacobian is the Hessian of the log-sum-exp function, thus for every $t \in [0,1]$,

$$\Delta z_i^\top \nabla^2 \psi(z_i + t\,\Delta z_i)\,\Delta z_i \le \tfrac{1}{2}\,\|\Delta z_i\|_2^2.$$

Hence $\zeta_i \le \tfrac{1}{2}$, and consequently

$$\int_0^1 (1-t)\,\Delta z_i^\top \nabla^2 \psi(z_i + t\,\Delta z_i)\,\Delta z_i \, dt \le \tfrac{1}{4}\,\|\Delta z_i\|_2^2.$$

From (4) we get

$$\mathrm{KL}(p_i \,\|\, \tilde{p}_i) \le \tfrac{1}{2}\,\zeta_i\,\|\Delta z_i\|_2^2 \le \tfrac{1}{4}\,\|\Delta z_i\|_2^2. \tag{6}$$

It remains to relate the logit error to representation and head recovery. Since...
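As a numerical sanity check of the worst-case bound in (6), the following minimal sketch draws random logits and perturbations and compares the KL divergence between the two softmax distributions against one quarter of the squared perturbation norm. It assumes that $p_i$ and $\tilde{p}_i$ are the softmax outputs of $z_i$ and $z_i + \Delta z_i$; the function names, sample sizes, and noise scales are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def kl_from_logits(z, z_tilde):
    """KL(softmax(z) || softmax(z_tilde))."""
    log_p = log_softmax(z)
    log_q = log_softmax(z_tilde)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def check_bound(num_examples=1000, num_classes=10, seed=0):
    """Count violations of KL <= 0.25 * ||dz||^2 on random inputs.

    All sizes and scales here are illustrative assumptions.
    """
    rng = random.Random(seed)
    violations = 0
    worst_ratio = 0.0
    for _ in range(num_examples):
        z = [rng.gauss(0.0, 3.0) for _ in range(num_classes)]
        dz = [rng.gauss(0.0, 0.5) for _ in range(num_classes)]
        z_tilde = [a + b for a, b in zip(z, dz)]
        kl = kl_from_logits(z, z_tilde)
        bound = 0.25 * sum(d * d for d in dz)
        if kl > bound + 1e-12:  # small tolerance for rounding error
            violations += 1
        worst_ratio = max(worst_ratio, kl / bound)
    return violations, worst_ratio


violations, worst_ratio = check_bound()
print("violations:", violations)
print("largest KL / bound ratio:", worst_ratio)
```

The ratio printed at the end indicates how much slack the worst-case constant leaves on these random draws; it says nothing about the slack on the paper's actual calibration data.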

Substituting these worst-case constants yields the global bound shown in Figure 5. Although this bound still upper-bounds the empirical KL divergence, it is much looser than the local, batch-dependent bound. Figure 6 visualizes the empirical distributions of $M_i$ and $\zeta_i$ over the calibration examples and compares them to their corresponding global upper boun...

In both cases, the realized data-dependent quantities are substantially smaller than their worst-case bounds, explaining why the local batch-dependent KL bound in Figure 5 is much tighter than the corresponding global bound.

A.2 Recovery vs. fine-tuning: further experiment and details

In Section 4.1, we compared STARFISH to standard output-level recovery ...