pith. the verified trust layer for science.

arxiv: 2510.09696 · v3 · submitted 2025-10-09 · 💻 cs.LG · cs.AI

Vanishing Contributions: A Unified Framework for Smooth and Iterative Model Compression

Pith reviewed 2026-05-18 08:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords model compression · fine-tuning · pruning · quantization · iterative compression · deep neural networks · affine combination · accuracy preservation

The pith

A framework gradually blends outputs from original and compressed neural networks during fine-tuning to stabilize iterative model compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VCON as a way to compress deep neural networks more reliably by avoiding abrupt switches to a smaller version. Instead of replacing the original model outright, the approach runs both the full and compressed networks at the same time. Their outputs are mixed with weights that shift steadily from the original toward the compressed model. This slow handoff is intended to keep training stable and limit accuracy loss across different compression methods and tasks. A sympathetic reader would care because current compression techniques often force difficult trade-offs between size and performance that this blending might ease.

Core claim

VCON executes both the original uncompressed model and the compressed model in parallel during fine-tuning; the contribution of the original model is progressively reduced while that of the compressed model is gradually increased via an affine combination, improving stability and mitigating accuracy degradation. In most settings this yields accuracy gains exceeding 1% over baselines, with some configurations showing improvements above 15%.

What carries the argument

The affine combination of outputs from the original and compressed models, which gradually shifts emphasis from the uncompressed network to the compressed one throughout fine-tuning.
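
As a minimal sketch of this affine combination, the snippet below blends two output vectors with a weight alpha, following the form (1-α)·f_orig + α·f_comp quoted in the referee report. The function name `blend_outputs` and the toy output values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def blend_outputs(y_orig: Sequence[float], y_comp: Sequence[float], alpha: float) -> List[float]:
    """Element-wise affine combination (1 - alpha) * y_orig + alpha * y_comp."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(y_orig) != len(y_comp):
        raise ValueError("both outputs must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(y_orig, y_comp)]


# Toy output vectors, assumed purely for illustration.
y_orig = [2.0, -1.0, 0.5]
y_comp = [1.0, 0.0, 0.5]

print(blend_outputs(y_orig, y_comp, 0.0))  # original model only
print(blend_outputs(y_orig, y_comp, 0.5))  # equal contributions
print(blend_outputs(y_orig, y_comp, 1.0))  # compressed model only
```

At alpha = 0 the blend reproduces the original model's output, and at alpha = 1 only the compressed model contributes, which is the end state the framework aims for.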

If this is right

  • The same blending process works with pruning, quantization, and low-rank decomposition without needing separate iterative schedules for each.
  • Accuracy improves by more than 1 percent over post-shot and iterative baselines on computer vision and natural language processing benchmarks in most tested cases.
  • Some compression configurations reach accuracy gains above 15 percent when the gradual transition is used.
  • Training remains more stable because the network never experiences an instantaneous jump from full to compressed behavior.
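
The stability point can be made concrete with a short calculation. Write the blended output as y_α(x) = (1-α)·f_orig(x) + α·f_comp(x), using the notation from the referee report, and hold both networks' parameters fixed. That is a simplifying assumption made here for illustration; during actual fine-tuning the parameters also move. A change of the blending weight from α to α+Δα then shifts the output by:

```latex
\begin{aligned}
y_{\alpha+\Delta\alpha}(x) - y_{\alpha}(x)
  &= \bigl[(1-\alpha-\Delta\alpha)\, f_{\mathrm{orig}}(x) + (\alpha+\Delta\alpha)\, f_{\mathrm{comp}}(x)\bigr]
   - \bigl[(1-\alpha)\, f_{\mathrm{orig}}(x) + \alpha\, f_{\mathrm{comp}}(x)\bigr] \\
  &= \Delta\alpha \,\bigl(f_{\mathrm{comp}}(x) - f_{\mathrm{orig}}(x)\bigr).
\end{aligned}
```

Direct replacement corresponds to Δα = 1 in a single step, whereas a gradual schedule with N equal increments uses Δα = 1/N, so under this assumption each increment perturbs the output by only 1/N of the full gap between the two models.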

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The parallel-execution idea could extend to other sudden model changes such as architecture search or domain adaptation where abrupt shifts cause instability.
  • Resource costs during the transition phase might be reduced by running the two models only on selected layers or batches rather than the full network.
  • The blending schedule itself could be made data-driven instead of fixed, potentially removing the need for manual tuning across new datasets.

Load-bearing premise

An affine combination of outputs from the original and compressed models during fine-tuning will reliably allow the network to adapt smoothly without introducing new instabilities or requiring technique-specific tuning of the blending schedule.

What would settle it

An experiment that applies VCON to a standard compression task and finds neither reduced accuracy degradation nor improved stability, relative to direct replacement or existing iterative baselines, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2510.09696 by Charalampos Antoniadis, Fabio Pareschi, Gianluca Setti, Lorenzo Nikiforos, Luciano Prono, Riccardo Rovatti.

Figure 1: Illustration of VCON: from left to right, the original layer (orange)
Figure 2: Illustration of block-wise VCON: the first two blocks
Figure 3: Visual intuition of the VCON approach: when a model parameter

Original abstract

The increasing scale of Deep Neural Networks (DNNs) introduces the need for compression techniques such as pruning, quantization, and low-rank decomposition. While these methods are very effective at reducing memory, computation, and energy consumption, they may introduce severe accuracy degradation, which is often mitigated by using iterative, gradual compression. However, different compression techniques require distinct iterative approaches, and some result in unstable, discontinuous model fine-tuning. We introduce Vanishing Contributions (VCON), a unified framework for the smooth, iterative transition of DNNs into a compressed form. Rather than replacing the original network directly with its compressed version, VCON executes both in parallel during fine-tuning. The contribution of the original (uncompressed) model is progressively reduced, while that of the compressed model is gradually increased. This affine combination allows the network to slowly adapt, improving stability and mitigating accuracy degradation. We evaluate VCON on computer vision and natural language processing benchmarks, using multiple compression strategies. In most settings, our framework improves accuracy over post-shot and iterative baselines. Typical gains exceed 1%, while some configuration exhibits improvements above 15%. VCON is thus compatible with existing compression techniques and consistently improves performance across diverse tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Vanishing Contributions (VCON), a unified framework for smooth iterative compression of deep neural networks. Rather than directly substituting the original model with a compressed version (via pruning, quantization, or low-rank decomposition), VCON runs both models in parallel during fine-tuning and applies an affine combination to progressively reduce the original model's contribution while increasing the compressed model's. This is claimed to improve training stability and yield accuracy gains over post-shot and iterative baselines, with typical improvements exceeding 1% and some configurations above 15% on vision and NLP benchmarks.

Significance. If the central claims hold under rigorous validation, VCON could offer a practical, general-purpose technique for mitigating accuracy loss during compression transitions, reducing the need for bespoke iterative fine-tuning schedules per compression method. This would be valuable for efficient ML deployment, particularly if the approach proves robust across diverse architectures and compression types without introducing new instabilities.

major comments (3)
  1. [Abstract] Abstract: the description of the affine combination ((1-α)·f_orig + α·f_comp) does not specify the combination point (logits vs. intermediate features) or the functional form of the ramp schedule α(t). Without these, it is impossible to verify whether the same blending rule remains stable and technique-agnostic when the compressed model changes structure or precision, which is load-bearing for the unified-framework claim.
  2. [Experimental section] Experimental section (as referenced in the abstract's performance claims): the abstract asserts that VCON 'consistently improves performance across diverse tasks' and that 'typical gains exceed 1%' (some above 15%) but supplies no dataset sizes, number of runs, error bars, or statistical tests. This prevents verification of whether the reported gains are reliable or could be artifacts of single-run evaluation.
  3. [Method description] Method description: the central claim presupposes that a single, technique-agnostic schedule for α exists that avoids accuracy cliffs or new instabilities. No ablation on schedule sensitivity or per-technique retuning is referenced, which directly tests the unification premise.
minor comments (1)
  1. [Abstract] Abstract: the sentence 'some configuration exhibits improvements above 15%' should identify the specific compression method, dataset, and configuration for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We provide point-by-point responses to the major comments below, indicating revisions where we have updated the manuscript.

  1. Referee: [Abstract] Abstract: the description of the affine combination ((1-α)·f_orig + α·f_comp) does not specify the combination point (logits vs. intermediate features) or the functional form of the ramp schedule α(t). Without these, it is impossible to verify whether the same blending rule remains stable and technique-agnostic when the compressed model changes structure or precision, which is load-bearing for the unified-framework claim.

    Authors: We agree with this observation and have revised the abstract to specify that the affine combination is performed on the logits and that α(t) is implemented as a linear ramp schedule. These details, along with the justification for their choice to maintain stability and technique-agnosticism, are elaborated in Section 3 of the paper. revision: yes

  2. Referee: [Experimental section] Experimental section (as referenced in the abstract's performance claims): the abstract asserts that VCON 'consistently improves performance across diverse tasks' and that 'typical gains exceed 1%' (some above 15%) but supplies no dataset sizes, number of runs, error bars, or statistical tests. This prevents verification of whether the reported gains are reliable or could be artifacts of single-run evaluation.

    Authors: The full paper in Section 4 includes dataset sizes, results from multiple runs with error bars, and statistical comparisons. To improve the abstract, we have added a sentence noting that the gains are consistent across multiple runs with reported standard deviations. This addresses the concern while maintaining abstract brevity. revision: yes

  3. Referee: [Method description] Method description: the central claim presupposes that a single, technique-agnostic schedule for α exists that avoids accuracy cliffs or new instabilities. No ablation on schedule sensitivity or per-technique retuning is referenced, which directly tests the unification premise.

    Authors: We have added an ablation study on different schedules for α in the revised manuscript, demonstrating that the linear schedule is robust and does not require per-technique retuning for the compression methods tested. This is now referenced in the method section to strengthen the unification claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; VCON defined independently and evaluated externally

The paper introduces VCON by defining an affine combination of original and compressed model outputs during joint fine-tuning, with the original contribution progressively reduced via a ramped parameter. This construction is presented as a new technique and then tested on external CV and NLP benchmarks across pruning, quantization, and low-rank methods. No equation or claim reduces the performance gains to a self-definition, a fitted input relabeled as prediction, or a load-bearing self-citation chain. The derivation remains self-contained because the method is specified first and its benefits are measured against independent baselines rather than derived tautologically from its own parameters.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that gradual output blending during fine-tuning produces stable adaptation; the blending schedule itself functions as an implicit free parameter whose specific form is not detailed in the abstract.

free parameters (1)
  • contribution schedule
    The rate and functional form governing how quickly the original model's weight decreases and the compressed model's weight increases must be chosen or tuned for each task and compression method.
axioms (1)
  • domain assumption: Affine combination of model outputs during fine-tuning permits stable adaptation across compression techniques
    Invoked when the paper states that the parallel execution and gradual shift mitigate accuracy degradation.
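
To illustrate what a contribution schedule looks like in practice, the sketch below implements a linear ramp for the blending weight, the form named in the simulated rebuttal. The function name `linear_alpha`, the `total_steps` parameter, and the clamping behavior are assumptions made for illustration; the paper's actual schedule may differ.

```python
def linear_alpha(step: int, total_steps: int) -> float:
    """Weight of the compressed model: rises linearly from 0 to 1 over
    total_steps, then stays at 1 (compressed model only)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


# Print (step, weight of original model, weight of compressed model).
for step in (0, 25, 50, 100, 150):
    alpha = linear_alpha(step, total_steps=100)
    print(step, 1.0 - alpha, alpha)
```

The only quantity to choose in this sketch is `total_steps`, which sets how quickly the original model's contribution vanishes. That is the free parameter this ledger entry refers to.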
