pith. sign in

arxiv: 2505.17469 · v2 · pith:OTHB6R25new · submitted 2025-05-23 · 💻 cs.LG · cs.AI· cs.IT· math.IT· math.OC· math.ST· stat.TH

Efficient compression of neural networks and datasets

Pith reviewed 2026-05-19 14:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.ITmath.ITmath.OCmath.STstat.TH
keywords neural network compressionminimum description lengthSolomonoff inductionmodel pruninggeneralizationsparse optimizationteacher-student settingsample efficiency
0
0 comments X p. Extension
pith:OTHB6R25 Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{OTHB6R25}

Prints a linked pith:OTHB6R25 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Refined sparse optimization approximates minimum description length to compress neural networks and datasets while improving generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects neural network pruning to algorithmic information theory by using parameter sparsity as a practical substitute for minimum description length. It improves probabilistic pruning to avoid Monte Carlo sampling and adds a binary search step to smooth l0 approximations, reducing hyperparameter tuning. These changes produce highly compressed convolutional and transformer models on image and text tasks with little accuracy loss and shorter overall description lengths. In controlled teacher-student experiments the compressed students reach target performance with fewer samples and generalize more accurately than dense counterparts. The results supply direct empirical backing for the Solomonoff induction claim that simpler models perform better inductive inference on low-complexity data.

Core claim

By recasting minimum description length optimization as l0-regularized learning and refining complementary sparse methods, the authors obtain substantial model compression across convolutional networks and transformers on image and text datasets. The refined probabilistic pruning procedure eliminates Monte Carlo sampling while the binary-search routine for smooth l0 approximations lowers hyperparameter complexity. When these compressed models are placed in a teacher-student setting, they exhibit greater sample efficiency and stronger generalization than uncompressed baselines, confirming the Solomonoff induction prediction that shorter descriptions support better learning from limited data.

What carries the argument

l0-regularized learning used as a computable proxy for exact minimum description length, realized through sampling-free probabilistic pruning and binary-search refinement of smooth l0 approximations.

If this is right

  • The refined methods deliver substantial parameter reduction on image and text tasks while keeping accuracy loss minimal.
  • The resulting models produce shorter combined model-plus-data description lengths than earlier pruning techniques.
  • In teacher-student setups the compressed students reach target accuracy with fewer training examples than dense students.
  • Generalization improves when training uses the sparsity-based compression relative to standard dense training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparsity proxies might be tested on reinforcement-learning agents to see whether they improve generalization from sparse rewards.
  • If the approximation holds, searching over even sparser architectures could produce models whose internal representations are easier to interpret.
  • Extending the teacher-student protocol to data drawn from known high-complexity distributions would test the boundary of the reported benefit.
  • The techniques could be paired with quantization or distillation to reach still smaller effective description lengths.

Load-bearing premise

Parameter sparsity provides a good enough stand-in for true model description length that l0-regularized training can replace exact minimum description length optimization.

What would settle it

A teacher-student experiment in which students trained with the refined compression methods show no improvement in sample efficiency or generalization compared with students trained without the compression step.

Figures

Figures reproduced from arXiv: 2505.17469 by Lukas Silvester Barth, Paulo von Petersenn.

Figure 1
Figure 1. Figure 1: ‘Test Loss‘ vs ‘Model Byte Size‘ for different dataset sizes in the teacher-student setup. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Mean loss vs model size with description length (in MB) isolines for the transformer [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Weights in a simple neural network before (left) and after (right) random gradient pruning. [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of ℓ2-regularization strength (ρ) on performance across datasets, visualized via α-Hull plots. Each dot corresponds to a different hyperparameter configuration. Solid lines denote the optimal trade-off frontiers. Compression is measured as the ratio of uncompressed to regularized model size. Color encodes ρ: yellow (highest), pink (intermediate), and green (ρ = 0). (a) MNIST (LeNet-300-100), (b) CIF… view at source ↗
Figure 5
Figure 5. Figure 5: ‘Accuracy‘ vs ‘Model Size Compression Rate‘ for different datasets and models. [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ‘Test Loss‘ vs ‘Model Byte Size‘ for different models. Description length (in bytes) isolines [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: shows the dependence of description length on α. The black line indicates the verbatim coding of the data. One can see that the right amount of regularization depends on the dataset size and is always capable of reducing the description length of the data. As we progress from smaller to larger datasets, we again witness a U-shaped curve pattern emerging (which starts with a left half-U, morphs into a full … view at source ↗
Figure 8
Figure 8. Figure 8: Mean loss vs model size with description length (in GB) isolines for the transformer [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Loss curves for LLaMA models. Left: Original training curve from [ [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗
read the original abstract

Compression and generalization are fundamentally related through Solomonoff induction and the minimum description length principle (MDL), which predict that simpler models generalize better when data arises from low-complexity distributions. In this article, we combine insights from algorithmic information theory and techniques from neural network pruning to improve model generalization by identifying the most effective data compression method. Since exact MDL optimization is intractable, we cast it as $\ell_0$ regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. To identify the best practical approach, we systematically compare and refine complementary sparse optimization methods. In particular, we improve probabilistic pruning through a procedure that does not require Monte Carlo sampling and refine smooth $\ell_0$ approximations with a binary search routine that reduces hyperparameter complexity. Across convolutional networks and transformers evaluated on image and text datasets, our refined methods improve upon their predecessors, achieve substantial model compression with minimal accuracy loss, and yield short data description lengths. Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that compression and generalization are linked via Solomonoff induction and MDL, proposes ℓ0-regularized learning as a practical proxy for minimizing description length, refines probabilistic pruning (removing Monte Carlo sampling) and smooth-ℓ0 methods (via binary search for hyperparameters), demonstrates competitive compression on CNNs and transformers for image and text data, and reports a teacher-student experiment in which the resulting sparse models learn more sample-efficiently and generalize better than dense counterparts.

Significance. If the teacher-student protocol includes proper controls and the sparsity methods are shown to reduce total description length (not merely parameter count), the work would supply useful empirical evidence connecting algorithmic information theory to modern neural network practice. The algorithmic refinements to existing pruning techniques could be of interest to the compression community even if the MDL interpretation is set aside.

major comments (2)
  1. Abstract and teacher-student section: the central generalization claim requires that the ℓ0-regularized sparse models achieve lower total description length than dense models. Parameter count reduction alone does not establish this; a full description length must encode the support (indices) and the precision of retained values. No explicit codelength comparison (e.g., via arithmetic coding of indices plus quantized values) is reported on the teacher-generated data, so observed sample-efficiency gains could arise from generic regularization rather than MDL minimization.
  2. Abstract: the experimental support for improved sample efficiency is described only at a high level. No error bars, dataset sizes, number of runs, ablation controls, or precise teacher-student protocol (how the teacher is trained, how sparsity is applied during student training, how description length is measured) are supplied, making it impossible to assess whether the Solomonoff prediction has been tested or merely illustrated.
minor comments (2)
  1. The abstract states that the refined methods 'yield short data description lengths,' but no quantitative values or comparison tables for data codelengths are referenced.
  2. Notation for the refined probabilistic pruning procedure and the binary-search routine for smooth-ℓ0 should be introduced with explicit equations or pseudocode to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We agree that strengthening the link between our sparsity methods and total description length, as well as providing more experimental detail, will improve the manuscript. We respond to each major comment below and will incorporate the requested clarifications and analyses in the revised version.

read point-by-point responses
  1. Referee: Abstract and teacher-student section: the central generalization claim requires that the ℓ0-regularized sparse models achieve lower total description length than dense models. Parameter count reduction alone does not establish this; a full description length must encode the support (indices) and the precision of retained values. No explicit codelength comparison (e.g., via arithmetic coding of indices plus quantized values) is reported on the teacher-generated data, so observed sample-efficiency gains could arise from generic regularization rather than MDL minimization.

    Authors: We agree that a complete description length must include the cost of encoding the support (indices of retained parameters) and the precision of the non-zero values, not merely the count of non-zero parameters. In the current manuscript we treat the number of non-zero parameters as the primary computable proxy for model description length, consistent with standard practice in the pruning literature, while reporting data description length via negative log-likelihood. To more directly test the MDL prediction in the teacher-student setting, we will add an explicit total codelength comparison in the revised version. This will encode the binary support mask (via arithmetic coding or an equivalent scheme) together with quantized values of the retained weights and compare the resulting total length against dense baselines on the teacher-generated data. We expect this addition to help separate MDL-driven compression from generic regularization effects. revision: yes

  2. Referee: Abstract: the experimental support for improved sample efficiency is described only at a high level. No error bars, dataset sizes, number of runs, ablation controls, or precise teacher-student protocol (how the teacher is trained, how sparsity is applied during student training, how description length is measured) are supplied, making it impossible to assess whether the Solomonoff prediction has been tested or merely illustrated.

    Authors: We acknowledge that the abstract summarizes the teacher-student results at a high level. The body of the manuscript already contains a dedicated experimental section that specifies the teacher training procedure (full-dataset supervised training), the precise application of our refined probabilistic and smooth-ℓ0 methods to the student, the image and text datasets with their sizes, and the measurement of description length as the sum of model complexity and data negative log-likelihood. In the revision we will (i) expand the abstract to include the number of independent runs, error bars (standard deviation), and a concise statement of the protocol, and (ii) add a short subsection that explicitly lists ablation controls, dataset cardinalities, and the exact description-length formula used. These changes will make the experimental support for the Solomonoff prediction fully reproducible from the abstract onward. revision: yes

Circularity Check

1 steps flagged

Sparsity proxy for description length tuned to observed performance, weakening independent verification of Solomonoff prediction

specific steps
  1. fitted input called prediction [Abstract]
    "Since exact MDL optimization is intractable, we cast it as ℓ0 regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. ... Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better."

    The paper presents sparsity via ℓ0 regularization as the practical MDL proxy, then uses the resulting models to 'verify' the Solomonoff prediction of better generalization. Because the regularization parameters and pruning routines are refined and selected to match the observed compression-plus-accuracy results on the same teacher-generated data, the efficiency gains are statistically tied to the fitting procedure rather than to an a-priori lower description length.

full rationale

The paper's central empirical verification in the teacher-student setting relies on l0-regularized sparse models as a computable stand-in for MDL, but the regularization strength and pruning thresholds are selected to achieve the reported compression and accuracy outcomes. This creates a partial dependence where the observed sample-efficiency gains are explained by the same fitting process used to construct the 'compressed' models, rather than by an independent demonstration that the final models have strictly lower total description length (including index encoding and value precision) than dense baselines. No explicit codelength comparison is provided, so the Solomonoff link remains interpretive rather than quantitatively closed.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central approach rests on the MDL principle as a domain assumption and introduces tunable regularization parameters whose values are selected to achieve desired sparsity and performance.

free parameters (1)
  • l0 regularization strength
    Controls the trade-off between model sparsity and task performance in the cast optimization problem.
axioms (1)
  • domain assumption Solomonoff induction and the minimum description length principle predict that simpler models generalize better when data arises from low-complexity distributions.
    Invoked in the abstract to motivate the entire compression-generalization link and the teacher-student experiment.

pith-pipeline@v0.9.0 · 5737 in / 1209 out tokens · 47442 ms · 2026-05-19T14:09:47.815575+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We conceptually motivate ℓ0 regularization from first principles of algorithmic complexity and Solomonoff induction... empirically verify the prediction of Solomonoff induction that regularized models can exhibit more sample-efficient convergence.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 14 internal anchors

  1. [1]

    Probabilistic and nonlinear compressive sensing

    Anonymous. Probabilistic and nonlinear compressive sensing. To appear soon, 2025

  2. [2]

    Ayinde, Tamer Inanc, and Jacek M

    Babajide O. Ayinde, Tamer Inanc, and Jacek M. Zurada. Redundant feature pruning for accelerated inference in deep neural networks. Neural Networks, 118:148–158, 2019

  3. [3]

    Iterative thresholding for sparse approximations

    Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14:629–654, 2008

  4. [4]

    Sequential learning of neural networks for prequential mdl

    Jorg Bornschein, Yazhe Li, and Marcus Hutter. Sequential learning of neural networks for prequential mdl. arXiv preprint arXiv:2210.07931, 2022

  5. [5]

    Distributed opti- mization and statistical learning via the alternating direction method of multipliers

    Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed opti- mization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011

  6. [6]

    Feature selection via concave minimization and support vector machines

    Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998

  7. [7]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1– 113, 2023. 11

  8. [8]

    On the compression of neural networks using l0-norm regularization and weight pruning

    Felipe Dennis de Resende Oliveira, Eduardo Luiz Ortiz Batista, and Rui Seara. On the compression of neural networks using l0-norm regularization and weight pruning. Neural Networks, 171:343–352, 2024

  9. [9]

    Deepseek-v3 technical report, 2024

    DeepSeek-AI. Deepseek-v3 technical report, 2024

  10. [10]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023

  11. [11]

    Language Modeling Is Compression

    Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christo- pher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023

  12. [12]

    Global sparse momentum sgd for pruning very deep neural networks

    Xiaohan Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu, et al. Global sparse momentum sgd for pruning very deep neural networks. Advances in Neural Information Processing Systems, 32, 2019

  13. [13]

    Learning to prune deep neural networks via layer-wise optimal brain surgeon

    Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in neural information processing systems, 30, 2017

  14. [14]

    Disarm: An antithetic gradient estimator for binary latent variables

    Zhe Dong, Andriy Mnih, and George Tucker. Disarm: An antithetic gradient estimator for binary latent variables. Advances in neural information processing systems, 33:18637–18647, 2020

  15. [15]

    Hamprecht

    Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape, 2019

  16. [16]

    Edelsbrunner, D

    H. Edelsbrunner, D. Kirkpatrick, and R. Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29(4):551–559, 1983

  17. [17]

    Scaling rectified flow transform- ers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  18. [18]

    and contributors

    Joaquim F. and contributors. Gmt.jl: Julia wrapper for the generic mapping tools. online, 2014–2025. Accessed: 2025-05-02

  19. [19]

    A Mathematical Introduction to Compressive Sensing

    Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing . Applied and Numerical Harmonic Analysis. Springer New York, New York, NY , 2013

  20. [20]

    The minimum description length principle

    Peter D Grünwald. The minimum description length principle. MIT press, 2007

  21. [21]

    The l0 norm constraint lms algorithm for sparse system identification

    Yuantao Gu, Jian Jin, and Shunliang Mei. The l0 norm constraint lms algorithm for sparse system identification. IEEE Signal Processing Letters, 16(9):774–777, 2009

  22. [22]

    Wiki-40b: Multilingual language model dataset

    Mandy Guo, Zihang Dai, Denny Vrande ˇci´c, and Rami Al-Rfou. Wiki-40b: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440–2452, 2020

  23. [23]

    Dynamic network surgery for efficient dnns

    Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. Advances in neural information processing systems, 29, 2016

  24. [24]

    Learning both weights and connections for efficient neural network

    Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015

  25. [25]

    Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso

    Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017

  26. [26]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  27. [27]

    Data-driven sparse structure selection for deep neural networks

    Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018

  28. [28]

    Universal artificial intelligence: Sequential decisions based on algorithmic probability

    Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2005. 12

  29. [29]

    Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit

    Yerlan Idelbayev and Miguel Á Carreira-Perpiñán. Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3373–3377. IEEE, 2022

  30. [30]

    Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Ser- gio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

  31. [31]

    Auto-Encoding Variational Bayes

    Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  32. [32]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  33. [33]

    On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963

    Andrei N Kolmogorov. On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963

  34. [34]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. N.A., 2009. 13

  35. [35]

    Gradient estimation for binary latent variables via gradient variance clipping

    Russell Z Kunes, Mingzhang Yin, Max Land, Doron Haviv, Dana Pe’er, and Simon Tavaré. Gradient estimation for binary latent variables via gradient variance clipping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37-7, pages 8405–8412, 2023

  36. [36]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  37. [37]

    Optimal brain damage

    Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

  38. [38]

    SNIP: Single-shot Network Pruning based on Connection Sensitivity

    Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018

  39. [39]

    Pruning Filters for Efficient ConvNets

    Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016

  40. [40]

    L0-arm: Network sparsification via stochastic binary optimization

    Yang Li and Shihao Ji. L0-arm: Network sparsification via stochastic binary optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 432–448. Springer, 2019

  41. [41]

    Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

  42. [42]

    Bayesian compression for deep learning

    Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. Advances in neural information processing systems, 30, 2017

  43. [43]

    Learning Sparse Neural Networks through $L_0$ Regularization

    Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017

  44. [44]

    Machine learning via polyhedral concave minimization

    Olvi L Mangasarian. Machine learning via polyhedral concave minimization. In Applied Mathematics and Parallel Computing: Festschrift for Klaus Ritter, pages 175–188. Springer, 1996

  45. [45]

    Relaxed lasso

    Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007

  46. [46]

    Variational dropout sparsifies deep neural networks

    Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International conference on machine learning, pages 2498–2507. PMLR, 2017

  47. [47]

    Lux: Explicit parameterization of deep neural networks in julia, April 2023

    Avik Pal. Lux: Explicit parameterization of deep neural networks in julia, April 2023. If you use this software, please cite it as below

  48. [48]

    On efficient training & inference of neural differential equations, 2023

    Avik Pal. On efficient training & inference of neural differential equations, 2023

  49. [49]

    Source coding algorithms for fast data compression

    Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University CA, 1976

  50. [50]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfo...

  51. [51]

    Y .C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol.1, 1993

  52. [52]

    Pruning deep neural networks with l_0-constrained optimization

    Dzung T Phan, Lam M Nguyen, Nam H Nguyen, and Jayant R Kalagnanam. Pruning deep neural networks with l_0-constrained optimization. In 2020 IEEE International Conference on Data Mining (ICDM), pages 1214–1219. IEEE, 2020

  53. [53]

    Convergence of discrete mdl for sequential prediction

    Jan Poland and Marcus Hutter. Convergence of discrete mdl for sequential prediction. In Learning Theory: 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004. Proceedings 17, pages 300–314. Springer, 2004

  54. [54]

    Universal Differential Equations for Scientific Machine Learning

    Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020

  55. [55]

    Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023

    Jack Rae. Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023. 14

  56. [56]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

  57. [57]

    J. J. Rissanen. Generalized kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976

  58. [58]

    Modeling by shortest data description

    Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978

  59. [59]

    Learning equations for extrapolation and control

    Subham Sahoo, Christoph Lampert, and Georg Martius. Learning equations for extrapolation and control. In International Conference on Machine Learning, pages 4442–4450. Pmlr, 2018

  60. [60]

    Formal theory of creativity, fun, and intrinsic motivation (1990–2010)

    Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010

  61. [61]

    A mathematical theory of communication

    Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948

  62. [62]

    Play and Prune: Adaptive Filter Pruning for Deep Model Compression

    Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. Play and prune: Adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446, 2019

  63. [63]

    Complexity-based induction systems: comparisons and convergence theorems

    Ray Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE transactions on Information Theory, 24(4):422–432, 1978

  64. [64]

    A formal theory of inductive inference

    Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1–22, 1964

  65. [65]

    Regression shrinkage and selection via the lasso

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996

  66. [66]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  67. [67]

    Soft Weight-Sharing for Neural Network Compression

    Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017

  68. [68]

    Autoprune: Automatic network pruning by regularizing auxiliary parameters

    Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019

  69. [69]

    Probabilistic best subset selection via gradient-based optimization

    Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization. arXiv preprint arXiv:2006.06448, 2020

  70. [70]

    Model selection and estimation in regression with grouped variables

    Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006

  71. [71]

    Learning to share: Simultaneous parameter tying and sparsification in deep learning

    Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In International Conference on Learning Representations, 2018

  72. [72]

    A systematic dnn weight pruning framework using alternating direction method of multipliers

    Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European conference on computer vision (ECCV) , pages 184–199, 2018

  73. [73]

    density description length

    Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2780–2789, 2019. 15 Appendix A Equivalence of minimum description length and Kolmogorov complexity Here, we briefly show the following L...

  74. [74]

    Suppose we have a model fθ with parameter θ, that we want to optimize to fit some data D and that we want to distribute our computation with N batches of data Di, i ∈ {1, · · · , N} on N compute nodes, that run in parallel. We can do this by copying the model z := θ to N other machines, where we define xi := copy(θ) and where we also copy the data and mod...

  75. [75]

    So we can use Proposition 2.2 to reformulate the zk+1-update as follows

    The big advantage is now that the L0(θ) term occurs only within an optimization step, that does not involve fθ(D). So we can use Proposition 2.2 to reformulate the zk+1-update as follows. First, we introduce two new vector-valued variables w and Z and write z = wZ, where wZ is componentwise multiplication, and w ∈ Rn while Z ∈ { 0, 1}n. Then we 21 apply t...

  76. [76]

    We can attempt to solve for w by computing ∂L(w, π) ∂wℓ = − X i X j ρ (Mixk+1 i − c + uk i )j − X q Pjq wqπq Pjℓπℓ+ + N ρ X p P 2 pℓwℓπℓ(1 − πℓ) = −ρ X i,j (Mixk+1 i − c + uk i )jPjℓπℓ + N ρ X j,q πℓP T ℓj Pjq πqwq + N ρ X j,q P T ℓj Pjℓπℓ(1 − πℓ)δℓqwq = −ρπℓ X i,j P T ℓj(Mixk+1 i − c + uk i )j+ + N ρπℓ X j,q P T ℓj Pjq πq + (1 − πq)δℓq wq (31) and settin...

  77. [77]

    In general, this can be non-trivial to solve but for the important special case that P = id, or Pjq = δjq, we obtain 0 = −πℓ X i (Mixk+1 i − c + uk i )ℓ = N πℓ ˆwℓ. (32) This leads to either an arbitrary value for ˆwℓ, if πℓ = 0, or otherwise to ˆw = 1 N X i (Mixk+1 i − c + uk i ) =: 1 N X i rk i (33) This is an average, that does not depend on π anymore

  78. [78]

    This 22 must be a vertex of the cube [0, 1]n

    Inserting this back into L( ˆw, π), for the case Ppj = δpj, we obtain L( ˆw, π) = α X j πj + X i ρ 2 ||rk i − ( ˆwπ)||2 2 + N ρ 2 X j ˆw2 j πj(1 − πj) (34) Since this is linear in πℓ for all ℓ ∈ {1, · · · , n}, the only thing we need to do in order to find a minimum, is to check the sign of the gradient with respect to πℓ, and move πℓ to the most extreme ...

  79. [79]

    Strong gradients near zero may trap parameters in local minima

  80. [80]

    Parameters with large magnitudes but low utility may receive weak gradients and be ineffi- ciently optimized

Showing first 80 references.