Efficient compression of neural networks and datasets

arxiv: 2505.17469 · v2 · pith:OTHB6R25new · submitted 2025-05-23 · 💻 cs.LG · cs.AI· cs.IT· math.IT· math.OC· math.ST· stat.TH

Efficient compression of neural networks and datasets

Lukas Silvester Barth , Paulo von Petersenn This is my paper

Pith reviewed 2026-05-19 14:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.ITmath.ITmath.OCmath.STstat.TH

keywords neural network compressionminimum description lengthSolomonoff inductionmodel pruninggeneralizationsparse optimizationteacher-student settingsample efficiency

0 comments p. Extension

pith:OTHB6R25 Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{OTHB6R25}

Prints a linked pith:OTHB6R25 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Refined sparse optimization approximates minimum description length to compress neural networks and datasets while improving generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects neural network pruning to algorithmic information theory by using parameter sparsity as a practical substitute for minimum description length. It improves probabilistic pruning to avoid Monte Carlo sampling and adds a binary search step to smooth l0 approximations, reducing hyperparameter tuning. These changes produce highly compressed convolutional and transformer models on image and text tasks with little accuracy loss and shorter overall description lengths. In controlled teacher-student experiments the compressed students reach target performance with fewer samples and generalize more accurately than dense counterparts. The results supply direct empirical backing for the Solomonoff induction claim that simpler models perform better inductive inference on low-complexity data.

Core claim

By recasting minimum description length optimization as l0-regularized learning and refining complementary sparse methods, the authors obtain substantial model compression across convolutional networks and transformers on image and text datasets. The refined probabilistic pruning procedure eliminates Monte Carlo sampling while the binary-search routine for smooth l0 approximations lowers hyperparameter complexity. When these compressed models are placed in a teacher-student setting, they exhibit greater sample efficiency and stronger generalization than uncompressed baselines, confirming the Solomonoff induction prediction that shorter descriptions support better learning from limited data.

What carries the argument

l0-regularized learning used as a computable proxy for exact minimum description length, realized through sampling-free probabilistic pruning and binary-search refinement of smooth l0 approximations.

If this is right

The refined methods deliver substantial parameter reduction on image and text tasks while keeping accuracy loss minimal.
The resulting models produce shorter combined model-plus-data description lengths than earlier pruning techniques.
In teacher-student setups the compressed students reach target accuracy with fewer training examples than dense students.
Generalization improves when training uses the sparsity-based compression relative to standard dense training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparsity proxies might be tested on reinforcement-learning agents to see whether they improve generalization from sparse rewards.
If the approximation holds, searching over even sparser architectures could produce models whose internal representations are easier to interpret.
Extending the teacher-student protocol to data drawn from known high-complexity distributions would test the boundary of the reported benefit.
The techniques could be paired with quantization or distillation to reach still smaller effective description lengths.

Load-bearing premise

Parameter sparsity provides a good enough stand-in for true model description length that l0-regularized training can replace exact minimum description length optimization.

What would settle it

A teacher-student experiment in which students trained with the refined compression methods show no improvement in sample efficiency or generalization compared with students trained without the compression step.

Figures

Figures reproduced from arXiv: 2505.17469 by Lukas Silvester Barth, Paulo von Petersenn.

**Figure 2.** Figure 2: Mean loss vs model size with description length (in MB) isolines for the transformer [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Weights in a simple neural network before (left) and after (right) random gradient pruning. [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of ℓ2-regularization strength (ρ) on performance across datasets, visualized via α-Hull plots. Each dot corresponds to a different hyperparameter configuration. Solid lines denote the optimal trade-off frontiers. Compression is measured as the ratio of uncompressed to regularized model size. Color encodes ρ: yellow (highest), pink (intermediate), and green (ρ = 0). (a) MNIST (LeNet-300-100), (b) CIF… view at source ↗

**Figure 5.** Figure 5: ‘Accuracy‘ vs ‘Model Size Compression Rate‘ for different datasets and models. [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗

**Figure 6.** Figure 6: ‘Test Loss‘ vs ‘Model Byte Size‘ for different models. Description length (in bytes) isolines [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗

**Figure 7.** Figure 7: shows the dependence of description length on α. The black line indicates the verbatim coding of the data. One can see that the right amount of regularization depends on the dataset size and is always capable of reducing the description length of the data. As we progress from smaller to larger datasets, we again witness a U-shaped curve pattern emerging (which starts with a left half-U, morphs into a full … view at source ↗

**Figure 8.** Figure 8: Mean loss vs model size with description length (in GB) isolines for the transformer [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

**Figure 9.** Figure 9: Loss curves for LLaMA models. Left: Original training curve from [ [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗

read the original abstract

Compression and generalization are fundamentally related through Solomonoff induction and the minimum description length principle (MDL), which predict that simpler models generalize better when data arises from low-complexity distributions. In this article, we combine insights from algorithmic information theory and techniques from neural network pruning to improve model generalization by identifying the most effective data compression method. Since exact MDL optimization is intractable, we cast it as $\ell_0$ regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. To identify the best practical approach, we systematically compare and refine complementary sparse optimization methods. In particular, we improve probabilistic pruning through a procedure that does not require Monte Carlo sampling and refine smooth $\ell_0$ approximations with a binary search routine that reduces hyperparameter complexity. Across convolutional networks and transformers evaluated on image and text datasets, our refined methods improve upon their predecessors, achieve substantial model compression with minimal accuracy loss, and yield short data description lengths. Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Some useful tweaks to sparse pruning plus a teacher-student check on whether compression aids generalization, but the MDL description-length claim rests on parameter count alone without direct codelength measurements.

read the letter

The main takeaway is that this paper refines a couple of existing sparse optimization methods and tests them in a teacher-student setup to see if the resulting compressed models learn faster from limited data. The algorithmic changes are incremental but practical, and the experiment is a reasonable way to probe the generalization claim under controlled conditions. That said, the link to actual minimum description length stays loose because they do not compute full codelengths that include the cost of encoding which weights are kept and at what precision. The observed sample-efficiency gains could therefore come from ordinary regularization effects rather than anything specific to Solomonoff-style compression. The abstract mentions short data description lengths, yet the details provided do not show an explicit comparison that would confirm the sparsity proxy is doing the explanatory work. On the positive side, the Monte-Carlo-free probabilistic pruning and the binary-search routine for smooth l0 look like they reduce implementation hassle and hyperparameter sensitivity compared with prior versions. The authors also run the methods on both convolutional networks and transformers across image and text tasks, which gives a broader sense of where the improvements hold. These parts are the concrete contributions that a reader could actually use or build on. The soft spots are mainly around experimental transparency and the strength of the theoretical framing. Without error bars, ablation tables, or a clear description of the teacher-student protocol in the abstract, it is difficult to judge how robust the generalization results are. The circularity concern also lands: the regularization strength is tuned to performance, so the sparsity level is not an independent measure of description length. This paper is aimed at people working on neural network compression or on information-theoretic accounts of generalization. Practitioners who need better pruning heuristics will find the refinements worth looking at, and theorists interested in MDL approximations will see a concrete attempt to test the idea empirically. It has enough new implementation details and a focused experiment that it deserves to go to referees rather than being desk-rejected, though the reviewers will probably ask for tighter quantification of the description-length claim and more controls on the teacher-student results.

Referee Report

2 major / 2 minor

Summary. The paper argues that compression and generalization are linked via Solomonoff induction and MDL, proposes ℓ0-regularized learning as a practical proxy for minimizing description length, refines probabilistic pruning (removing Monte Carlo sampling) and smooth-ℓ0 methods (via binary search for hyperparameters), demonstrates competitive compression on CNNs and transformers for image and text data, and reports a teacher-student experiment in which the resulting sparse models learn more sample-efficiently and generalize better than dense counterparts.

Significance. If the teacher-student protocol includes proper controls and the sparsity methods are shown to reduce total description length (not merely parameter count), the work would supply useful empirical evidence connecting algorithmic information theory to modern neural network practice. The algorithmic refinements to existing pruning techniques could be of interest to the compression community even if the MDL interpretation is set aside.

major comments (2)

Abstract and teacher-student section: the central generalization claim requires that the ℓ0-regularized sparse models achieve lower total description length than dense models. Parameter count reduction alone does not establish this; a full description length must encode the support (indices) and the precision of retained values. No explicit codelength comparison (e.g., via arithmetic coding of indices plus quantized values) is reported on the teacher-generated data, so observed sample-efficiency gains could arise from generic regularization rather than MDL minimization.
Abstract: the experimental support for improved sample efficiency is described only at a high level. No error bars, dataset sizes, number of runs, ablation controls, or precise teacher-student protocol (how the teacher is trained, how sparsity is applied during student training, how description length is measured) are supplied, making it impossible to assess whether the Solomonoff prediction has been tested or merely illustrated.

minor comments (2)

The abstract states that the refined methods 'yield short data description lengths,' but no quantitative values or comparison tables for data codelengths are referenced.
Notation for the refined probabilistic pruning procedure and the binary-search routine for smooth-ℓ0 should be introduced with explicit equations or pseudocode to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We agree that strengthening the link between our sparsity methods and total description length, as well as providing more experimental detail, will improve the manuscript. We respond to each major comment below and will incorporate the requested clarifications and analyses in the revised version.

read point-by-point responses

Referee: Abstract and teacher-student section: the central generalization claim requires that the ℓ0-regularized sparse models achieve lower total description length than dense models. Parameter count reduction alone does not establish this; a full description length must encode the support (indices) and the precision of retained values. No explicit codelength comparison (e.g., via arithmetic coding of indices plus quantized values) is reported on the teacher-generated data, so observed sample-efficiency gains could arise from generic regularization rather than MDL minimization.

Authors: We agree that a complete description length must include the cost of encoding the support (indices of retained parameters) and the precision of the non-zero values, not merely the count of non-zero parameters. In the current manuscript we treat the number of non-zero parameters as the primary computable proxy for model description length, consistent with standard practice in the pruning literature, while reporting data description length via negative log-likelihood. To more directly test the MDL prediction in the teacher-student setting, we will add an explicit total codelength comparison in the revised version. This will encode the binary support mask (via arithmetic coding or an equivalent scheme) together with quantized values of the retained weights and compare the resulting total length against dense baselines on the teacher-generated data. We expect this addition to help separate MDL-driven compression from generic regularization effects. revision: yes
Referee: Abstract: the experimental support for improved sample efficiency is described only at a high level. No error bars, dataset sizes, number of runs, ablation controls, or precise teacher-student protocol (how the teacher is trained, how sparsity is applied during student training, how description length is measured) are supplied, making it impossible to assess whether the Solomonoff prediction has been tested or merely illustrated.

Authors: We acknowledge that the abstract summarizes the teacher-student results at a high level. The body of the manuscript already contains a dedicated experimental section that specifies the teacher training procedure (full-dataset supervised training), the precise application of our refined probabilistic and smooth-ℓ0 methods to the student, the image and text datasets with their sizes, and the measurement of description length as the sum of model complexity and data negative log-likelihood. In the revision we will (i) expand the abstract to include the number of independent runs, error bars (standard deviation), and a concise statement of the protocol, and (ii) add a short subsection that explicitly lists ablation controls, dataset cardinalities, and the exact description-length formula used. These changes will make the experimental support for the Solomonoff prediction fully reproducible from the abstract onward. revision: yes

Circularity Check

1 steps flagged

Sparsity proxy for description length tuned to observed performance, weakening independent verification of Solomonoff prediction

specific steps

fitted input called prediction [Abstract]
"Since exact MDL optimization is intractable, we cast it as ℓ0 regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. ... Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better."

The paper presents sparsity via ℓ0 regularization as the practical MDL proxy, then uses the resulting models to 'verify' the Solomonoff prediction of better generalization. Because the regularization parameters and pruning routines are refined and selected to match the observed compression-plus-accuracy results on the same teacher-generated data, the efficiency gains are statistically tied to the fitting procedure rather than to an a-priori lower description length.

full rationale

The paper's central empirical verification in the teacher-student setting relies on l0-regularized sparse models as a computable stand-in for MDL, but the regularization strength and pruning thresholds are selected to achieve the reported compression and accuracy outcomes. This creates a partial dependence where the observed sample-efficiency gains are explained by the same fitting process used to construct the 'compressed' models, rather than by an independent demonstration that the final models have strictly lower total description length (including index encoding and value precision) than dense baselines. No explicit codelength comparison is provided, so the Solomonoff link remains interpretive rather than quantitatively closed.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central approach rests on the MDL principle as a domain assumption and introduces tunable regularization parameters whose values are selected to achieve desired sparsity and performance.

free parameters (1)

l0 regularization strength
Controls the trade-off between model sparsity and task performance in the cast optimization problem.

axioms (1)

domain assumption Solomonoff induction and the minimum description length principle predict that simpler models generalize better when data arises from low-complexity distributions.
Invoked in the abstract to motivate the entire compression-generalization link and the teacher-student experiment.

pith-pipeline@v0.9.0 · 5737 in / 1209 out tokens · 47442 ms · 2026-05-19T14:09:47.815575+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We conceptually motivate ℓ0 regularization from first principles of algorithmic complexity and Solomonoff induction... empirically verify the prediction of Solomonoff induction that regularized models can exhibit more sample-efficient convergence.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 14 internal anchors

[1]

Probabilistic and nonlinear compressive sensing

Anonymous. Probabilistic and nonlinear compressive sensing. To appear soon, 2025

work page 2025
[2]

Ayinde, Tamer Inanc, and Jacek M

Babajide O. Ayinde, Tamer Inanc, and Jacek M. Zurada. Redundant feature pruning for accelerated inference in deep neural networks. Neural Networks, 118:148–158, 2019

work page 2019
[3]

Iterative thresholding for sparse approximations

Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14:629–654, 2008

work page 2008
[4]

Sequential learning of neural networks for prequential mdl

Jorg Bornschein, Yazhe Li, and Marcus Hutter. Sequential learning of neural networks for prequential mdl. arXiv preprint arXiv:2210.07931, 2022

work page arXiv 2022
[5]

Distributed opti- mization and statistical learning via the alternating direction method of multipliers

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed opti- mization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011

work page 2011
[6]

Feature selection via concave minimization and support vector machines

Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998

work page 1998
[7]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1– 113, 2023. 11

work page 2023
[8]

On the compression of neural networks using l0-norm regularization and weight pruning

Felipe Dennis de Resende Oliveira, Eduardo Luiz Ortiz Batista, and Rui Seara. On the compression of neural networks using l0-norm regularization and weight pruning. Neural Networks, 171:343–352, 2024

work page 2024
[9]

Deepseek-v3 technical report, 2024

DeepSeek-AI. Deepseek-v3 technical report, 2024

work page 2024
[10]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023

work page 2023
[11]

Language Modeling Is Compression

Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christo- pher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Global sparse momentum sgd for pruning very deep neural networks

Xiaohan Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu, et al. Global sparse momentum sgd for pruning very deep neural networks. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[13]

Learning to prune deep neural networks via layer-wise optimal brain surgeon

Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in neural information processing systems, 30, 2017

work page 2017
[14]

Disarm: An antithetic gradient estimator for binary latent variables

Zhe Dong, Andriy Mnih, and George Tucker. Disarm: An antithetic gradient estimator for binary latent variables. Advances in neural information processing systems, 33:18637–18647, 2020

work page 2020
[15]

Hamprecht

Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape, 2019

work page 2019
[16]

Edelsbrunner, D

H. Edelsbrunner, D. Kirkpatrick, and R. Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29(4):551–559, 1983

work page 1983
[17]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[18]

and contributors

Joaquim F. and contributors. Gmt.jl: Julia wrapper for the generic mapping tools. online, 2014–2025. Accessed: 2025-05-02

work page 2014
[19]

A Mathematical Introduction to Compressive Sensing

Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing . Applied and Numerical Harmonic Analysis. Springer New York, New York, NY , 2013

work page 2013
[20]

The minimum description length principle

Peter D Grünwald. The minimum description length principle. MIT press, 2007

work page 2007
[21]

The l0 norm constraint lms algorithm for sparse system identification

Yuantao Gu, Jian Jin, and Shunliang Mei. The l0 norm constraint lms algorithm for sparse system identification. IEEE Signal Processing Letters, 16(9):774–777, 2009

work page 2009
[22]

Wiki-40b: Multilingual language model dataset

Mandy Guo, Zihang Dai, Denny Vrande ˇci´c, and Rami Al-Rfou. Wiki-40b: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440–2452, 2020

work page 2020
[23]

Dynamic network surgery for efficient dnns

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. Advances in neural information processing systems, 29, 2016

work page 2016
[24]

Learning both weights and connections for efficient neural network

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015

work page 2015
[25]

Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso

Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Data-driven sparse structure selection for deep neural networks

Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018

work page 2018
[28]

Universal artificial intelligence: Sequential decisions based on algorithmic probability

Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2005. 12

work page 2005
[29]

Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit

Yerlan Idelbayev and Miguel Á Carreira-Perpiñán. Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3373–3377. IEEE, 2022

work page 2022
[30]

Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Ser- gio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

work page 2024
[31]

Auto-Encoding Variational Bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[33]

On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963

Andrei N Kolmogorov. On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963

work page 1963
[34]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. N.A., 2009. 13

work page 2009
[35]

Gradient estimation for binary latent variables via gradient variance clipping

Russell Z Kunes, Mingzhang Yin, Max Land, Doron Haviv, Dana Pe’er, and Simon Tavaré. Gradient estimation for binary latent variables via gradient variance clipping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37-7, pages 8405–8412, 2023

work page 2023
[36]

Gradient-based learning applied to document recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

work page 1998
[37]

Optimal brain damage

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

work page 1989
[38]

SNIP: Single-shot Network Pruning based on Connection Sensitivity

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[39]

Pruning Filters for Efficient ConvNets

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[40]

L0-arm: Network sparsification via stochastic binary optimization

Yang Li and Shihao Ji. L0-arm: Network sparsification via stochastic binary optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 432–448. Springer, 2019

work page 2019
[41]

Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

work page 2024
[42]

Bayesian compression for deep learning

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. Advances in neural information processing systems, 30, 2017

work page 2017
[43]

Learning Sparse Neural Networks through $L_0$ Regularization

Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

Machine learning via polyhedral concave minimization

Olvi L Mangasarian. Machine learning via polyhedral concave minimization. In Applied Mathematics and Parallel Computing: Festschrift for Klaus Ritter, pages 175–188. Springer, 1996

work page 1996
[45]

Relaxed lasso

Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007

work page 2007
[46]

Variational dropout sparsifies deep neural networks

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International conference on machine learning, pages 2498–2507. PMLR, 2017

work page 2017
[47]

Lux: Explicit parameterization of deep neural networks in julia, April 2023

Avik Pal. Lux: Explicit parameterization of deep neural networks in julia, April 2023. If you use this software, please cite it as below

work page 2023
[48]

On efficient training & inference of neural differential equations, 2023

Avik Pal. On efficient training & inference of neural differential equations, 2023

work page 2023
[49]

Source coding algorithms for fast data compression

Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University CA, 1976

work page 1976
[50]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfo...

work page internal anchor Pith review Pith/arXiv arXiv 1912
[51]

Y .C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol.1, 1993

work page 1993
[52]

Pruning deep neural networks with l_0-constrained optimization

Dzung T Phan, Lam M Nguyen, Nam H Nguyen, and Jayant R Kalagnanam. Pruning deep neural networks with l_0-constrained optimization. In 2020 IEEE International Conference on Data Mining (ICDM), pages 1214–1219. IEEE, 2020

work page 2020
[53]

Convergence of discrete mdl for sequential prediction

Jan Poland and Marcus Hutter. Convergence of discrete mdl for sequential prediction. In Learning Theory: 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004. Proceedings 17, pages 300–314. Springer, 2004

work page 2004
[54]

Universal Differential Equations for Scientific Machine Learning

Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[55]

Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023

Jack Rae. Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023. 14

work page 2023
[56]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[57]

J. J. Rissanen. Generalized kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976

work page 1976
[58]

Modeling by shortest data description

Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978

work page 1978
[59]

Learning equations for extrapolation and control

Subham Sahoo, Christoph Lampert, and Georg Martius. Learning equations for extrapolation and control. In International Conference on Machine Learning, pages 4442–4450. Pmlr, 2018

work page 2018
[60]

Formal theory of creativity, fun, and intrinsic motivation (1990–2010)

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010

work page 1990
[61]

A mathematical theory of communication

Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948

work page 1948
[62]

Play and Prune: Adaptive Filter Pruning for Deep Model Compression

Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. Play and prune: Adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[63]

Complexity-based induction systems: comparisons and convergence theorems

Ray Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE transactions on Information Theory, 24(4):422–432, 1978

work page 1978
[64]

A formal theory of inductive inference

Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1–22, 1964

work page 1964
[65]

Regression shrinkage and selection via the lasso

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996

work page 1996
[66]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[67]

Soft Weight-Sharing for Neural Network Compression

Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[68]

Autoprune: Automatic network pruning by regularizing auxiliary parameters

Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019

work page 2019
[69]

Probabilistic best subset selection via gradient-based optimization

Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization. arXiv preprint arXiv:2006.06448, 2020

work page arXiv 2006
[70]

Model selection and estimation in regression with grouped variables

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006

work page 2006
[71]

Learning to share: Simultaneous parameter tying and sparsification in deep learning

Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In International Conference on Learning Representations, 2018

work page 2018
[72]

A systematic dnn weight pruning framework using alternating direction method of multipliers

Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European conference on computer vision (ECCV) , pages 184–199, 2018

work page 2018
[73]

density description length

Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2780–2789, 2019. 15 Appendix A Equivalence of minimum description length and Kolmogorov complexity Here, we briefly show the following L...

work page 2019
[74]

Suppose we have a model fθ with parameter θ, that we want to optimize to fit some data D and that we want to distribute our computation with N batches of data Di, i ∈ {1, · · · , N} on N compute nodes, that run in parallel. We can do this by copying the model z := θ to N other machines, where we define xi := copy(θ) and where we also copy the data and mod...

work page
[75]

So we can use Proposition 2.2 to reformulate the zk+1-update as follows

The big advantage is now that the L0(θ) term occurs only within an optimization step, that does not involve fθ(D). So we can use Proposition 2.2 to reformulate the zk+1-update as follows. First, we introduce two new vector-valued variables w and Z and write z = wZ, where wZ is componentwise multiplication, and w ∈ Rn while Z ∈ { 0, 1}n. Then we 21 apply t...

work page
[76]

We can attempt to solve for w by computing ∂L(w, π) ∂wℓ = − X i X j ρ (Mixk+1 i − c + uk i )j − X q Pjq wqπq Pjℓπℓ+ + N ρ X p P 2 pℓwℓπℓ(1 − πℓ) = −ρ X i,j (Mixk+1 i − c + uk i )jPjℓπℓ + N ρ X j,q πℓP T ℓj Pjq πqwq + N ρ X j,q P T ℓj Pjℓπℓ(1 − πℓ)δℓqwq = −ρπℓ X i,j P T ℓj(Mixk+1 i − c + uk i )j+ + N ρπℓ X j,q P T ℓj Pjq πq + (1 − πq)δℓq wq (31) and settin...

work page
[77]

In general, this can be non-trivial to solve but for the important special case that P = id, or Pjq = δjq, we obtain 0 = −πℓ X i (Mixk+1 i − c + uk i )ℓ = N πℓ ˆwℓ. (32) This leads to either an arbitrary value for ˆwℓ, if πℓ = 0, or otherwise to ˆw = 1 N X i (Mixk+1 i − c + uk i ) =: 1 N X i rk i (33) This is an average, that does not depend on π anymore

work page
[78]

This 22 must be a vertex of the cube [0, 1]n

Inserting this back into L( ˆw, π), for the case Ppj = δpj, we obtain L( ˆw, π) = α X j πj + X i ρ 2 ||rk i − ( ˆwπ)||2 2 + N ρ 2 X j ˆw2 j πj(1 − πj) (34) Since this is linear in πℓ for all ℓ ∈ {1, · · · , n}, the only thing we need to do in order to find a minimum, is to check the sign of the gradient with respect to πℓ, and move πℓ to the most extreme ...

work page
[79]

Strong gradients near zero may trap parameters in local minima

work page
[80]

Parameters with large magnitudes but low utility may receive weak gradients and be ineffi- ciently optimized

work page

Showing first 80 references.

[1] [1]

Probabilistic and nonlinear compressive sensing

Anonymous. Probabilistic and nonlinear compressive sensing. To appear soon, 2025

work page 2025

[2] [2]

Ayinde, Tamer Inanc, and Jacek M

Babajide O. Ayinde, Tamer Inanc, and Jacek M. Zurada. Redundant feature pruning for accelerated inference in deep neural networks. Neural Networks, 118:148–158, 2019

work page 2019

[3] [3]

Iterative thresholding for sparse approximations

Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14:629–654, 2008

work page 2008

[4] [4]

Sequential learning of neural networks for prequential mdl

Jorg Bornschein, Yazhe Li, and Marcus Hutter. Sequential learning of neural networks for prequential mdl. arXiv preprint arXiv:2210.07931, 2022

work page arXiv 2022

[5] [5]

Distributed opti- mization and statistical learning via the alternating direction method of multipliers

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed opti- mization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011

work page 2011

[6] [6]

Feature selection via concave minimization and support vector machines

Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998

work page 1998

[7] [7]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1– 113, 2023. 11

work page 2023

[8] [8]

On the compression of neural networks using l0-norm regularization and weight pruning

Felipe Dennis de Resende Oliveira, Eduardo Luiz Ortiz Batista, and Rui Seara. On the compression of neural networks using l0-norm regularization and weight pruning. Neural Networks, 171:343–352, 2024

work page 2024

[9] [9]

Deepseek-v3 technical report, 2024

DeepSeek-AI. Deepseek-v3 technical report, 2024

work page 2024

[10] [10]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023

work page 2023

[11] [11]

Language Modeling Is Compression

Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christo- pher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Global sparse momentum sgd for pruning very deep neural networks

Xiaohan Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu, et al. Global sparse momentum sgd for pruning very deep neural networks. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[13] [13]

Learning to prune deep neural networks via layer-wise optimal brain surgeon

Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in neural information processing systems, 30, 2017

work page 2017

[14] [14]

Disarm: An antithetic gradient estimator for binary latent variables

Zhe Dong, Andriy Mnih, and George Tucker. Disarm: An antithetic gradient estimator for binary latent variables. Advances in neural information processing systems, 33:18637–18647, 2020

work page 2020

[15] [15]

Hamprecht

Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape, 2019

work page 2019

[16] [16]

Edelsbrunner, D

H. Edelsbrunner, D. Kirkpatrick, and R. Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29(4):551–559, 1983

work page 1983

[17] [17]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[18] [18]

and contributors

Joaquim F. and contributors. Gmt.jl: Julia wrapper for the generic mapping tools. online, 2014–2025. Accessed: 2025-05-02

work page 2014

[19] [19]

A Mathematical Introduction to Compressive Sensing

Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing . Applied and Numerical Harmonic Analysis. Springer New York, New York, NY , 2013

work page 2013

[20] [20]

The minimum description length principle

Peter D Grünwald. The minimum description length principle. MIT press, 2007

work page 2007

[21] [21]

The l0 norm constraint lms algorithm for sparse system identification

Yuantao Gu, Jian Jin, and Shunliang Mei. The l0 norm constraint lms algorithm for sparse system identification. IEEE Signal Processing Letters, 16(9):774–777, 2009

work page 2009

[22] [22]

Wiki-40b: Multilingual language model dataset

Mandy Guo, Zihang Dai, Denny Vrande ˇci´c, and Rami Al-Rfou. Wiki-40b: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440–2452, 2020

work page 2020

[23] [23]

Dynamic network surgery for efficient dnns

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. Advances in neural information processing systems, 29, 2016

work page 2016

[24] [24]

Learning both weights and connections for efficient neural network

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015

work page 2015

[25] [25]

Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso

Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[26] [26]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [27]

Data-driven sparse structure selection for deep neural networks

Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018

work page 2018

[28] [28]

Universal artificial intelligence: Sequential decisions based on algorithmic probability

Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2005. 12

work page 2005

[29] [29]

Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit

Yerlan Idelbayev and Miguel Á Carreira-Perpiñán. Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3373–3377. IEEE, 2022

work page 2022

[30] [30]

Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Ser- gio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...

work page 2024

[31] [31]

Auto-Encoding Variational Bayes

Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[32] [32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[33] [33]

On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963

Andrei N Kolmogorov. On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963

work page 1963

[34] [34]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. N.A., 2009. 13

work page 2009

[35] [35]

Gradient estimation for binary latent variables via gradient variance clipping

Russell Z Kunes, Mingzhang Yin, Max Land, Doron Haviv, Dana Pe’er, and Simon Tavaré. Gradient estimation for binary latent variables via gradient variance clipping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37-7, pages 8405–8412, 2023

work page 2023

[36] [36]

Gradient-based learning applied to document recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

work page 1998

[37] [37]

Optimal brain damage

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

work page 1989

[38] [38]

SNIP: Single-shot Network Pruning based on Connection Sensitivity

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[39] [39]

Pruning Filters for Efficient ConvNets

Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[40] [40]

L0-arm: Network sparsification via stochastic binary optimization

Yang Li and Shihao Ji. L0-arm: Network sparsification via stochastic binary optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 432–448. Springer, 2019

work page 2019

[41] [41]

Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

work page 2024

[42] [42]

Bayesian compression for deep learning

Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. Advances in neural information processing systems, 30, 2017

work page 2017

[43] [43]

Learning Sparse Neural Networks through $L_0$ Regularization

Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[44] [44]

Machine learning via polyhedral concave minimization

Olvi L Mangasarian. Machine learning via polyhedral concave minimization. In Applied Mathematics and Parallel Computing: Festschrift for Klaus Ritter, pages 175–188. Springer, 1996

work page 1996

[45] [45]

Relaxed lasso

Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007

work page 2007

[46] [46]

Variational dropout sparsifies deep neural networks

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International conference on machine learning, pages 2498–2507. PMLR, 2017

work page 2017

[47] [47]

Lux: Explicit parameterization of deep neural networks in julia, April 2023

Avik Pal. Lux: Explicit parameterization of deep neural networks in julia, April 2023. If you use this software, please cite it as below

work page 2023

[48] [48]

On efficient training & inference of neural differential equations, 2023

Avik Pal. On efficient training & inference of neural differential equations, 2023

work page 2023

[49] [49]

Source coding algorithms for fast data compression

Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University CA, 1976

work page 1976

[50] [50]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfo...

work page internal anchor Pith review Pith/arXiv arXiv 1912

[51] [51]

Y .C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol.1, 1993

work page 1993

[52] [52]

Pruning deep neural networks with l_0-constrained optimization

Dzung T Phan, Lam M Nguyen, Nam H Nguyen, and Jayant R Kalagnanam. Pruning deep neural networks with l_0-constrained optimization. In 2020 IEEE International Conference on Data Mining (ICDM), pages 1214–1219. IEEE, 2020

work page 2020

[53] [53]

Convergence of discrete mdl for sequential prediction

Jan Poland and Marcus Hutter. Convergence of discrete mdl for sequential prediction. In Learning Theory: 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004. Proceedings 17, pages 300–314. Springer, 2004

work page 2004

[54] [54]

Universal Differential Equations for Scientific Machine Learning

Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[55] [55]

Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023

Jack Rae. Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023. 14

work page 2023

[56] [56]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[57] [57]

J. J. Rissanen. Generalized kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976

work page 1976

[58] [58]

Modeling by shortest data description

Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978

work page 1978

[59] [59]

Learning equations for extrapolation and control

Subham Sahoo, Christoph Lampert, and Georg Martius. Learning equations for extrapolation and control. In International Conference on Machine Learning, pages 4442–4450. Pmlr, 2018

work page 2018

[60] [60]

Formal theory of creativity, fun, and intrinsic motivation (1990–2010)

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010

work page 1990

[61] [61]

A mathematical theory of communication

Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948

work page 1948

[62] [62]

Play and Prune: Adaptive Filter Pruning for Deep Model Compression

Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. Play and prune: Adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[63] [63]

Complexity-based induction systems: comparisons and convergence theorems

Ray Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE transactions on Information Theory, 24(4):422–432, 1978

work page 1978

[64] [64]

A formal theory of inductive inference

Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1–22, 1964

work page 1964

[65] [65]

Regression shrinkage and selection via the lasso

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996

work page 1996

[66] [66]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[67] [67]

Soft Weight-Sharing for Neural Network Compression

Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[68] [68]

Autoprune: Automatic network pruning by regularizing auxiliary parameters

Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019

work page 2019

[69] [69]

Probabilistic best subset selection via gradient-based optimization

Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization. arXiv preprint arXiv:2006.06448, 2020

work page arXiv 2006

[70] [70]

Model selection and estimation in regression with grouped variables

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006

work page 2006

[71] [71]

Learning to share: Simultaneous parameter tying and sparsification in deep learning

Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In International Conference on Learning Representations, 2018

work page 2018

[72] [72]

A systematic dnn weight pruning framework using alternating direction method of multipliers

Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European conference on computer vision (ECCV) , pages 184–199, 2018

work page 2018

[73] [73]

density description length

Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2780–2789, 2019. 15 Appendix A Equivalence of minimum description length and Kolmogorov complexity Here, we briefly show the following L...

work page 2019

[74] [74]

Suppose we have a model fθ with parameter θ, that we want to optimize to fit some data D and that we want to distribute our computation with N batches of data Di, i ∈ {1, · · · , N} on N compute nodes, that run in parallel. We can do this by copying the model z := θ to N other machines, where we define xi := copy(θ) and where we also copy the data and mod...

work page

[75] [75]

So we can use Proposition 2.2 to reformulate the zk+1-update as follows

The big advantage is now that the L0(θ) term occurs only within an optimization step, that does not involve fθ(D). So we can use Proposition 2.2 to reformulate the zk+1-update as follows. First, we introduce two new vector-valued variables w and Z and write z = wZ, where wZ is componentwise multiplication, and w ∈ Rn while Z ∈ { 0, 1}n. Then we 21 apply t...

work page

[76] [76]

We can attempt to solve for w by computing ∂L(w, π) ∂wℓ = − X i X j ρ (Mixk+1 i − c + uk i )j − X q Pjq wqπq Pjℓπℓ+ + N ρ X p P 2 pℓwℓπℓ(1 − πℓ) = −ρ X i,j (Mixk+1 i − c + uk i )jPjℓπℓ + N ρ X j,q πℓP T ℓj Pjq πqwq + N ρ X j,q P T ℓj Pjℓπℓ(1 − πℓ)δℓqwq = −ρπℓ X i,j P T ℓj(Mixk+1 i − c + uk i )j+ + N ρπℓ X j,q P T ℓj Pjq πq + (1 − πq)δℓq wq (31) and settin...

work page

[77] [77]

In general, this can be non-trivial to solve but for the important special case that P = id, or Pjq = δjq, we obtain 0 = −πℓ X i (Mixk+1 i − c + uk i )ℓ = N πℓ ˆwℓ. (32) This leads to either an arbitrary value for ˆwℓ, if πℓ = 0, or otherwise to ˆw = 1 N X i (Mixk+1 i − c + uk i ) =: 1 N X i rk i (33) This is an average, that does not depend on π anymore

work page

[78] [78]

This 22 must be a vertex of the cube [0, 1]n

Inserting this back into L( ˆw, π), for the case Ppj = δpj, we obtain L( ˆw, π) = α X j πj + X i ρ 2 ||rk i − ( ˆwπ)||2 2 + N ρ 2 X j ˆw2 j πj(1 − πj) (34) Since this is linear in πℓ for all ℓ ∈ {1, · · · , n}, the only thing we need to do in order to find a minimum, is to check the sign of the gradient with respect to πℓ, and move πℓ to the most extreme ...

work page

[79] [79]

Strong gradients near zero may trap parameters in local minima

work page

[80] [80]

Parameters with large magnitudes but low utility may receive weak gradients and be ineffi- ciently optimized

work page