Efficient compression of neural networks and datasets
Pith reviewed 2026-05-19 14:09 UTC · model grok-4.3
pith:OTHB6R25 Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{OTHB6R25}
Prints a linked pith:OTHB6R25 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Refined sparse optimization approximates minimum description length to compress neural networks and datasets while improving generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By recasting minimum description length optimization as l0-regularized learning and refining complementary sparse methods, the authors obtain substantial model compression across convolutional networks and transformers on image and text datasets. The refined probabilistic pruning procedure eliminates Monte Carlo sampling while the binary-search routine for smooth l0 approximations lowers hyperparameter complexity. When these compressed models are placed in a teacher-student setting, they exhibit greater sample efficiency and stronger generalization than uncompressed baselines, confirming the Solomonoff induction prediction that shorter descriptions support better learning from limited data.
What carries the argument
l0-regularized learning used as a computable proxy for exact minimum description length, realized through sampling-free probabilistic pruning and binary-search refinement of smooth l0 approximations.
If this is right
- The refined methods deliver substantial parameter reduction on image and text tasks while keeping accuracy loss minimal.
- The resulting models produce shorter combined model-plus-data description lengths than earlier pruning techniques.
- In teacher-student setups the compressed students reach target accuracy with fewer training examples than dense students.
- Generalization improves when training uses the sparsity-based compression relative to standard dense training.
Where Pith is reading between the lines
- The same sparsity proxies might be tested on reinforcement-learning agents to see whether they improve generalization from sparse rewards.
- If the approximation holds, searching over even sparser architectures could produce models whose internal representations are easier to interpret.
- Extending the teacher-student protocol to data drawn from known high-complexity distributions would test the boundary of the reported benefit.
- The techniques could be paired with quantization or distillation to reach still smaller effective description lengths.
Load-bearing premise
Parameter sparsity provides a good enough stand-in for true model description length that l0-regularized training can replace exact minimum description length optimization.
What would settle it
A teacher-student experiment in which students trained with the refined compression methods show no improvement in sample efficiency or generalization compared with students trained without the compression step.
Figures
read the original abstract
Compression and generalization are fundamentally related through Solomonoff induction and the minimum description length principle (MDL), which predict that simpler models generalize better when data arises from low-complexity distributions. In this article, we combine insights from algorithmic information theory and techniques from neural network pruning to improve model generalization by identifying the most effective data compression method. Since exact MDL optimization is intractable, we cast it as $\ell_0$ regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. To identify the best practical approach, we systematically compare and refine complementary sparse optimization methods. In particular, we improve probabilistic pruning through a procedure that does not require Monte Carlo sampling and refine smooth $\ell_0$ approximations with a binary search routine that reduces hyperparameter complexity. Across convolutional networks and transformers evaluated on image and text datasets, our refined methods improve upon their predecessors, achieve substantial model compression with minimal accuracy loss, and yield short data description lengths. Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that compression and generalization are linked via Solomonoff induction and MDL, proposes ℓ0-regularized learning as a practical proxy for minimizing description length, refines probabilistic pruning (removing Monte Carlo sampling) and smooth-ℓ0 methods (via binary search for hyperparameters), demonstrates competitive compression on CNNs and transformers for image and text data, and reports a teacher-student experiment in which the resulting sparse models learn more sample-efficiently and generalize better than dense counterparts.
Significance. If the teacher-student protocol includes proper controls and the sparsity methods are shown to reduce total description length (not merely parameter count), the work would supply useful empirical evidence connecting algorithmic information theory to modern neural network practice. The algorithmic refinements to existing pruning techniques could be of interest to the compression community even if the MDL interpretation is set aside.
major comments (2)
- Abstract and teacher-student section: the central generalization claim requires that the ℓ0-regularized sparse models achieve lower total description length than dense models. Parameter count reduction alone does not establish this; a full description length must encode the support (indices) and the precision of retained values. No explicit codelength comparison (e.g., via arithmetic coding of indices plus quantized values) is reported on the teacher-generated data, so observed sample-efficiency gains could arise from generic regularization rather than MDL minimization.
- Abstract: the experimental support for improved sample efficiency is described only at a high level. No error bars, dataset sizes, number of runs, ablation controls, or precise teacher-student protocol (how the teacher is trained, how sparsity is applied during student training, how description length is measured) are supplied, making it impossible to assess whether the Solomonoff prediction has been tested or merely illustrated.
minor comments (2)
- The abstract states that the refined methods 'yield short data description lengths,' but no quantitative values or comparison tables for data codelengths are referenced.
- Notation for the refined probabilistic pruning procedure and the binary-search routine for smooth-ℓ0 should be introduced with explicit equations or pseudocode to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive suggestions. We agree that strengthening the link between our sparsity methods and total description length, as well as providing more experimental detail, will improve the manuscript. We respond to each major comment below and will incorporate the requested clarifications and analyses in the revised version.
read point-by-point responses
-
Referee: Abstract and teacher-student section: the central generalization claim requires that the ℓ0-regularized sparse models achieve lower total description length than dense models. Parameter count reduction alone does not establish this; a full description length must encode the support (indices) and the precision of retained values. No explicit codelength comparison (e.g., via arithmetic coding of indices plus quantized values) is reported on the teacher-generated data, so observed sample-efficiency gains could arise from generic regularization rather than MDL minimization.
Authors: We agree that a complete description length must include the cost of encoding the support (indices of retained parameters) and the precision of the non-zero values, not merely the count of non-zero parameters. In the current manuscript we treat the number of non-zero parameters as the primary computable proxy for model description length, consistent with standard practice in the pruning literature, while reporting data description length via negative log-likelihood. To more directly test the MDL prediction in the teacher-student setting, we will add an explicit total codelength comparison in the revised version. This will encode the binary support mask (via arithmetic coding or an equivalent scheme) together with quantized values of the retained weights and compare the resulting total length against dense baselines on the teacher-generated data. We expect this addition to help separate MDL-driven compression from generic regularization effects. revision: yes
-
Referee: Abstract: the experimental support for improved sample efficiency is described only at a high level. No error bars, dataset sizes, number of runs, ablation controls, or precise teacher-student protocol (how the teacher is trained, how sparsity is applied during student training, how description length is measured) are supplied, making it impossible to assess whether the Solomonoff prediction has been tested or merely illustrated.
Authors: We acknowledge that the abstract summarizes the teacher-student results at a high level. The body of the manuscript already contains a dedicated experimental section that specifies the teacher training procedure (full-dataset supervised training), the precise application of our refined probabilistic and smooth-ℓ0 methods to the student, the image and text datasets with their sizes, and the measurement of description length as the sum of model complexity and data negative log-likelihood. In the revision we will (i) expand the abstract to include the number of independent runs, error bars (standard deviation), and a concise statement of the protocol, and (ii) add a short subsection that explicitly lists ablation controls, dataset cardinalities, and the exact description-length formula used. These changes will make the experimental support for the Solomonoff prediction fully reproducible from the abstract onward. revision: yes
Circularity Check
Sparsity proxy for description length tuned to observed performance, weakening independent verification of Solomonoff prediction
specific steps
-
fitted input called prediction
[Abstract]
"Since exact MDL optimization is intractable, we cast it as ℓ0 regularized learning and explain why parameter sparsity provides an effective computable approximation of model description length. ... Finally, we use these methods in a controlled teacher-student setting to empirically verify the prediction of Solomonoff induction that compressed models learn more sample-efficiently and generalize better."
The paper presents sparsity via ℓ0 regularization as the practical MDL proxy, then uses the resulting models to 'verify' the Solomonoff prediction of better generalization. Because the regularization parameters and pruning routines are refined and selected to match the observed compression-plus-accuracy results on the same teacher-generated data, the efficiency gains are statistically tied to the fitting procedure rather than to an a-priori lower description length.
full rationale
The paper's central empirical verification in the teacher-student setting relies on l0-regularized sparse models as a computable stand-in for MDL, but the regularization strength and pruning thresholds are selected to achieve the reported compression and accuracy outcomes. This creates a partial dependence where the observed sample-efficiency gains are explained by the same fitting process used to construct the 'compressed' models, rather than by an independent demonstration that the final models have strictly lower total description length (including index encoding and value precision) than dense baselines. No explicit codelength comparison is provided, so the Solomonoff link remains interpretive rather than quantitatively closed.
Axiom & Free-Parameter Ledger
free parameters (1)
- l0 regularization strength
axioms (1)
- domain assumption Solomonoff induction and the minimum description length principle predict that simpler models generalize better when data arises from low-complexity distributions.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We conceptually motivate ℓ0 regularization from first principles of algorithmic complexity and Solomonoff induction... empirically verify the prediction of Solomonoff induction that regularized models can exhibit more sample-efficient convergence.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Probabilistic and nonlinear compressive sensing
Anonymous. Probabilistic and nonlinear compressive sensing. To appear soon, 2025
work page 2025
-
[2]
Ayinde, Tamer Inanc, and Jacek M
Babajide O. Ayinde, Tamer Inanc, and Jacek M. Zurada. Redundant feature pruning for accelerated inference in deep neural networks. Neural Networks, 118:148–158, 2019
work page 2019
-
[3]
Iterative thresholding for sparse approximations
Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14:629–654, 2008
work page 2008
-
[4]
Sequential learning of neural networks for prequential mdl
Jorg Bornschein, Yazhe Li, and Marcus Hutter. Sequential learning of neural networks for prequential mdl. arXiv preprint arXiv:2210.07931, 2022
-
[5]
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed opti- mization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011
work page 2011
-
[6]
Feature selection via concave minimization and support vector machines
Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998
work page 1998
-
[7]
Palm: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1– 113, 2023. 11
work page 2023
-
[8]
On the compression of neural networks using l0-norm regularization and weight pruning
Felipe Dennis de Resende Oliveira, Eduardo Luiz Ortiz Batista, and Rui Seara. On the compression of neural networks using l0-norm regularization and weight pruning. Neural Networks, 171:343–352, 2024
work page 2024
- [9]
-
[10]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023
work page 2023
-
[11]
Language Modeling Is Compression
Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christo- pher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Global sparse momentum sgd for pruning very deep neural networks
Xiaohan Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu, et al. Global sparse momentum sgd for pruning very deep neural networks. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[13]
Learning to prune deep neural networks via layer-wise optimal brain surgeon
Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in neural information processing systems, 30, 2017
work page 2017
-
[14]
Disarm: An antithetic gradient estimator for binary latent variables
Zhe Dong, Andriy Mnih, and George Tucker. Disarm: An antithetic gradient estimator for binary latent variables. Advances in neural information processing systems, 33:18637–18647, 2020
work page 2020
- [15]
-
[16]
H. Edelsbrunner, D. Kirkpatrick, and R. Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29(4):551–559, 1983
work page 1983
-
[17]
Scaling rectified flow transform- ers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[18]
Joaquim F. and contributors. Gmt.jl: Julia wrapper for the generic mapping tools. online, 2014–2025. Accessed: 2025-05-02
work page 2014
-
[19]
A Mathematical Introduction to Compressive Sensing
Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing . Applied and Numerical Harmonic Analysis. Springer New York, New York, NY , 2013
work page 2013
-
[20]
The minimum description length principle
Peter D Grünwald. The minimum description length principle. MIT press, 2007
work page 2007
-
[21]
The l0 norm constraint lms algorithm for sparse system identification
Yuantao Gu, Jian Jin, and Shunliang Mei. The l0 norm constraint lms algorithm for sparse system identification. IEEE Signal Processing Letters, 16(9):774–777, 2009
work page 2009
-
[22]
Wiki-40b: Multilingual language model dataset
Mandy Guo, Zihang Dai, Denny Vrande ˇci´c, and Rami Al-Rfou. Wiki-40b: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440–2452, 2020
work page 2020
-
[23]
Dynamic network surgery for efficient dnns
Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. Advances in neural information processing systems, 29, 2016
work page 2016
-
[24]
Learning both weights and connections for efficient neural network
Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015
work page 2015
-
[25]
Extended Comparisons of Best Subset Selection, Forward Stepwise Selection, and the Lasso
Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[26]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
Data-driven sparse structure selection for deep neural networks
Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018
work page 2018
-
[28]
Universal artificial intelligence: Sequential decisions based on algorithmic probability
Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2005. 12
work page 2005
-
[29]
Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit
Yerlan Idelbayev and Miguel Á Carreira-Perpiñán. Exploring the effect of l0 / l2 regularization in neural network pruning using the lc toolkit. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3373–3377. IEEE, 2022
work page 2022
-
[30]
Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Ser- gio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Pete...
work page 2024
-
[31]
Auto-Encoding Variational Bayes
Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[32]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[33]
Andrei N Kolmogorov. On tables of random numbers.Sankhy¯a: The Indian Journal of Statistics, Series A, pages 369–376, 1963
work page 1963
-
[34]
Learning multiple layers of features from tiny images
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. N.A., 2009. 13
work page 2009
-
[35]
Gradient estimation for binary latent variables via gradient variance clipping
Russell Z Kunes, Mingzhang Yin, Max Land, Doron Haviv, Dana Pe’er, and Simon Tavaré. Gradient estimation for binary latent variables via gradient variance clipping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37-7, pages 8405–8412, 2023
work page 2023
-
[36]
Gradient-based learning applied to document recognition
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998
work page 1998
-
[37]
Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989
work page 1989
-
[38]
SNIP: Single-shot Network Pruning based on Connection Sensitivity
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[39]
Pruning Filters for Efficient ConvNets
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[40]
L0-arm: Network sparsification via stochastic binary optimization
Yang Li and Shihao Ji. L0-arm: Network sparsification via stochastic binary optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 432–448. Springer, 2019
work page 2019
-
[41]
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024
work page 2024
-
[42]
Bayesian compression for deep learning
Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. Advances in neural information processing systems, 30, 2017
work page 2017
-
[43]
Learning Sparse Neural Networks through $L_0$ Regularization
Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[44]
Machine learning via polyhedral concave minimization
Olvi L Mangasarian. Machine learning via polyhedral concave minimization. In Applied Mathematics and Parallel Computing: Festschrift for Klaus Ritter, pages 175–188. Springer, 1996
work page 1996
-
[45]
Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007
work page 2007
-
[46]
Variational dropout sparsifies deep neural networks
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International conference on machine learning, pages 2498–2507. PMLR, 2017
work page 2017
-
[47]
Lux: Explicit parameterization of deep neural networks in julia, April 2023
Avik Pal. Lux: Explicit parameterization of deep neural networks in julia, April 2023. If you use this software, please cite it as below
work page 2023
-
[48]
On efficient training & inference of neural differential equations, 2023
Avik Pal. On efficient training & inference of neural differential equations, 2023
work page 2023
-
[49]
Source coding algorithms for fast data compression
Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University CA, 1976
work page 1976
-
[50]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfo...
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[51]
Y .C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol.1, 1993
work page 1993
-
[52]
Pruning deep neural networks with l_0-constrained optimization
Dzung T Phan, Lam M Nguyen, Nam H Nguyen, and Jayant R Kalagnanam. Pruning deep neural networks with l_0-constrained optimization. In 2020 IEEE International Conference on Data Mining (ICDM), pages 1214–1219. IEEE, 2020
work page 2020
-
[53]
Convergence of discrete mdl for sequential prediction
Jan Poland and Marcus Hutter. Convergence of discrete mdl for sequential prediction. In Learning Theory: 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004. Proceedings 17, pages 300–314. Springer, 2004
work page 2004
-
[54]
Universal Differential Equations for Scientific Machine Learning
Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[55]
Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023
Jack Rae. Compression for AGI | Stanford MLSys #76 | Talk on Youtube, 2023. 14
work page 2023
-
[56]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[57]
J. J. Rissanen. Generalized kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976
work page 1976
-
[58]
Modeling by shortest data description
Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978
work page 1978
-
[59]
Learning equations for extrapolation and control
Subham Sahoo, Christoph Lampert, and Georg Martius. Learning equations for extrapolation and control. In International Conference on Machine Learning, pages 4442–4450. Pmlr, 2018
work page 2018
-
[60]
Formal theory of creativity, fun, and intrinsic motivation (1990–2010)
Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3):230–247, 2010
work page 1990
-
[61]
A mathematical theory of communication
Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948
work page 1948
-
[62]
Play and Prune: Adaptive Filter Pruning for Deep Model Compression
Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. Play and prune: Adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[63]
Complexity-based induction systems: comparisons and convergence theorems
Ray Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE transactions on Information Theory, 24(4):422–432, 1978
work page 1978
-
[64]
A formal theory of inductive inference
Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1–22, 1964
work page 1964
-
[65]
Regression shrinkage and selection via the lasso
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996
work page 1996
-
[66]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[67]
Soft Weight-Sharing for Neural Network Compression
Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[68]
Autoprune: Automatic network pruning by regularizing auxiliary parameters
Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019
work page 2019
-
[69]
Probabilistic best subset selection via gradient-based optimization
Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic best subset selection via gradient-based optimization. arXiv preprint arXiv:2006.06448, 2020
-
[70]
Model selection and estimation in regression with grouped variables
Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006
work page 2006
-
[71]
Learning to share: Simultaneous parameter tying and sparsification in deep learning
Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In International Conference on Learning Representations, 2018
work page 2018
-
[72]
A systematic dnn weight pruning framework using alternating direction method of multipliers
Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European conference on computer vision (ECCV) , pages 184–199, 2018
work page 2018
-
[73]
Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2780–2789, 2019. 15 Appendix A Equivalence of minimum description length and Kolmogorov complexity Here, we briefly show the following L...
work page 2019
-
[74]
Suppose we have a model fθ with parameter θ, that we want to optimize to fit some data D and that we want to distribute our computation with N batches of data Di, i ∈ {1, · · · , N} on N compute nodes, that run in parallel. We can do this by copying the model z := θ to N other machines, where we define xi := copy(θ) and where we also copy the data and mod...
-
[75]
So we can use Proposition 2.2 to reformulate the zk+1-update as follows
The big advantage is now that the L0(θ) term occurs only within an optimization step, that does not involve fθ(D). So we can use Proposition 2.2 to reformulate the zk+1-update as follows. First, we introduce two new vector-valued variables w and Z and write z = wZ, where wZ is componentwise multiplication, and w ∈ Rn while Z ∈ { 0, 1}n. Then we 21 apply t...
-
[76]
We can attempt to solve for w by computing ∂L(w, π) ∂wℓ = − X i X j ρ (Mixk+1 i − c + uk i )j − X q Pjq wqπq Pjℓπℓ+ + N ρ X p P 2 pℓwℓπℓ(1 − πℓ) = −ρ X i,j (Mixk+1 i − c + uk i )jPjℓπℓ + N ρ X j,q πℓP T ℓj Pjq πqwq + N ρ X j,q P T ℓj Pjℓπℓ(1 − πℓ)δℓqwq = −ρπℓ X i,j P T ℓj(Mixk+1 i − c + uk i )j+ + N ρπℓ X j,q P T ℓj Pjq πq + (1 − πq)δℓq wq (31) and settin...
-
[77]
In general, this can be non-trivial to solve but for the important special case that P = id, or Pjq = δjq, we obtain 0 = −πℓ X i (Mixk+1 i − c + uk i )ℓ = N πℓ ˆwℓ. (32) This leads to either an arbitrary value for ˆwℓ, if πℓ = 0, or otherwise to ˆw = 1 N X i (Mixk+1 i − c + uk i ) =: 1 N X i rk i (33) This is an average, that does not depend on π anymore
-
[78]
This 22 must be a vertex of the cube [0, 1]n
Inserting this back into L( ˆw, π), for the case Ppj = δpj, we obtain L( ˆw, π) = α X j πj + X i ρ 2 ||rk i − ( ˆwπ)||2 2 + N ρ 2 X j ˆw2 j πj(1 − πj) (34) Since this is linear in πℓ for all ℓ ∈ {1, · · · , n}, the only thing we need to do in order to find a minimum, is to check the sign of the gradient with respect to πℓ, and move πℓ to the most extreme ...
-
[79]
Strong gradients near zero may trap parameters in local minima
-
[80]
Parameters with large magnitudes but low utility may receive weak gradients and be ineffi- ciently optimized
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.