Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

Jonathan Wenshøj; Mahmoud Afifi; Pedram Bakhtiarifard; Raghavendra Selvan; Sophia N. Wilson

arxiv: 2605.15551 · v1 · pith:UAUIGELUnew · submitted 2026-05-15 · 💻 cs.LG

Pedram Bakhtiarifard, Sophia N. Wilson, Mahmoud Afifi, Jonathan Wenshøj, Raghavendra Selvan

Pith reviewed 2026-05-20 20:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords algorithmic complexity, Kolmogorov complexity, deep neural networks, model compression, grokking, overfitting, quantization, learning dynamics

The pith

Training reduces the algorithmic complexity of deep neural network weights in a measurable way.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents QuBD as a way to estimate Kolmogorov-Chaitin-Solomonoff (KCS) complexity for the real-valued weights inside modern neural networks. It first quantizes the weights to a finite alphabet and then aggregates Coding Theorem Method (CTM) estimates computed separately on each bit-plane. According to the authors, this yields the first scalable, tractable estimates of this kind. They report that estimated complexity falls while a model is learning, rises again when overfitting begins, follows the delayed generalization seen in grokking, and sits mostly in the most significant bit-planes. A reader cares because the numbers give a concrete handle on the old idea that learning is a form of compression and point to a simple diagnostic for choosing post-training quantization levels.

Core claim

QuBD quantizes network weights to a finite alphabet and aggregates per-bit-plane CTM estimates; the authors show theoretically that this yields a strictly tighter estimation gap with respect to true KCS complexity than binarization-based methods. Applied to trained networks, the estimated complexity decreases as training proceeds, scales with data budget, rises again during overfitting, follows the delayed-generalization pattern seen in grokking, and correlates with generalization performance, while algorithmic information sits predominantly in the most significant bit-planes.

What carries the argument

QuBD, the Quantized Block Decomposition method: quantize weights to a finite alphabet, then aggregate CTM estimates across bit-planes to estimate KCS complexity for non-binary objects.
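
As a rough illustration of the quantize-then-decompose steps named above, the sketch below maps a short list of weights to 8-bit codes and splits the codes into bit-planes. The min-max uniform rule, the 8-bit depth, and the helper names `quantize_uniform` and `bit_planes` are assumptions made for illustration; this page does not state the paper's exact quantization rule.

```python
from typing import List


def quantize_uniform(weights: List[float], bits: int = 8) -> List[int]:
    """Map real weights to integer codes in [0, 2**bits - 1] (assumed min-max uniform rule)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    if hi == lo:  # constant input: every weight lands in the same bin
        return [0] * len(weights)
    scale = levels / (hi - lo)
    return [min(levels, max(0, round((w - lo) * scale))) for w in weights]


def bit_planes(codes: List[int], bits: int = 8) -> List[List[int]]:
    """Return planes[p][i] = bit p of codes[i]; plane bits - 1 is the most significant."""
    return [[(c >> p) & 1 for c in codes] for p in range(bits)]


weights = [-0.50, -0.25, 0.0, 0.25, 0.50]
codes = quantize_uniform(weights)
planes = bit_planes(codes)

# Each plane is a binary object; recombining the planes recovers the codes exactly.
rebuilt = [sum(planes[p][i] << p for p in range(8)) for i in range(len(codes))]
```

Each entry of `planes` is a binary sequence, which is the kind of object that CTM-style estimators are described as handling.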

If this is right

Complexity of the weights falls steadily while the network is still improving on unseen data.
The measured complexity change scales with the training data budget.
Once overfitting starts, measured complexity begins to climb again.
The same complexity curve reproduces the delayed jump in test accuracy known as grokking.
Lower complexity values line up with better generalization performance.
Most of the algorithmic information is carried by the most significant bit-planes.
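
The fall-then-rise pattern in the list above could be checked on a logged complexity curve with a sketch like the following. The function name `first_sustained_rise`, the `patience` parameter, and the sample values are hypothetical; none of them come from the paper.

```python
from typing import Optional, Sequence


def first_sustained_rise(values: Sequence[float], patience: int = 3) -> Optional[int]:
    """Index of the checkpoint just before the first run of `patience` consecutive increases.

    Returns None when the series never rises that many times in a row.
    """
    run = 0
    for i in range(1, len(values)):
        run = run + 1 if values[i] > values[i - 1] else 0
        if run == patience:
            return i - patience
    return None


# Placeholder complexity values per checkpoint; illustrative only, not data from the paper.
curve = [1.00, 0.90, 0.82, 0.78, 0.77, 0.79, 0.81, 0.84]
turn = first_sustained_rise(curve, patience=3)  # index of the local minimum, 0.77
```

A larger `patience` makes the check less sensitive to checkpoint-to-checkpoint noise, at the cost of reacting later.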

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training runs could be monitored with QuBD to trigger early stopping before overfitting raises complexity.
The bit-plane concentration result suggests a simple rule for choosing post-training quantization depth.
The same estimator might be applied to other high-dimensional objects such as learned embeddings or activation maps.
Refining the quantization grid could shrink the remaining gap to true KCS complexity without losing the scaling advantages.
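
One way the quantization-depth rule suggested above might be written down is shown below: keep the smallest number of most significant planes whose estimates add up to a chosen share of the total. The helper `planes_to_keep`, the `share` threshold, and the example values are all assumptions for illustration and do not appear in the paper.

```python
from typing import Sequence


def planes_to_keep(per_plane: Sequence[float], share: float = 0.95) -> int:
    """Smallest number of leading planes whose summed estimate reaches `share` of the total.

    `per_plane` is ordered from the most significant plane to the least significant.
    """
    total = sum(per_plane)
    if total <= 0:
        return 0
    running = 0.0
    for kept, value in enumerate(per_plane, start=1):
        running += value
        if running >= share * total:
            return kept
    return len(per_plane)


# Placeholder per-plane values for illustration only; not measurements from the paper.
example = [50.0, 30.0, 10.0, 5.0, 2.0, 1.0, 1.0, 1.0]
depth = planes_to_keep(example, share=0.85)  # 50 + 30 + 10 = 90 of 100, so 3 planes
```

Whether a share-of-total criterion tracks post-training accuracy is exactly what would need checking against the paper's PTQ results.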

Load-bearing premise

Quantizing continuous weights to a finite alphabet keeps the structural features that determine their true Kolmogorov complexity.

What would settle it

A controlled run in which a network is trained until it clearly generalizes, yet its QuBD complexity stays flat or rises instead of falling.

Figures

Figures reproduced from arXiv: 2605.15551 by Jonathan Wenshøj, Mahmoud Afifi, Pedram Bakhtiarifard, Raghavendra Selvan, Sophia N. Wilson.

**Figure 2.** QuBD first quantizes weights into finite codes (top), then decomposes them into aligned bit-planes, to aggregate the per-plane KCS estimates (bottom).

**Figure 3.** Complexity gap between estimations of an ordered 8-ary object and its permutation, where ρ is the fraction of symbols d permuted. The gap is shown (left) for an object with d = 1M symbols as ρ increases, and (right) across object sizes for different methods when ρ = 1. Since the permutation preserves all symbol counts and alters only their positions, any difference in complexity estimates must arise from…
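
The observation in the Figure 3 caption, that a permutation keeps every symbol count and changes only positions, can be reproduced on a toy 8-ary object. In the sketch below, `zlib` serves only as a convenient position-sensitive stand-in; it is not CTM, BDM, or QuBD, and the object size and seed are arbitrary choices.

```python
import random
import zlib
from collections import Counter

n_per_symbol = 1000
# Ordered 8-ary object: a run of 0s, then a run of 1s, and so on up to 7.
ordered = bytes(s for s in range(8) for _ in range(n_per_symbol))

rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = list(ordered)
rng.shuffle(shuffled)   # full shuffle, loosely corresponding to rho = 1
permuted = bytes(shuffled)

# Symbol histograms are identical, so any count-based measure cannot tell the two apart.
same_counts = Counter(ordered) == Counter(permuted)

# A position-sensitive measure does separate them.
len_ordered = len(zlib.compress(ordered, 9))
len_permuted = len(zlib.compress(permuted, 9))
```

Here `same_counts` is true while `len_ordered` is much smaller than `len_permuted`, so the difference between the two objects lies entirely in symbol positions.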

**Figure 4.** Complexity of trained models compared to random initialization when using QuBD, BDM, …

**Figure 5.** Learning curves along with ∆C_QuBD evolution for MLP (left) and Tiny ViT (center). Algorithmic complexity during grokking (right) for the modulo operator with P = 97. … compression using GZIP. As clearly seen in these plots, neither measure correlates meaningfully with learning, in contrast to QuBD. The results shown here are when using 8-bit planes for QuBD. Additional experimental details are provided in App…

**Figure 6.** QuBD complexity ratio ∆C_QuBD per layer for ResNet-18 for the two MSB-planes (P7, P6). Algorithmic Information Resides in Higher Bit-Planes. QuBD complexity offers tighter estimates of KCS complexity of non-binary objects compared to BDM, due to Thm. 3.1 and Thm. 3.2. Adding bit-planes captures additional algorithmic information, but saturates beyond a few. This is illustrated in …

**Figure 7.** QuBD complexity ∆C_QuBD per bit-plane (left) and top-1 PTQ accuracy on ImageNet (right) for five pretrained models under 1-to-8-bit quantization. Dashed lines indicate FP32 baseline accuracy. QuBD Complexity as a Diagnostic for Compression. To test this hypothesis, we measure the QuBD complexity of five full precision (FP32) models: ResNet-18, ResNet-50, ViT, EfficientNet, and MobileNet, trained on Image…

**Figure 8.** Relative gap (±1 std.) to the finite CTM-support reference over 100 trials (fixed symbol relabelings). The aligned QuBD control decreases as more MSB bit-planes are retained, validating the residual refinement predicted by Thm 3.2. Under symbol relabeling, QuBD remains closer to the reference than serialization, one-bit quantization, and sign binarization. Fix a block size π and a finite CTM table with sup…

**Figure 9.** Empirical saturation rates as the number of symbols …

**Figure 10.** Complexity gap with (top) π = (3 × 3) and (bottom) π = (2 × 2), between estimations of an ordered 8-ary object and its permutation, where ρ is the fraction of symbols d permuted. The gap is shown (left) for an object with d = 1M symbols as ρ increases, and (right) across object sizes for different methods when ρ = 1. QuBD keeps a larger and more stable estimation gap than serialization, one-bit, and sign-…

**Figure 11.** QuBD complexity ratio (pretrained / random) for 100 pretrained models from the …

**Figure 12.** Figure 12 shows the normalized QuBD complexity per layer for ResNet-50 for the two MSB-planes (P7, P6). It shows significant differences in reduction across different layers spanning close to ∆C_QuBD = 0% and up to ∆C_QuBD = 100%. Experimental details can be found in Appx. E.2. [Plot: ResNet-50; x-axis: Layer Index; y-axis: ∆C_QuBD (%); series: bit-planes P7 and P6.]

Original abstract

Training large-scale deep neural networks (DNNs) is resource-intensive, making model compression a practical necessity. The widely accepted "learning as compression" hypothesis posits that training induces structure in network weights, which enables compression. Measuring this structure through Kolmogorov-Chaitin-Solomonoff (KCS) complexity is appealing, but existing estimators based on the Coding Theorem Method (CTM) and the Block Decomposition Method (BDM) are limited to small binary objects and do not scale to modern DNNs. We introduce the Quantized Block Decomposition method (QuBD), which extends algorithmic complexity estimation to any $k$-ary object. QuBD first quantizes the network weights to a finite alphabet, then estimates the KCS complexity by aggregating per bit-plane CTM estimates. We show theoretically that QuBD yields a strictly tighter estimation gap with respect to true KCS complexity than binarization-based methods. Using QuBD, we study how the algorithmic complexity of neural network weights evolves during training, showing that it decreases as models learn, scales with data budget, increases during overfitting, follows the delayed generalization observed during grokking, and correlates with generalization performance. We further show that algorithmic information resides predominantly in the most significant bit-planes, which can serve as a practical diagnostic for determining appropriate post-training quantization levels. This work offers novel insights into learning mechanisms in DNNs by providing the first scalable, tractable estimates of KCS complexity for large, non-binary objects such as DNN weights.
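
To make the "aggregating per bit-plane CTM estimates" step more concrete, here is a minimal sketch built on the usual block-decomposition form, in which each distinct block contributes its CTM value plus the base-2 logarithm of its multiplicity. Several parts are assumptions: `standin_ctm` is a zlib-based placeholder and not a real CTM table, the block size of 12 is arbitrary, and adding the per-plane values together is only one plausible reading of "aggregating".

```python
import math
import random
import zlib
from collections import Counter
from typing import Callable, Sequence, Tuple

Block = Tuple[int, ...]


def standin_ctm(block: Block) -> float:
    """Placeholder score, NOT a real CTM lookup: zlib-compressed size of the block in bits."""
    return 8.0 * len(zlib.compress(bytes(block), 9))


def bdm(bits: Sequence[int], block_size: int, ctm: Callable[[Block], float]) -> float:
    """Sum ctm(block) + log2(multiplicity) over the distinct non-overlapping blocks."""
    blocks = [tuple(bits[i:i + block_size]) for i in range(0, len(bits), block_size)]
    return sum(ctm(b) + math.log2(n) for b, n in Counter(blocks).items())


def plane_aggregate(codes: Sequence[int], n_bits: int, block_size: int = 12) -> float:
    """Assumed aggregation rule: add the block-decomposition estimate of every bit-plane."""
    return sum(
        bdm([(c >> p) & 1 for c in codes], block_size, standin_ctm)
        for p in range(n_bits)
    )


regular = [i % 4 for i in range(240)]               # periodic 2-bit codes
rng = random.Random(0)                              # fixed seed for repeatability
irregular = [rng.randrange(4) for _ in range(240)]  # pseudo-random 2-bit codes

score_regular = plane_aggregate(regular, n_bits=2)
score_irregular = plane_aggregate(irregular, n_bits=2)
```

The periodic codes collapse to a single repeated block per plane, so `score_regular` comes out well below `score_irregular`. With a real CTM table in place of the placeholder, the absolute numbers would differ.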

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QuBD scales CTM/BDM to quantized DNN weights and tracks complexity drops during learning plus rises in overfitting and grokking, but the quantization step may be shaping the measured changes more than the underlying KCS structure.

The paper's core move is to quantize real-valued network weights to a k-ary alphabet, run CTM on the resulting bit planes, and aggregate to get a complexity estimate that they say has a strictly smaller gap to true KCS than plain binarization. They then plot this estimate across training and report that it falls as accuracy rises, grows with overfitting, tracks the delayed jump in grokking, scales with dataset size, and correlates with final generalization. Most of the estimated complexity sits in the highest bit planes, which they suggest could guide post-training quantization choices.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Quantized Block Decomposition (QuBD) method, which extends CTM/BDM-based KCS complexity estimation to non-binary objects by first quantizing DNN weights to a finite k-ary alphabet and then aggregating per-bit-plane CTM estimates. It claims a strictly tighter upper bound on the estimation gap relative to true KCS complexity than binarization baselines. Empirically, QuBD is used to show that weight complexity decreases during training, scales with data budget, rises during overfitting, tracks the delayed generalization phase in grokking, and correlates with generalization error; additionally, most algorithmic information resides in the most significant bit-planes, with implications for post-training quantization.

Significance. If the central claims hold, the work supplies the first scalable, non-binary KCS estimator applicable to modern DNNs and offers concrete empirical support for the learning-as-compression hypothesis, including links to grokking and generalization. The theoretical gap improvement and the bit-plane dominance observation are potentially useful for both theory and practice. However, the absence of full proofs and detailed experimental protocols in the current version limits immediate impact.

major comments (2)

[§3] §3 (theoretical gap derivation): The claim of a strictly tighter estimation gap for QuBD versus binarization is derived by aggregating CTM over k-ary bit-planes after quantization. This bound holds only under the assumption that the chosen quantization mapping (implicitly uniform or magnitude-based) does not systematically alter the shortest program length for the weight tensor. Because real-valued weights map many-to-one into bins, small perturbations inside a bin can change compressibility without changing network behavior; the manuscript does not provide a formal argument that the gap improvement survives such perturbations.
[Empirical sections on grokking and overfitting] Empirical sections on grokking and overfitting (e.g., the experiments tracking complexity during delayed generalization): The reported increases in complexity during overfitting and the alignment with grokking lack explicit controls for the quantization alphabet size k and threshold selection. If these hyperparameters are chosen after observing the complexity curves, the claimed correlations with generalization and data budget could be artifacts of the discretization rather than reflections of true KCS changes.
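
The many-to-one mapping raised in the first major comment can be seen on a toy tensor: a small shift that keeps every weight inside its bin leaves the quantized codes untouched, so any estimator computed on the codes returns the same value for both tensors. The fixed-range floor quantizer, the 2-bit depth, and the example weights below are assumptions chosen for illustration.

```python
from typing import List


def quantize(weights: List[float], lo: float, hi: float, bits: int) -> List[int]:
    """Uniform quantizer over a fixed range [lo, hi) using floor binning (assumed rule)."""
    n_bins = 1 << bits
    width = (hi - lo) / n_bins
    return [min(n_bins - 1, max(0, int((w - lo) // width))) for w in weights]


weights = [-0.90, -0.40, 0.10, 0.60]
nudged = [w + 0.01 for w in weights]  # small shift that stays inside each bin

# With 2 bits on [-1, 1) the bins are 0.5 wide, so both tensors map to the same codes.
codes_a = quantize(weights, -1.0, 1.0, bits=2)
codes_b = quantize(nudged, -1.0, 1.0, bits=2)
```

Both calls return `[0, 1, 2, 3]`, which is why a bound stated for the quantized object says nothing by itself about perturbations of the real-valued weights inside a bin.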

minor comments (2)

[Abstract and Introduction] The abstract and introduction would benefit from an explicit statement of the quantization rule (uniform, magnitude-based, or learned) and the precise value of k used in the main experiments.
[Figures on bit-plane analysis] Figure captions for the bit-plane dominance plots should include the exact bit depth and the definition of 'most significant bit-planes' to allow direct replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where clarifications or additions are warranted. We have outlined specific revisions to strengthen the theoretical discussion and empirical robustness.

Referee: [§3] §3 (theoretical gap derivation): The claim of a strictly tighter estimation gap for QuBD versus binarization is derived by aggregating CTM over k-ary bit-planes after quantization. This bound holds only under the assumption that the chosen quantization mapping (implicitly uniform or magnitude-based) does not systematically alter the shortest program length for the weight tensor. Because real-valued weights map many-to-one into bins, small perturbations inside a bin can change compressibility without changing network behavior; the manuscript does not provide a formal argument that the gap improvement survives such perturbations.

Authors: We thank the referee for this precise observation on the scope of the bound in §3. The theoretical result establishes a strictly tighter upper bound on the gap between the QuBD estimate and the true KCS complexity of the quantized tensor, relative to binarization, because the k-ary representation retains more structural information per bit-plane. We agree that the manuscript does not supply an explicit formal argument showing that this gap improvement is invariant to arbitrary intra-bin perturbations of the original real-valued weights. In the revision we will add a clarifying paragraph in §3 stating that the bound is derived for the fixed quantized object (the relevant quantity for tracking learning dynamics) and that the quantization mapping is chosen to preserve functional equivalence of the network. We will also note that CTM-based estimates exhibit stability under small local changes within the same alphabet, supporting that the observed trends are not artifacts of the discretization step. This addresses the concern directly without overstating the result. revision: partial
Referee: [Empirical sections on grokking and overfitting] Empirical sections on grokking and overfitting (e.g., the experiments tracking complexity during delayed generalization): The reported increases in complexity during overfitting and the alignment with grokking lack explicit controls for the quantization alphabet size k and threshold selection. If these hyperparameters are chosen after observing the complexity curves, the claimed correlations with generalization and data budget could be artifacts of the discretization rather than reflections of true KCS changes.

Authors: We concur that explicit sensitivity controls for k and quantization thresholds are necessary to rule out discretization artifacts. In the original work, k=4 was selected after preliminary runs showed that complexity estimates stabilized beyond this value while computational cost remained tractable; thresholds followed a uniform scheme derived from the empirical weight range. To directly address the referee’s concern, we have performed additional experiments sweeping k from 2 to 8 and comparing uniform versus magnitude-based thresholds. The reported trends—monotonic complexity decrease during training, rise during overfitting, and alignment with the delayed-generalization phase of grokking—remain qualitatively unchanged across these settings, and the correlation with generalization error is preserved. We will add a dedicated robustness subsection (or appendix) describing the hyperparameter selection protocol and presenting the sensitivity results. This revision eliminates any ambiguity regarding post-hoc tuning and reinforces that the empirical findings reflect genuine algorithmic-complexity dynamics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines QuBD explicitly as quantization of real-valued weights to a k-ary alphabet followed by per-bit-plane CTM aggregation, then proves a strictly tighter gap bound relative to binarization from that construction. The central empirical claims (complexity decrease during learning, scaling with data, increase on overfitting, tracking of grokking, correlation with generalization, and MSB dominance) are direct applications of this estimator to trained networks and are compared against external binarization baselines rather than being fitted parameters or self-defined quantities. No load-bearing self-citations, imported uniqueness theorems, or ansatzes from prior author work appear in the provided text; the method is presented as an extension of independent CTM/BDM literature and remains self-contained against those benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard properties of the Coding Theorem Method for small objects and introduces quantization level choice as a modeling decision whose effect on the final complexity estimate is not independently validated in the abstract.

free parameters (1)

quantization alphabet size or bit depth
Finite alphabet chosen for weight quantization directly determines the bit-planes analyzed and therefore the numerical complexity value reported.
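
The link between alphabet size and the number of bit-planes can be spelled out with a small helper. It assumes that symbol indices are stored in plain binary, so a k-symbol alphabet needs ceil(log2 k) planes; the name `planes_for_alphabet` is hypothetical.

```python
def planes_for_alphabet(k: int) -> int:
    """Bits needed to index k symbols, i.e. ceil(log2(k)), using integer arithmetic only."""
    if k < 2:
        raise ValueError("alphabet needs at least two symbols")
    return (k - 1).bit_length()


plane_counts = {k: planes_for_alphabet(k) for k in (2, 4, 8, 256)}
# plane_counts == {2: 1, 4: 2, 8: 3, 256: 8}
```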

axioms (1)

standard math: CTM provides a usable approximation to Kolmogorov complexity for small discrete objects
QuBD aggregates per-bit-plane CTM estimates; this inherits the approximation guarantees and limitations of CTM.

pith-pipeline@v0.9.0 · 5821 in / 1372 out tokens · 55760 ms · 2026-05-20T20:19:00.501063+00:00 · methodology

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 21 internal anchors

[1]

Jacob, Benoit and Kligys, Skirmantas and Chen, Bo and Zhu, Menglong and Tang, Matthew and Howard, Andrew and Adam, Hartwig and Kalenichenko, Dmitry , booktitle =

work page
[2]

Hochreiter, Sepp and Schmidhuber, J\"

work page
[3]

2008 , publisher=

Li, Ming and Vit. 2008 , publisher=

work page 2008
[4]

Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others , journal=

work page
[5]

Algorithmic Simplification of Neural Networks with Mosaic-of-Motifs

Pedram Bakhtiarifard and Tong Chen and Jonathan Wenshøj and Erik B Dam and Raghavendra Selvan , year=. 2602.14896 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Seldin, Yevgeny , journal=

work page
[7]

URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116

Belkin, Mikhail and Hsu, Daniel and Ma, Siyuan and Mandal, Soumik , year=. Proceedings of the National Academy of Sciences , publisher=. doi:10.1073/pnas.1903070116 , number=

work page doi:10.1073/pnas.1903070116
[8]

Chaitin, Gregory , year=

work page
[9]

2017 , url =

Neyshabur, Behnam and Bhojanapalli, Srinadh and McAllester, David and Srebro, Nathan , booktitle =. 2017 , url =

work page 2017
[10]

sztal/pybdm: v0.1.0

Talaga, Szymon and Tsampourakis, Kostas. sztal/pybdm: v0.1.0

work page
[11]

Cover, Thomas M , year=

work page
[12]

1966 , publisher=

Chaitin, Gregory J , journal=. 1966 , publisher=

work page 1966
[13]

2012 , publisher=

Delahaye, Jean-Paul and Zenil, Hector , journal=. 2012 , publisher=

work page 2012
[14]

2003 , publisher=

McMillan, Brockway , journal=. 2003 , publisher=

work page 2003
[15]

Kraft, Leon Gordon , year=

work page
[16]

1964 , publisher=

Solomonoff, Ray J , journal=. 1964 , publisher=

work page 1964
[17]

1969 , publisher=

Chaitin, Gregory J , journal=. 1969 , publisher=

work page 1969
[18]

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Sachin Mehta and Mohammad Rastegari , year=. 2110.02178 , archivePrefix=

work page internal anchor Pith review arXiv
[19]

Stronger generalization bounds for deep nets via a compression approach

Sanjeev Arora and Rong Ge and Behnam Neyshabur and Yi Zhang , year=. 1802.05296 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

2024 , url=

Micah Goldblum and Marc Anton Finzi and Keefer Rowan and Andrew Gordon Wilson , booktitle =. 2024 , url=

work page 2024
[21]

Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility

Hector Zenil and Fernando Soler-Toscano and Jean-Paul Delahaye and Nicolas Gauvrit , year=. 1212.6745 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

2025 , organization=

Li, Guoyu and Ye, Shengyu and Chen, Chunyun and Wang, Yang and Yang, Fan and Cao, Ting and Liu, Cheng and Aly, Mohamed M Sabry and Yang, Mao , booktitle =. 2025 , organization=

work page 2025
[23]

Gong, Yunchao and Liu, Liu and Yang, Ming and Bourdev, Lubomir , journal=

work page
[24]

GitHub repository , doi =

Ross Wightman , title =. GitHub repository , doi =. 2019 , publisher =

work page 2019
[25]

, title =

Chaitin, Gregory J. , title =. 1975 , issue_date =. doi:10.1145/321892.321894 , journal =

work page doi:10.1145/321892.321894 1975
[26]

Shannon, C. E. , journal=. 1948 , volume=

work page 1948
[27]

Science , author=

R. Milo and S. Shen-Orr and S. Itzkovitz and N. Kashtan and D. Chklovskii and U. Alon , title =. Science , volume =. 2002 , doi =. https://www.science.org/doi/pdf/10.1126/science.298.5594.824 , abstract =

work page doi:10.1126/science.298.5594.824 2002
[28]

Science , author=

Benson, Austin R. and Gleich, David F. and Leskovec, Jure , year=. Science , publisher=. doi:10.1126/science.aad9029 , number=

work page doi:10.1126/science.aad9029
[29]

and Hinton, Geoffrey E

Nowlan, Steven J. and Hinton, Geoffrey E. , title =. Neural Computation , volume =. 1992 , month =. doi:10.1162/neco.1992.4.4.473 , url =

work page doi:10.1162/neco.1992.4.4.473 1992
[30]

Deep $k$-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions

Junru Wu and Yue Wang and Zhenyu Wu and Zhangyang Wang and Ashok Veeraraghavan and Yingyan Lin , year=. 1806.09228 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Taylor and David J

Michael Burrows and D J Wheeler D I G I T A L and Robert W. Taylor and David J. Wheeler and David Wheeler , year=

work page
[32]

1996 , note =

Julian Seward , title =. 1996 , note =

work page 1996
[33]

, journal=

Huffman, David A. , journal=. 1952 , volume=

work page 1952
[34]

1992 , url =

Gailly, Jean-Loup and Adler, Mark , title =. 1992 , url =

work page 1992
[35]

and Ziv, J

Lempel, A. and Ziv, J. , journal=. 1976 , volume=

work page 1976
[36]

Rissanen , keywords =

J. Rissanen , keywords =. Automatica , volume =. 1978 , issn =. doi:https://doi.org/10.1016/0005-1098(78)90005-5 , url =

work page doi:10.1016/0005-1098(78)90005-5 1978
[37]

A tutorial introduction to the minimum description length principle

Peter Grunwald , year=. math/0406077 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

2014 , month =

PLOS ONE , publisher =. 2014 , month =. doi:10.1371/journal.pone.0096223 , author =

work page doi:10.1371/journal.pone.0096223 2014
[39]

Xu, Aolin and Raginsky, Maxim , journal=

work page
[40]

Emergence of Invariance and Disentanglement in Deep Representations

Alessandro Achille and Stefano Soatto , year=. 1706.01350 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

2019 , publisher=

Chaudhari, Pratik and Choromanska, Anna and Soatto, Stefano and LeCun, Yann and Baldassi, Carlo and Borgs, Christian and Chayes, Jennifer and Sagun, Levent and Zecchina, Riccardo , journal=. 2019 , publisher=

work page 2019
[42]

2017 , publisher =

Dinh, Laurent and Pascanu, Razvan and Bengio, Samy and Bengio, Yoshua , booktitle =. 2017 , publisher =

work page 2017
[43]

Xiao, Han and Rasul, Kashif and Vollgraf, Roland , journal=

work page
[44]

Krizhevsky, Alex and Hinton, Geoffrey and others , year=

work page
[45]

Jia Deng and Wei Dong and Richard Socher and Li-Jia Li and Kai Li and Li Fei-Fei , title =

work page
[46]

1912.02178 , archivePrefix=

Yiding Jiang and Behnam Neyshabur and Hossein Mobahi and Dilip Krishnan and Samy Bengio , year=. 1912.02178 , archivePrefix=

work page arXiv 1912
[47]

McAllester , booktitle =

David A. McAllester , booktitle =. 1999 , url=

work page 1999
[48]

Communications of the ACM , doi =

Zhang, Chiyuan and Bengio, Samy and Hardt, Moritz and Recht, Benjamin and Vinyals, Oriol , year =. Communications of the ACM , doi =

work page
[49]

Bartlett, Peter L and Mendelson, Shahar , journal=

work page
[50]

Vapnik, V. N. , title =. Trans. Neur. Netw. , month = sep, pages =. 1999 , issue_date =. doi:10.1109/72.788640 , abstract =

work page doi:10.1109/72.788640 1999
[51]

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Mingxing Tan and Quoc V. Le , year=. 1905.11946 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv 1905
[52]

Smith and Oren Etzioni , year=

Roy Schwartz and Jesse Dodge and Noah A. Smith and Oren Etzioni , year=. 1907.10597 , archivePrefix=

work page arXiv 1907
[53]

Carbon Emissions and Large Neural Network Training

David Patterson and Joseph Gonzalez and Quoc Le and Chen Liang and Lluis-Miquel Munguia and Daniel Rothchild and David So and Maud Texier and Jeff Dean , year=. 2104.10350 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[54]

Efficient Neural Architecture Search via Parameter Sharing

Hieu Pham and Melody Y. Guan and Barret Zoph and Quoc V. Le and Jeff Dean , year=. 1802.03268 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

A White Paper on Neural Network Quantization

Markus Nagel and Marios Fournarakis and Rana Ali Amjad and Yelysei Bondarenko and Mart van Baalen and Tijmen Blankevoort , year=. 2106.08295 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

Shwartz-Ziv, Ravid and Tishby, Naftali , journal=

work page
[57]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han and Huizi Mao and William J. Dally , year=. 1510.00149 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

2019 , url =

Frankle, Jonathan and Carbin, Michael , booktitle =. 2019 , url =

work page 2019
[59]

Kingma, Diederik P and Ba, Jimmy , booktitle =

work page
[60]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever , year=. 2103.00020 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Grandmaster level in StarCraft II using multi-agent reinforcement learning

Vinyals, Oriol and Babuschkin, Igor and Czarnecki, Wojciech M and Mathieu, Micha \"e l and Dudzik, Andrew and Chung, Junyoung and Choi, David H and Powell, Richard and Ewalds, Timo and Georgiev, Petko and Oh, Junhyuk and Horgan, Dan and Kroiss, Manuel and Danihelka, Ivo and Huang, Aja and Sifre, Laurent and Cai, Trevor and Agapiou, John P and Jaderberg, M...

work page
[62]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby , year=. 2010.11929 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv 2010
[63]

Rosenfeld and Amir Rosenfeld and Yonatan Belinkov and Nir Shavit , year=

Jonathan S. Rosenfeld and Amir Rosenfeld and Yonatan Belinkov and Nir Shavit , year=. 1909.12673 , archivePrefix=

work page arXiv 1909
[64]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness and Sharan Narang and Newsha Ardalani and Gregory Diamos and Heewoo Jun and Hassan Kianinejad and Md. Mostofa Ali Patwary and Yang Yang and Yanqi Zhou , year=. 1712.00409 , archivePrefix=

