Recognition: 2 theorem links
Lean Theorem · Neural Weight Norm = Kolmogorov Complexity
Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3
The pith
The smallest weight norm of a looped neural network in fixed precision equals the Kolmogorov complexity of its output string up to a log factor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser.
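Read as a sandwich bound, with N(s) standing for the minimal number of non-zero parameters of a fixed-precision looped network outputting s (the notation of the paper's informal main theorem, which also argues the minimal weight norm matches N(s) up to constants in fixed precision), the claim is:

```latex
% Informal sandwich bound as stated in the paper; constants are suppressed.
% N(s): minimal non-zero parameter count of a fixed-precision looped network outputting s.
% K(s): Kolmogorov complexity of the binary string s.
N(s) \;\le\; K(s) \;\le\; N(s)\,\log N(s)
```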
What carries the argument
A fixed-precision looped neural network whose minimal weight norm tracks Kolmogorov complexity through two reductions: any universal Turing machine program can be encoded into the weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead.
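A minimal sketch of the counting behind the second reduction, assuming a network with `num_params` addressable weight slots, `k` of them non-zero, each stored at a fixed number of precision bits; the function, the header term, and the exact encoding are illustrative assumptions, not the paper's construction.

```python
import math

def description_length_bits(num_params: int, nonzero: list[tuple[int, int]], precision_bits: int) -> float:
    """Upper-bound the bits needed to describe a fixed-precision sparse network
    by enumerating its non-zero parameters: each entry costs an address
    (log2 of the parameter count) plus its fixed-precision value.

    `nonzero` is a list of (index, quantized_value) pairs; the constants here
    are illustrative, not the paper's exact scheme.
    """
    k = len(nonzero)
    address_bits = math.log2(max(num_params, 2))   # logarithmic addressing overhead
    payload_bits = precision_bits                  # fixed cost per stored value
    header_bits = 2 * math.log2(max(k, 2)) + 1     # rough self-delimiting length of the list
    return header_bits + k * (address_bits + payload_bits)

# Toy check: k non-zero parameters in a network with P slots costs roughly
# k * (log2 P + b) bits, i.e. parameter count times a logarithmic factor.
print(description_length_bits(num_params=10_000, nonzero=[(i, 1) for i in range(50)], precision_bits=2))
```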
If this is right
- Weight decay on looped neural networks induces a prior matching Solomonoff's universal prior up to polynomial factors.
- The bound holds for every choice of weight norm because all norms reduce to the count of nonzero parameters up to constants.
- The logarithmic overhead is tight and is realized when the nonzero parameters encode a permutation (see the Stirling estimate after this list).
- The fixed-precision restriction is required; infinite precision allows non-computable encodings that break the connection to Kolmogorov complexity.
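The Stirling estimate behind the tightness bullet above, as we read it: describing a permutation of k items takes about log2(k!) bits, which is k log2 k up to lower-order terms, so a network whose k non-zero parameters encode a permutation outputs a string whose complexity is the parameter count times its logarithm. The identity below is standard; tying it to the paper's permutation construction is our gloss.

```latex
% Stirling's approximation for the bits needed to specify a permutation of k items.
\log_2(k!) \;=\; k \log_2 k \;-\; k \log_2 e \;+\; O(\log k)
```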
Where Pith is reading between the lines
- Regularization can be reinterpreted as an implicit search for the shortest program consistent with the observed data.
- The result offers one route to understanding why overparameterized networks generalize when regularized: they are driven toward low-complexity solutions.
- Empirical checks could compare minimal-norm networks against known low-complexity strings such as periodic or repetitive sequences.
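One way to run the empirical check in the last bullet, sketched under the assumption that compressed length is an acceptable computable stand-in for Kolmogorov complexity (it is only an upper bound); the proxy, the strings, and the scale are ours, not the paper's.

```python
import random
import zlib

def complexity_proxy_bits(s: str) -> int:
    """Crude computable upper bound on K(s): bits in the zlib-compressed string."""
    return 8 * len(zlib.compress(s.encode(), 9))

periodic = "01" * 500                                        # low-complexity string
noisy = "".join(random.choice("01") for _ in range(1000))    # incompressible with high probability

for name, s in [("periodic", periodic), ("random", noisy)]:
    print(f"{name:8s} length={len(s)} proxy_bits={complexity_proxy_bits(s)}")
```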
Load-bearing premise
The weights must operate under a fixed-precision regime; without it, networks can encode non-computable functions and the weight norm ceases to track Kolmogorov complexity.
What would settle it
Exhibit a fixed-precision looped network and a binary string it outputs for which the smallest achievable weight norm differs from the Kolmogorov complexity of the string by more than a logarithmic factor.
read the original abstract
Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings.
Significance. If the result holds, it would provide a theoretical foundation linking weight regularization in neural networks directly to Kolmogorov complexity and Solomonoff's universal prior. The explicit reductions, tightness via permutation encodings, and emphasis on the fixed-precision assumption as essential are notable strengths that could clarify inductive biases in deep learning. The norm-agnostic framing would make the result broadly applicable to different regularizers if substantiated.
major comments (1)
- Abstract: The claim that 'in fixed precision, every weight norm collapses to the non-zero parameter count up to constants' and therefore 'the same sandwich bound holds for any norm used as a regulariser' is not supported. The two reductions establish that the minimal number of non-zero parameters k satisfies k = Θ(K / log K) for Kolmogorov complexity K. However, even with fixed-precision weights bounded away from zero, norms scale differently: ||w||_1 = Θ(k) while ||w||_2 = Θ(√k). Thus the minimal L2 norm is Θ(√K), inducing a prior ~exp(−λ √K) rather than ~2^{-K}. The ratio between these priors is exponential in √K (super-polynomial), contradicting the claimed polynomial-factor equivalence to Solomonoff's prior. This is load-bearing for the central implication regarding weight decay.
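A numeric illustration of the referee's scaling point, assuming non-zero fixed-precision weights all of magnitude 1 (any magnitude bounded away from zero gives the same orders): the L1 norm of k such weights grows like k while the L2 norm grows like the square root of k, so the two regularisers price the same sparse solution very differently. The numbers are ours; the referee's argument is the asymptotic statement only.

```python
import math

def minimal_norms(k: int, magnitude: float = 1.0) -> tuple[float, float]:
    """L1 and L2 norms of a weight vector with k non-zero entries of equal magnitude."""
    return k * magnitude, math.sqrt(k) * magnitude  # Theta(k) vs Theta(sqrt(k))

for k in (10, 100, 1_000, 10_000):
    l1, l2 = minimal_norms(k)
    print(f"k={k:6d}  ||w||_1={l1:9.1f}  ||w||_2={l2:8.2f}  ratio={l1 / l2:6.1f}")
# The ratio sqrt(k) is unbounded, so a prior exp(-lambda * ||w||_2) prices the string
# at the exp(-Theta(sqrt(K))) scale rather than the 2^(-K) scale of the universal prior.
```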
minor comments (1)
- The looped neural network architecture (central to encoding arbitrary-length outputs) is only sketched in the abstract; a dedicated explanation of the model, including the looping mechanism and output generation, should be added to the main text for clarity.
Simulated Author's Rebuttal
We thank the referee for their thorough review and for identifying a key imprecision in our abstract. We agree with the major comment and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: Abstract: The claim that 'in fixed precision, every weight norm collapses to the non-zero parameter count up to constants' and therefore 'the same sandwich bound holds for any norm used as a regulariser' is not supported. The two reductions establish that the minimal number of non-zero parameters k satisfies k = Θ(K / log K) for Kolmogorov complexity K. However, even with fixed-precision weights bounded away from zero, norms scale differently: ||w||_1 = Θ(k) while ||w||_2 = Θ(√k). Thus the minimal L2 norm is Θ(√K), inducing a prior ~exp(−λ √K) rather than ~2^{-K}. The ratio between these priors is exponential in √K (super-polynomial), contradicting the claimed polynomial-factor equivalence to Solomonoff's prior. This is load-bearing for the central implication regarding weight decay.
Authors: We thank the referee for highlighting this important point. We acknowledge that the statement in the abstract regarding the norm-agnostic nature is not accurate for all norms. The reductions indeed show that the minimal number of non-zero parameters k is Θ(K / log K). For the L1 norm, ||w||_1 = Θ(k) since weights are bounded away from zero in fixed precision, so the minimal L1 norm is Θ(K / log K). However, for the L2 norm, it is Θ(√k) = Θ(√(K / log K)), leading to a different scaling. We agree that this means the induced prior for L2 regularization does not match the universal prior up to a polynomial factor. We will revise the abstract and the relevant sections to remove the claim that the result is norm-agnostic and instead specify that it applies to the number of non-zero parameters and to norms equivalent to it up to constant factors (such as L1). We will also add a discussion on the implications for different regularization techniques, noting that for L2 the connection is weaker. This revision will be made in the next version of the manuscript.
revision: yes
Circularity Check
No circularity: explicit bidirectional reductions between program length and weight norm
full rationale
The paper's derivation consists of two direct constructive reductions presented in the abstract: (1) encoding any UTM program into neural weights at unit cost per bit, and (2) describing any fixed-precision network by enumerating its non-zero parameters with logarithmic addressing overhead. These mappings establish sandwich bounds between Kolmogorov complexity K(s) and the minimal non-zero parameter count (hence weight norm, under the fixed-precision assumption that collapses norms to parameter count up to constants). No step defines the target quantity in terms of itself, fits parameters to data then renames the fit as a prediction, or relies on self-citations for uniqueness or ansatz. The fixed-precision regime is an explicit assumption, not smuggled in. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit (a toy encoding sketch follows this list).
- domain assumption: Fixed-precision weights are required; infinite precision allows non-computable functions.
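A toy rendering of the first domain assumption, assuming the "unit cost per program bit" encoding simply writes each program bit into one slot of a dedicated ternary weight block, bit 1 as +1 and bit 0 as -1; the block layout and this exact mapping are our placeholders, not the paper's construction.

```python
def program_bits_to_weights(program: str) -> list[float]:
    """Encode a universal Turing machine program (a binary string) into a weight
    block at unit cost per bit: one non-zero weight per program bit
    (bit '1' -> +1.0, bit '0' -> -1.0). Placeholder encoding for illustration."""
    assert set(program) <= {"0", "1"}
    return [1.0 if b == "1" else -1.0 for b in program]

w = program_bits_to_weights("1011001")
print(len(w), sum(abs(v) for v in w))  # 7 non-zero parameters, L1 norm 7.0: one unit per program bit
```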
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Main Theorem (informal). N(s) ≤ K(s) ≤ N(s) log N(s) ... Lp weight norm ... collapses to the non-zero parameter count
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Fixed precision is the only regime where the question makes sense
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Eric Allender. When worlds collide: derandomization, lower bounds, and Kolmogorov complexity. Foundations of Software Technology and Theoretical Computer Science, 2001
work page 2001
-
[2]
Why do we need weight decay in modern deep learning?, 2024
Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning?, 2024. URL https://openreview.net/forum?id=RKh7DI23tz
work page 2024
-
[3]
Stronger generalization bounds for deep nets via a compression approach
Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR, 2018
work page 2018
-
[4]
Computational power of neural networks: a characterization in terms of Kolmogorov complexity
José L. Balcázar, Ricard Gavaldà, and Hava T. Siegelmann. Computational power of neural networks: a characterization in terms of Kolmogorov complexity. IEEE Transactions on Information Theory, 43(4):1175–1183, 1997
work page 1997
-
[5]
On the computational power of transformers and its implications in sequence modeling
Satwik Bhattamishra, Arkil Patel, and Navin Goyal. On the computational power of transformers and its implications in sequence modeling. In Raquel Fernández and Tal Linzen, editors, Proceedings of the 24th Conference on Computational Natural Language Learning, pages 455–475, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/...
work page 2020
-
[6]
Gregory J. Chaitin. On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM, 16(1):145–159, 1969
work page 1969
-
[7]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018
work page · Pith review · arXiv 2018
-
[8]
The lottery ticket hypothesis: Finding sparse, trainable neural networks
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019
work page 2019
-
[9]
Looped transformers as programmable computers
Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023
work page 2023
-
[10]
Learning universal predictors
Jordi Grau-Moya, Tim Genewein, Marcus Hutter, Laurent Orseau, Grégoire Delétang, Elliot Catt, Anian Ruoss, Li Kevin Wenliang, Christopher Mattern, Matthew Aitchison, and Joel Veness. Learning universal predictors. arXiv preprint arXiv:2401.14953, 2024
-
[11]
The Minimum Description Length Principle
Peter D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007
work page 2007
-
[12]
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016
work page 2016
-
[13]
Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT), pages 5–13, 1993
work page 1993
-
[14]
Flat minima
Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997
work page 1997
-
[15]
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124, 2021
work page 2021
-
[16]
Quantized neural networks: Training neural networks with low precision weights and activations
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(1):6869–6898, 2017
work page 2017
-
[17]
Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, 2005
work page 2005
-
[18]
Arthur Jacot. Deep learning as a convex paradigm of computation: Minimizing circuit size with resnets. arXiv preprint arXiv:2511.20888, 2025
-
[19]
Andrei N Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965
work page 1965
-
[20]
Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991
work page 1991
-
[21]
Universal intelligence: A definition of machine intelligence
Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444, 2007
work page 2007
-
[22]
Leonid A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973
work page 1973
-
[23]
Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008
work page 2008
-
[24]
Constant bit-size transformers are turing complete
Qian Li and Yuyi Wang. Constant bit-size transformers are turing complete. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 62273–62292. Curran Associates, Inc., 2025. URL https://proceedings.neurips.cc/paper_files/paper/2025/file/59d501de2e2839...
work page 2025
-
[25]
Efficient turing machine simulation with transformers
Qian Li and Yuyi Wang. Efficient turing machine simulation with transformers. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bxVuILo1xx
work page 2026
-
[26]
PAC-Bayes compression bounds so tight that they can explain generalization
Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-Bayes compression bounds so tight that they can explain generalization. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[27]
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. International Conference on Machine Learning, 2017
work page 2017
-
[28]
Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, 2019
work page 2019
-
[29]
On the turing completeness of modern neural network architectures
Jorge Pérez, Javier Marinković, and Pablo Barceló. On the turing completeness of modern neural network architectures. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyGBdo0qFm
work page 2019
-
[30]
Attention is turing-complete
Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021. URL http://jmlr.org/papers/v22/20-302.html
work page 2021
-
[31]
Ask, and it shall be given: On the turing completeness of prompting
Ruizhong Qiu, Zhe Xu, Wenxuan Bao, and Hanghang Tong. Ask, and it shall be given: On the turing completeness of prompting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 6286–6309, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/123d3e814e257e0781e5d3282...
work page 2025
-
[32]
Modeling by shortest data description
Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978
work page 1978
-
[33]
Jesse Roberts. How powerful are decoder-only transformer neural models? In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024. doi: 10.1109/IJCNN60899.2024.10651286
-
[34]
Jürgen Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(96)00127-X. URL https://www.sciencedirect.com/science/article/pii/S089360809600127X
-
[35]
Peter Shaw, James Cohan, Jacob Eisenstein, and Kristina Toutanova. Bridging Kolmogorov complexity and deep learning: Asymptotically optimal description length objectives for transformers. In International Conference on Learning Representations, 2026. arXiv:2509.22445
-
[36]
Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995
work page 1995
-
[37]
A formal theory of inductive inference
Ray J Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7(1):1–22, 1964
work page 1964
-
[38]
Understanding deep learning requires rethinking generalization
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017
work page 2017
Appendix A of the paper (A Concrete Looped Neural Network Model), as excerpted: Definition 1 is instantiated as a ternary-weight looped transformer, mirroring Giannou et al. [9]. A fixed interpreter Π, of constant size |Π| independent of θ, (1) parses the self-delimited prefix to recover W, n, L′, and the architecture metadata; (2) parses the W tuples to reconstruct θ ∈ Θ_{δ,M}; (3) initialises X(0) = 0 and iterates the network forward at fixed precision, applying each layer's linear update plus ReLU and the activation discretisation specified by the architecture; (4) at each iteration t ≥ 1, after each layer is applied, halts if (X(t))_halt > 0, and otherwise emits 1[(X(t))_bit > 0] whenever (X(t))_emit > 0. Step 5 (assembly): the complete program is p̂(θ) = Π · encoding(θ), a concatenation parseable on U since Π knows where its own code ends; then U(p̂(θ)) = s and |p̂(θ)| ≤ |Π| + 3W log₂ W + O(W).
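A minimal Python sketch of the halt/emit loop in the appendix excerpt above: iterate the looped network from the zero state and read three designated coordinates to decide when to stop and which bits to emit. The `step` callable, coordinate indices, and iteration cap are placeholders of ours; the paper's interpreter Π is a fixed network-level construction, not Python.

```python
from typing import Callable, Sequence

def run_looped_network(step: Callable[[Sequence[float]], Sequence[float]],
                       state_dim: int,
                       halt_idx: int, emit_idx: int, bit_idx: int,
                       max_iters: int = 10_000) -> str:
    """Iterate X(t+1) = step(X(t)) from X(0) = 0; emit a bit whenever the emit
    coordinate is positive, and stop once the halt coordinate is positive."""
    x = [0.0] * state_dim
    out: list[str] = []
    for _ in range(max_iters):
        x = list(step(x))                # one pass of the looped network (fixed precision)
        if x[halt_idx] > 0:              # halt flag raised by the network
            break
        if x[emit_idx] > 0:              # emit flag: record the sign of the bit coordinate
            out.append("1" if x[bit_idx] > 0 else "0")
    return "".join(out)

# Toy usage: a hand-written step function that emits "101" and then halts.
def toy_step(x: Sequence[float]) -> list[float]:
    t = int(x[0]) + 1
    bits = [1, 0, 1]
    done = t > len(bits)
    return [float(t), 1.0 if done else 0.0, 0.0 if done else 1.0, float(bits[min(t - 1, 2)])]

print(run_looped_network(toy_step, state_dim=4, halt_idx=1, emit_idx=2, bit_idx=3))  # -> 101
```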