pith. machine review for the scientific record.

arxiv: 2605.10878 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.IT · math.IT

Recognition: 2 Lean theorem links

Neural Weight Norm = Kolmogorov Complexity

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT
keywords Kolmogorov complexity · weight decay · neural networks · Solomonoff induction · universal prior · regularization · algorithmic information theory

The pith

The smallest weight norm of a looped neural network in fixed precision equals the Kolmogorov complexity of its output string up to a log factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the minimal weight norm for a looped neural network to output a binary string equals the Kolmogorov complexity of the string, up to a logarithmic factor. This shows that weight decay effectively implements a prior close to Solomonoff's universal prior over computable objects. The argument works by showing that any program for a universal Turing machine can be embedded into the network weights at unit cost per program bit, and that the network itself can be described by listing its nonzero parameters with only logarithmic addressing overhead. The equality is independent of which norm is chosen because fixed precision makes every norm equivalent to the count of nonzero weights up to constant factors. Readers care because the result supplies a theoretical account of why regularization produces models that generalize by favoring short descriptions.

Core claim

In any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser.
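
Read formally, the two reductions quoted in the abstract give a two-sided bound. In notation introduced here only for illustration (k*(s) for the minimal number of non-zero parameters of a fixed-precision looped network that outputs s, and C_1, ..., C_4 for unspecified constants), they can be written as

    k^*(s) \;\le\; C_1\, K(s) + C_2                      % program embedding: unit cost per program bit
    K(s) \;\le\; C_3\, k^*(s)\,\log_2 k^*(s) + C_4       % enumeration of non-zero parameters, logarithmic addressing overhead

Together these place k*(s) between roughly K(s)/log K(s) and K(s), which is the sense in which the minimal non-zero parameter count, and with it the minimal norm under the abstract's norm-collapse claim, equals K(s) up to a logarithmic factor.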

What carries the argument

A fixed-precision looped neural network whose minimal weight norm tracks Kolmogorov complexity: any program for a universal Turing machine can be encoded into the weights at unit cost per program bit, and the network itself can be described by enumerating its nonzero parameters with logarithmic addressing overhead.
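
A back-of-envelope version of the enumeration direction, using illustrative symbols that are not taken from the paper (W non-zero parameters, M addressable parameter slots, b bits per fixed-precision value): listing the non-zero parameters costs roughly

    L(\theta) \;\approx\; W\,(\lceil \log_2 M \rceil + b) + O(\log W) \ \text{bits},

i.e. linear in W up to the logarithmic addressing term; the decoder fragment preserved at the end of the reference graph states the corresponding bound from the paper's construction, |\hat p(\theta)| \le |\Pi| + 3W \log_2 W + O(W).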

If this is right

  • Weight decay on looped neural networks induces a prior matching Solomonoff's universal prior up to polynomial factors.
  • The bound holds for every choice of weight norm because all norms reduce to the count of nonzero parameters up to constants.
  • The logarithmic overhead is tight and is realized when the nonzero parameters encode a permutation; a counting sketch follows this list.
  • The fixed-precision restriction is required; infinite precision allows non-computable encodings that break the connection to Kolmogorov complexity.
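
A minimal counting sketch of the permutation case (standard Stirling arithmetic, not a step quoted from the paper): a string determined by a permutation of k non-zero parameters ranges over k! possibilities, so specifying a typical one takes about

    \log_2 k! \;=\; k\,\log_2 k - O(k) \ \text{bits},

that is, Kolmogorov complexity on the order of the non-zero parameter count times its logarithm, matching the abstract's statement of where the logarithmic factor is realised.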

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization can be reinterpreted as an implicit search for the shortest program consistent with the observed data.
  • The result offers one route to understanding why overparameterized networks generalize when regularized: they are driven toward low-complexity solutions.
  • Empirical checks could compare minimal-norm networks against known low-complexity strings such as periodic or repetitive sequences, as in the sketch below.
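
A minimal sketch of such a check, assuming zlib compressed length as a crude, computable upper-bound proxy for K(s); the proxy, the string length, and the comparison are illustrative choices, not an experiment from the paper. A real check would train weight-decayed looped networks to emit each string and compare the minimal norms found against these proxies.

    # Illustrative proxy check: zlib length as an upper-bound stand-in for K(s) (assumption, not the paper's method).
    import random
    import zlib

    def complexity_proxy_bits(bits: str) -> int:
        """Bits in the zlib-compressed string: a computable upper-bound proxy for K(s)."""
        return 8 * len(zlib.compress(bits.encode("ascii")))

    n = 4096
    periodic = "01" * (n // 2)  # low K(s): emitted by a constant-size program
    random.seed(0)
    incompressible = "".join(random.choice("01") for _ in range(n))  # K(s) close to n with high probability

    print("periodic proxy bits:", complexity_proxy_bits(periodic))
    print("random   proxy bits:", complexity_proxy_bits(incompressible))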

Load-bearing premise

The weights must operate under a fixed-precision regime; without it, networks can encode non-computable functions and the weight norm ceases to track Kolmogorov complexity.
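
A textbook illustration of why this premise carries weight (the construction is the standard real-weight argument associated with Siegelmann and Sontag [36], not a step taken from this page): with unbounded precision, a single weight can store an arbitrary, possibly uncomputable, infinite binary sequence,

    w \;=\; \sum_{i=1}^{\infty} b_i\, 2^{-i}, \qquad b_i \in \{0,1\},

and a network able to read out successive bits b_i is no longer constrained by any finite-description argument, so its weight norm stops tracking K(s).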

What would settle it

Exhibit a fixed-precision looped network and a binary string it outputs for which the smallest achievable weight norm differs from the Kolmogorov complexity of the string by more than a logarithmic factor.

read the original abstract

Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings.

Significance. If the result holds, it would provide a theoretical foundation linking weight regularization in neural networks directly to Kolmogorov complexity and Solomonoff's universal prior. The explicit reductions, tightness via permutation encodings, and emphasis on the fixed-precision assumption as essential are notable strengths that could clarify inductive biases in deep learning. The norm-agnostic framing would make the result broadly applicable to different regularizers if substantiated.

major comments (1)
  1. Abstract: The claim that 'in fixed precision, every weight norm collapses to the non-zero parameter count up to constants' and therefore 'the same sandwich bound holds for any norm used as a regulariser' is not supported. The two reductions establish that the minimal number of non-zero parameters k satisfies k = Θ(K / log K) for Kolmogorov complexity K. However, even with fixed-precision weights bounded away from zero, norms scale differently: ||w||_1 = Θ(k) while ||w||_2 = Θ(√k). Thus the minimal L2 norm is Θ(√K), inducing a prior ~exp(−λ √K) rather than ~2^{-K}. The ratio between these priors is exponential in √K (super-polynomial), contradicting the claimed polynomial-factor equivalence to Solomonoff's prior. This is load-bearing for the central implication regarding weight decay.
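
Restating the referee's scaling point in explicit notation (a k-sparse weight vector with fixed-precision entries bounded away from zero, magnitudes treated as Θ(1), and λ a regularisation strength; these symbols are illustrative rather than fixed by the report):

    \|w\|_1 = \Theta(k), \qquad \|w\|_2 = \Theta(\sqrt{k}),

so with minimal k = Θ(K / log K), a Gibbs-style prior exp(−λ||w||_2) scales as exp(−λ√(K / log K)), and its ratio to 2^{−K} grows faster than any polynomial in K; this is the gap the referee flags in the norm-agnostic claim.
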
minor comments (1)
  1. The looped neural network architecture (central to encoding arbitrary-length outputs) is only sketched in the abstract; a dedicated explanation of the model, including the looping mechanism and output generation, should be added to the main text for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and for identifying a key imprecision in our abstract. We agree with the major comment and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: The claim that 'in fixed precision, every weight norm collapses to the non-zero parameter count up to constants' and therefore 'the same sandwich bound holds for any norm used as a regulariser' is not supported. The two reductions establish that the minimal number of non-zero parameters k satisfies k = Θ(K / log K) for Kolmogorov complexity K. However, even with fixed-precision weights bounded away from zero, norms scale differently: ||w||_1 = Θ(k) while ||w||_2 = Θ(√k). Thus the minimal L2 norm is Θ(√K), inducing a prior ~exp(−λ √K) rather than ~2^{-K}. The ratio between these priors is exponential in √K (super-polynomial), contradicting the claimed polynomial-factor equivalence to Solomonoff's prior. This is load-bearing for the central implication regarding weight decay.

    Authors: We thank the referee for highlighting this important point. We acknowledge that the statement in the abstract regarding the norm-agnostic nature is not accurate for all norms. The reductions indeed show that the minimal number of non-zero parameters k is Θ(K / log K). For the L1 norm, ||w||_1 = Θ(k) since weights are bounded away from zero in fixed precision, so the minimal L1 norm is Θ(K / log K). However, for the L2 norm, it is Θ(√k) = Θ(√(K / log K)), leading to a different scaling. We agree that this means the induced prior for L2 regularization does not match the universal prior up to a polynomial factor. We will revise the abstract and the relevant sections to remove the claim that the result is norm-agnostic and instead specify that it applies to the number of non-zero parameters and to norms equivalent to it up to constant factors (such as L1). We will also add a discussion on the implications for different regularization techniques, noting that for L2 the connection is weaker. This revision will be made in the next version of the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit bidirectional reductions between program length and weight norm

full rationale

The paper's derivation consists of two direct constructive reductions presented in the abstract: (1) encoding any UTM program into neural weights at unit cost per bit, and (2) describing any fixed-precision network by enumerating its non-zero parameters with logarithmic addressing overhead. These mappings establish sandwich bounds between Kolmogorov complexity K(s) and the minimal non-zero parameter count (hence weight norm, under the fixed-precision assumption that collapses norms to parameter count up to constants). No step defines the target quantity in terms of itself, fits parameters to data then renames the fit as a prediction, or relies on self-citations for uniqueness or ansatz. The fixed-precision regime is an explicit assumption, not smuggled in. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about encoding UTM programs into weights and describing fixed-precision networks via non-zero parameters; no numerical parameters are fitted to data.

axioms (2)
  • domain assumption Any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit
    One of the two reductions used to establish the upper bound on weight norm.
  • domain assumption Fixed-precision weights are required; infinite precision allows non-computable functions
    Explicitly stated as essential in the abstract; without it the norm-Kolmogorov link fails.

pith-pipeline@v0.9.0 · 5489 in / 1396 out tokens · 36808 ms · 2026-05-12T04:35:59.106585+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

  1. [1]

    When worlds collide: derandomization, lower bounds, and Kolmogorov complexity. Foundations of Software Technology and Theoretical Computer Science, 2001

    Eric Allender. When worlds collide: derandomization, lower bounds, and Kolmogorov complexity. Foundations of Software Technology and Theoretical Computer Science, 2001

  2. [2]

    Why do we need weight decay in modern deep learning?, 2024

    Maksym Andriushchenko, Francesco D’Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning?, 2024. URL https://openreview.net/forum?id=RKh7DI23tz

  3. [3]

    Stronger generalization bounds for deep nets via a compression approach

    Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR, 2018

  4. [4]

    Computational power of neural networks: a characterization in terms of Kolmogorov complexity

    José L. Balcázar, Ricard Gavaldà, and Hava T. Siegelmann. Computational power of neural networks: a characterization in terms of Kolmogorov complexity. IEEE Transactions on Information Theory, 43(4):1175–1183, 1997

  5. [5]

    On the computational power of transformers and its implications in sequence modeling

    Satwik Bhattamishra, Arkil Patel, and Navin Goyal. On the computational power of transformers and its implications in sequence modeling. In Raquel Fernández and Tal Linzen, editors, Proceedings of the 24th Conference on Computational Natural Language Learning, pages 455–475, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/ ...

  6. [6]

    Gregory J. Chaitin. On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM, 16(1):145–159, 1969

  7. [7]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018

  8. [8]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019

  9. [9]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  10. [10]

    Learning universal predictors. arXiv preprint arXiv:2401.14953, 2024

    Jordi Grau-Moya, Tim Genewein, Marcus Hutter, Laurent Orseau, Grégoire Delétang, Elliot Catt, Anian Ruoss, Li Kevin Wenliang, Christopher Mattern, Matthew Aitchison, and Joel Veness. Learning universal predictors. arXiv preprint arXiv:2401.14953, 2024

  11. [11]

    The Minimum Description Length Principle

    Peter D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007

  12. [12]

    Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016

  13. [13]

    Keeping the neural networks simple by minimizing the description length of the weights

    Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT), pages 5–13, 1993

  14. [14]

    Flat minima. Neural Computation, 9(1):1–42, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997

  15. [15]

    Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

    Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124, 2021

  16. [16]

    Quantized neural networks: Training neural networks with low precision weights and activations

    Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(1):6869–6898, 2017

  17. [17]

    Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability

    Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, 2005

  18. [18]

    Deep learning as a convex paradigm of computation: Minimizing circuit size with resnets. arXiv preprint arXiv:2511.20888, 2025

    Arthur Jacot. Deep learning as a convex paradigm of computation: Minimizing circuit size with resnets. arXiv preprint arXiv:2511.20888, 2025

  19. [19]

    Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965

    Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965

  20. [20]

    A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991

    Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991

  21. [21]

    Universal intelligence: A definition of machine intelligence

    Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444, 2007

  22. [22]

    Leonid A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973

  23. [23]

    An Introduction to Kolmogorov Complexity and Its Applications

    Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008

  24. [24]

    Constant bit-size transformers are Turing complete

    Qian Li and Yuyi Wang. Constant bit-size transformers are Turing complete. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 62273–62292. Curran Associates, Inc., 2025. URL https://proceedings.neurips.cc/paper_files/paper/2025/file/59d501de2e2839...

  25. [25]

    Efficient Turing machine simulation with transformers

    Qian Li and Yuyi Wang. Efficient Turing machine simulation with transformers. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bxVuILo1xx

  26. [26]

    PAC-Bayes compression bounds so tight that they can explain generalization

    Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-Bayes compression bounds so tight that they can explain generalization. In Advances in Neural Information Processing Systems, 2022

  27. [27]

    Variational dropout sparsifies deep neural networks. International Conference on Machine Learning, 2017

    Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. International Conference on Machine Learning, 2017

  28. [28]

    Uniform convergence may be unable to explain generalization in deep learning

    Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, 2019

  29. [29]

    On the Turing completeness of modern neural network architectures

    Jorge Pérez, Javier Marinković, and Pablo Barceló. On the Turing completeness of modern neural network architectures. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyGBdo0qFm

  30. [30]

    Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021

    Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021. URL http://jmlr.org/papers/v22/20-302.html

  31. [31]

    Ask, and it shall be given: On the Turing completeness of prompting

    Ruizhong Qiu, Zhe Xu, Wenxuan Bao, and Hanghang Tong. Ask, and it shall be given: On the Turing completeness of prompting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 6286–6309, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/123d3e814e257e0781e5d3282...

  32. [32]

    Modeling by shortest data description. Automatica, 14(5):465–471, 1978

    Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978

  33. [33]

    How powerful are decoder-only transformer neural models? In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024

    Jesse Roberts. How powerful are decoder-only transformer neural models? In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024. doi: 10.1109/IJCNN60899.2024.10651286

  34. [34]

    Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997

    Jürgen Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(96)00127-X. URL https://www.sciencedirect.com/science/article/pii/S089360809600127X

  35. [35]

    Bridging Kolmogorov complexity and deep learning: Asymptotically optimal description length objectives for transformers

    Peter Shaw, James Cohan, Jacob Eisenstein, and Kristina Toutanova. Bridging Kolmogorov complexity and deep learning: Asymptotically optimal description length objectives for transformers. In International Conference on Learning Representations, 2026. arXiv:2509.22445

  36. [36]

    On the computational power of neural nets

    Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995

  37. [37]

    A formal theory of inductive inference

    Ray J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7(1):1–22, 1964

  38. [38]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017

  Internal anchor (Appendix A, "A Concrete Looped Neural Network Model"): Definition 1 is instantiated as a ternary-weight looped transformer, mirroring Giannou et al. [9] with the constant... The fixed interpreter Π run by the universal machine U (i) parses the self-delimited prefix to recover W, n, L′, and the architecture metadata; (ii) parses the W tuples to reconstruct θ ∈ Θ_{δ,M}; (iii) initialises X(0) = 0 and iterates the network forward at fixed precision, applying each layer's linear update plus ReLU and the activation discretisation specified by the architecture; and (iv) at each iteration t ≥ 1, after each layer is applied, halts if (X(t))_halt > 0 and otherwise emits 1[(X(t))_bit > 0] whenever (X(t))_emit > 0. Π is fixed, with constant size |Π| independent of θ. Step 5: assembly. The complete program is p̂(θ) = Π · encoding(θ) (concatenation, parseable on U since Π knows where its own code ends). Then U(p̂(θ)) = s and |p̂(θ)| ≤ |Π| + 3W log₂ W + O(W)...