pith. machine review for the scientific record.

arxiv: 2605.10878 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.IT · math.IT

Recognition: 2 Lean theorem links

Neural Weight Norm = Kolmogorov Complexity

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT
keywords Kolmogorov complexity · weight decay · neural networks · Solomonoff induction · universal prior · regularization · algorithmic information theory

The pith

The smallest weight norm of a looped neural network in fixed precision equals the Kolmogorov complexity of its output string up to a log factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the minimal weight norm for a looped neural network to output a binary string equals the Kolmogorov complexity of the string, up to a logarithmic factor. This shows that weight decay effectively implements a prior close to Solomonoff's universal prior over computable objects. The argument works by showing that any program for a universal Turing machine can be embedded into the network weights at unit cost per program bit, and that the network itself can be described by listing its nonzero parameters with only logarithmic addressing overhead. The equality is independent of which norm is chosen because fixed precision makes every norm equivalent to the count of nonzero weights up to constant factors. Readers care because the result supplies a theoretical account of why regularization produces models that generalize by favoring short descriptions.

Core claim

In any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser.
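
Read formally, the two reductions quoted in the abstract give a two-sided bound. In notation introduced here only for illustration (k*(s) for the minimal number of non-zero parameters of a fixed-precision looped network that outputs s, and C_1, ..., C_4 for unspecified constants), they can be written as

    k^*(s) \;\le\; C_1\, K(s) + C_2                      % program embedding: unit cost per program bit
    K(s) \;\le\; C_3\, k^*(s)\,\log_2 k^*(s) + C_4       % enumeration of non-zero parameters, logarithmic addressing overhead

Together these place k*(s) between roughly K(s)/log K(s) and K(s), which is the sense in which the minimal non-zero parameter count, and with it the minimal norm under the abstract's norm-collapse claim, equals K(s) up to a logarithmic factor.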

What carries the argument

A fixed-precision looped neural network whose minimal weight norm tracks Kolmogorov complexity: any program for a universal Turing machine can be encoded into the weights at unit cost per program bit, and the network itself can be described by enumerating its nonzero parameters with logarithmic addressing overhead.
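
A back-of-envelope version of the enumeration direction, using illustrative symbols that are not taken from the paper (W non-zero parameters, M addressable parameter slots, b bits per fixed-precision value): listing the non-zero parameters costs roughly

    L(\theta) \;\approx\; W\,(\lceil \log_2 M \rceil + b) + O(\log W) \ \text{bits},

i.e. linear in W up to the logarithmic addressing term; the decoder fragment preserved at the end of the reference graph states the corresponding bound from the paper's construction, |\hat p(\theta)| \le |\Pi| + 3W \log_2 W + O(W).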

If this is right

  • Weight decay on looped neural networks induces a prior matching Solomonoff's universal prior up to polynomial factors.
  • The bound holds for every choice of weight norm because all norms reduce to the count of nonzero parameters up to constants.
  • The logarithmic overhead is tight and is realized when the nonzero parameters encode a permutation; a counting sketch follows this list.
  • The fixed-precision restriction is required; infinite precision allows non-computable encodings that break the connection to Kolmogorov complexity.
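
A minimal counting sketch of the permutation case (standard Stirling arithmetic, not a step quoted from the paper): a string determined by a permutation of k non-zero parameters ranges over k! possibilities, so specifying a typical one takes about

    \log_2 k! \;=\; k\,\log_2 k - O(k) \ \text{bits},

that is, Kolmogorov complexity on the order of the non-zero parameter count times its logarithm, matching the abstract's statement of where the logarithmic factor is realised.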

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization can be reinterpreted as an implicit search for the shortest program consistent with the observed data.
  • The result offers one route to understanding why overparameterized networks generalize when regularized: they are driven toward low-complexity solutions.
  • Empirical checks could compare minimal-norm networks against known low-complexity strings such as periodic or repetitive sequences, as in the sketch below.
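
A minimal sketch of such a check, assuming zlib compressed length as a crude, computable upper-bound proxy for K(s); the proxy, the string length, and the comparison are illustrative choices, not an experiment from the paper. A real check would train weight-decayed looped networks to emit each string and compare the minimal norms found against these proxies.

    # Illustrative proxy check: zlib length as an upper-bound stand-in for K(s) (assumption, not the paper's method).
    import random
    import zlib

    def complexity_proxy_bits(bits: str) -> int:
        """Bits in the zlib-compressed string: a computable upper-bound proxy for K(s)."""
        return 8 * len(zlib.compress(bits.encode("ascii")))

    n = 4096
    periodic = "01" * (n // 2)  # low K(s): emitted by a constant-size program
    random.seed(0)
    incompressible = "".join(random.choice("01") for _ in range(n))  # K(s) close to n with high probability

    print("periodic proxy bits:", complexity_proxy_bits(periodic))
    print("random   proxy bits:", complexity_proxy_bits(incompressible))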

Load-bearing premise

The weights must operate under a fixed-precision regime; without it, networks can encode non-computable functions and the weight norm ceases to track Kolmogorov complexity.
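
A textbook illustration of why this premise carries weight (the construction is the standard real-weight argument associated with Siegelmann and Sontag [36], not a step taken from this page): with unbounded precision, a single weight can store an arbitrary, possibly uncomputable, infinite binary sequence,

    w \;=\; \sum_{i=1}^{\infty} b_i\, 2^{-i}, \qquad b_i \in \{0,1\},

and a network able to read out successive bits b_i is no longer constrained by any finite-description argument, so its weight norm stops tracking K(s).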

What would settle it

Exhibit a fixed-precision looped network and a binary string it outputs for which the smallest achievable weight norm differs from the Kolmogorov complexity of the string by more than a logarithmic factor.

read the original abstract

Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings.

Significance. If the result holds, it would provide a theoretical foundation linking weight regularization in neural networks directly to Kolmogorov complexity and Solomonoff's universal prior. The explicit reductions, tightness via permutation encodings, and emphasis on the fixed-precision assumption as essential are notable strengths that could clarify inductive biases in deep learning. The norm-agnostic framing would make the result broadly applicable to different regularizers if substantiated.

major comments (1)
  1. Abstract: The claim that 'in fixed precision, every weight norm collapses to the non-zero parameter count up to constants' and therefore 'the same sandwich bound holds for any norm used as a regulariser' is not supported. The two reductions establish that the minimal number of non-zero parameters k satisfies k = Θ(K / log K) for Kolmogorov complexity K. However, even with fixed-precision weights bounded away from zero, norms scale differently: ||w||_1 = Θ(k) while ||w||_2 = Θ(√k). Thus the minimal L2 norm is Θ(√K), inducing a prior ~exp(−λ √K) rather than ~2^{-K}. The ratio between these priors is exponential in √K (super-polynomial), contradicting the claimed polynomial-factor equivalence to Solomonoff's prior. This is load-bearing for the central implication regarding weight decay.
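
Restating the referee's scaling point in explicit notation (a k-sparse weight vector with fixed-precision entries bounded away from zero, magnitudes treated as Θ(1), and λ a regularisation strength; these symbols are illustrative rather than fixed by the report):

    \|w\|_1 = \Theta(k), \qquad \|w\|_2 = \Theta(\sqrt{k}),

so with minimal k = Θ(K / log K), a Gibbs-style prior exp(−λ||w||_2) scales as exp(−λ√(K / log K)), and its ratio to 2^{−K} grows faster than any polynomial in K; this is the gap the referee flags in the norm-agnostic claim.
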
minor comments (1)
  1. The looped neural network architecture (central to encoding arbitrary-length outputs) is only sketched in the abstract; a dedicated explanation of the model, including the looping mechanism and output generation, should be added to the main text for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and for identifying a key imprecision in our abstract. We agree with the major comment and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: The claim that 'in fixed precision, every weight norm collapses to the non-zero parameter count up to constants' and therefore 'the same sandwich bound holds for any norm used as a regulariser' is not supported. The two reductions establish that the minimal number of non-zero parameters k satisfies k = Θ(K / log K) for Kolmogorov complexity K. However, even with fixed-precision weights bounded away from zero, norms scale differently: ||w||_1 = Θ(k) while ||w||_2 = Θ(√k). Thus the minimal L2 norm is Θ(√K), inducing a prior ~exp(−λ √K) rather than ~2^{-K}. The ratio between these priors is exponential in √K (super-polynomial), contradicting the claimed polynomial-factor equivalence to Solomonoff's prior. This is load-bearing for the central implication regarding weight decay.

    Authors: We thank the referee for highlighting this important point. We acknowledge that the statement in the abstract regarding the norm-agnostic nature is not accurate for all norms. The reductions indeed show that the minimal number of non-zero parameters k is Θ(K / log K). For the L1 norm, ||w||_1 = Θ(k) since weights are bounded away from zero in fixed precision, so the minimal L1 norm is Θ(K / log K). However, for the L2 norm, it is Θ(√k) = Θ(√(K / log K)), leading to a different scaling. We agree that this means the induced prior for L2 regularization does not match the universal prior up to a polynomial factor. We will revise the abstract and the relevant sections to remove the claim that the result is norm-agnostic and instead specify that it applies to the number of non-zero parameters and to norms equivalent to it up to constant factors (such as L1). We will also add a discussion on the implications for different regularization techniques, noting that for L2 the connection is weaker. This revision will be made in the next version of the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit bidirectional reductions between program length and weight norm

full rationale

The paper's derivation consists of two direct constructive reductions presented in the abstract: (1) encoding any UTM program into neural weights at unit cost per bit, and (2) describing any fixed-precision network by enumerating its non-zero parameters with logarithmic addressing overhead. These mappings establish sandwich bounds between Kolmogorov complexity K(s) and the minimal non-zero parameter count (hence weight norm, under the fixed-precision assumption that collapses norms to parameter count up to constants). No step defines the target quantity in terms of itself, fits parameters to data then renames the fit as a prediction, or relies on self-citations for uniqueness or ansatz. The fixed-precision regime is an explicit assumption, not smuggled in. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about encoding UTM programs into weights and describing fixed-precision networks via non-zero parameters; no numerical parameters are fitted to data.

axioms (2)
  • domain assumption Any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit
    One of the two reductions used to establish the upper bound on weight norm.
  • domain assumption Fixed-precision weights are required; infinite precision allows non-computable functions
    Explicitly stated as essential in the abstract; without it the norm-Kolmogorov link fails.

pith-pipeline@v0.9.0 · 5489 in / 1396 out tokens · 36808 ms · 2026-05-12T04:35:59.106585+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

  1. [1]

    When worlds collide: derandomization, lower bounds, and Kolmogorov complexity. Foundations of Software Technology and Theoretical Computer Science, 2001

    Eric Allender. When worlds collide: derandomization, lower bounds, and Kolmogorov complexity. Foundations of Software Technology and Theoretical Computer Science, 2001

  2. [2]

    Why do we need weight decay in modern deep learning?, 2024

    Maksym Andriushchenko, Francesco D’Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning?, 2024. URL https://openreview.net/forum?id=RKh7DI23tz

  3. [3]

    Stronger generalization bounds for deep nets via a compression approach

    Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR, 2018

  4. [4]

    Computational power of neural networks: a characterization in terms of Kolmogorov complexity

    José L. Balcázar, Ricard Gavaldà, and Hava T. Siegelmann. Computational power of neural networks: a characterization in terms of Kolmogorov complexity. IEEE Transactions on Information Theory, 43(4):1175–1183, 1997

  5. [5]

    On the computational power of transformers and its implications in sequence modeling

    Satwik Bhattamishra, Arkil Patel, and Navin Goyal. On the computational power of transformers and its implications in sequence modeling. In Raquel Fernández and Tal Linzen, editors, Proceedings of the 24th Conference on Computational Natural Language Learning, pages 455–475, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/ ...

  6. [6]

    Gregory J. Chaitin. On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM, 16(1):145–159, 1969

  7. [7]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018

  8. [8]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019

  9. [9]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  10. [10]

    Learning universal predictors. arXiv preprint arXiv:2401.14953, 2024

    Jordi Grau-Moya, Tim Genewein, Marcus Hutter, Laurent Orseau, Grégoire Delétang, Elliot Catt, Anian Ruoss, Li Kevin Wenliang, Christopher Mattern, Matthew Aitchison, and Joel Veness. Learning universal predictors. arXiv preprint arXiv:2401.14953, 2024

  11. [11]

    The Minimum Description Length Principle

    Peter D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007

  12. [12]

    Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016

  13. [13]

    Keeping the neural networks simple by minimizing the description length of the weights

    Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT), pages 5–13, 1993

  14. [14]

    Flat minima. Neural Computation, 9(1):1–42, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997

  15. [15]

    Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

    Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124, 2021

  16. [16]

    Quantized neural networks: Training neural networks with low precision weights and activations

    Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(1):6869–6898, 2017

  17. [17]

    Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability

    Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, 2005

  18. [18]

    Deep learning as a convex paradigm of computation: Minimizing circuit size with resnets. arXiv preprint arXiv:2511.20888, 2025

    Arthur Jacot. Deep learning as a convex paradigm of computation: Minimizing circuit size with resnets. arXiv preprint arXiv:2511.20888, 2025

  19. [19]

    Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965

    Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965

  20. [20]

    A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991

    Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991

  21. [21]

    Universal intelligence: A definition of machine intelligence

    Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444, 2007

  22. [22]

    Leonid A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973

  23. [23]

    An Introduction to Kolmogorov Complexity and Its Applications

    Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008

  24. [24]

    Constant bit-size transformers are Turing complete

    Qian Li and Yuyi Wang. Constant bit-size transformers are Turing complete. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 62273–62292. Curran Associates, Inc., 2025. URL https://proceedings.neurips.cc/paper_files/paper/2025/file/59d501de2e2839...

  25. [25]

    Efficient Turing machine simulation with transformers

    Qian Li and Yuyi Wang. Efficient Turing machine simulation with transformers. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bxVuILo1xx

  26. [26]

    PAC-Bayes compression bounds so tight that they can explain generalization

    Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-Bayes compression bounds so tight that they can explain generalization. In Advances in Neural Information Processing Systems, 2022

  27. [27]

    Variational dropout sparsifies deep neural networks. International Conference on Machine Learning, 2017

    Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. International Conference on Machine Learning, 2017

  28. [28]

    Uniform convergence may be unable to explain generalization in deep learning

    Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, 2019

  29. [29]

    On the Turing completeness of modern neural network architectures

    Jorge Pérez, Javier Marinković, and Pablo Barceló. On the Turing completeness of modern neural network architectures. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyGBdo0qFm

  30. [30]

    Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021

    Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021. URL http://jmlr.org/papers/v22/20-302.html

  31. [31]

    Ask, and it shall be given: On the Turing completeness of prompting

    Ruizhong Qiu, Zhe Xu, Wenxuan Bao, and Hanghang Tong. Ask, and it shall be given: On the Turing completeness of prompting. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 6286–6309, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/123d3e814e257e0781e5d3282...

  32. [32]

    Modeling by shortest data description. Automatica, 14(5):465–471, 1978

    Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978

  33. [33]

    How powerful are decoder-only transformer neural models? In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024

    Jesse Roberts. How powerful are decoder-only transformer neural models? In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024. doi: 10.1109/IJCNN60899.2024.10651286

  34. [34]

    Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997

    Jürgen Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(96)00127-X. URL https://www.sciencedirect.com/science/article/pii/S089360809600127X

  35. [35]

    Bridging Kolmogorov complexity and deep learning: Asymptotically optimal description length objectives for transformers

    Peter Shaw, James Cohan, Jacob Eisenstein, and Kristina Toutanova. Bridging Kolmogorov complexity and deep learning: Asymptotically optimal description length objectives for transformers. In International Conference on Learning Representations, 2026. arXiv:2509.22445

  36. [36]

    On the computational power of neural nets

    Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995

  37. [37]

    A formal theory of inductive inference

    Ray J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7(1):1–22, 1964

  38. [38]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017

  Internal anchor (Appendix A, "A Concrete Looped Neural Network Model"): Definition 1 is instantiated as a ternary-weight looped transformer, mirroring Giannou et al. [9] with the constant... The fixed interpreter Π run by the universal machine U (i) parses the self-delimited prefix to recover W, n, L′, and the architecture metadata; (ii) parses the W tuples to reconstruct θ ∈ Θ_{δ,M}; (iii) initialises X(0) = 0 and iterates the network forward at fixed precision, applying each layer's linear update plus ReLU and the activation discretisation specified by the architecture; and (iv) at each iteration t ≥ 1, after each layer is applied, halts if (X(t))_halt > 0 and otherwise emits 1[(X(t))_bit > 0] whenever (X(t))_emit > 0. Π is fixed, with constant size |Π| independent of θ. Step 5: assembly. The complete program is p̂(θ) = Π · encoding(θ) (concatenation, parseable on U since Π knows where its own code ends). Then U(p̂(θ)) = s and |p̂(θ)| ≤ |Π| + 3W log₂ W + O(W)...