Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets
Pith reviewed 2026-05-25 15:43 UTC · model grok-4.3
The pith
A multi-scale generalization of the Gibbs posterior uniquely minimizes the multilevel entropically regularized empirical risk for neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The multi-scale generalization of the Gibbs posterior distribution achieves the unique minimum of the empirical risk minimization problem with multilevel entropic regularization. This is derived from generalization and excess risk bounds obtained via generated hierarchical coverings of neural nets and the chaining mutual information technique, resulting in a training procedure that exploits the chain rule of relative entropy rather than the chain rule of derivatives.
What carries the argument
The multi-scale generalization of the Gibbs posterior distribution, which serves as the unique minimizer of the multilevel entropically regularized empirical risk.
If this is right
- The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets.
- A multilevel Metropolis algorithm can simulate the multi-scale Gibbs distribution for practical training.
- Performance guarantees are provided for the new training procedure on neural nets.
- The method applies to a two-layer neural net on the MNIST dataset as demonstrated experimentally.
Where Pith is reading between the lines
- If hierarchical coverings can be generated for deeper networks, the bounds and training method could scale accordingly.
- The replacement of backpropagation with entropy chain rule might offer advantages in settings where gradients are hard to compute or noisy.
- Connections to other information-theoretic regularization techniques could be explored for hybrid methods.
Load-bearing premise
Generated hierarchical coverings of neural nets exist and allow the chaining mutual information technique to produce non-vacuous, algorithm-dependent bounds when applied level by level.
What would settle it
If experiments with the multilevel Metropolis algorithm on MNIST fail to achieve competitive performance or if the derived bounds do not hold for the trained models, the central claim would be falsified.
Figures
read the original abstract
We derive generalization and excess risk bounds for neural nets using a family of complexity measures based on a multilevel relative entropy. The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18. The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets. This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum. This leads to a new training procedure for neural nets with performance guarantees, which exploits the chain rule of relative entropy rather than the chain rule of derivatives (as in backpropagation). To obtain an efficient implementation of the latter, we further develop a multilevel Metropolis algorithm simulating the multi-scale Gibbs distribution, with an experiment for a two-layer neural net on the MNIST data set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives generalization and excess risk bounds for neural nets via multilevel relative entropy, using generated hierarchical coverings and the chaining mutual information technique. It formulates an empirical risk minimization problem with multilevel entropic regularization whose unique minimizer is a multi-scale generalization of the Gibbs posterior; this yields a training procedure implemented via a multilevel Metropolis algorithm that exploits the chain rule of relative entropy. An experiment is provided for a two-layer net on MNIST.
Significance. If the hierarchical covering construction can be realized efficiently for general depths and the resulting bounds remain non-vacuous, the work supplies algorithm-dependent guarantees that exploit net structure and offers a theoretically grounded alternative training method. The formal derivation that the multi-scale Gibbs distribution uniquely minimizes the multilevel regularized objective is a clear technical contribution when the supporting constructions hold.
major comments (2)
- [Abstract] Abstract and experiment description: the central claim that the multilevel entropic ERM yields performance guarantees via chaining MI applied level-by-level rests on the existence of generated hierarchical coverings that keep the per-level mutual information terms finite and contractive. The manuscript provides no explicit construction or bound computation for depth >2; the MNIST experiment is restricted to a two-layer net. Without this, the bounds risk becoming vacuous even if the chain-rule derivation is formally correct.
- [Abstract] The uniqueness proof for the multi-scale Gibbs posterior as minimizer of the multilevel entropic objective is stated as following from the chain rule of relative entropy, but the load-bearing step is whether the regularization parameters remain independent of the fitted model or reduce to a tautology once the coverings are chosen; this needs explicit verification in the derivation.
minor comments (2)
- The abstract introduces 'generated hierarchical coverings' and 'multi-scale generalization of the Gibbs posterior' without a one-sentence definition or pointer to the section containing the formal definition.
- Notation for the multilevel relative entropy and the chaining MI terms should be introduced with a short table or diagram to clarify the hierarchy levels.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract and experiment description: the central claim that the multilevel entropic ERM yields performance guarantees via chaining MI applied level-by-level rests on the existence of generated hierarchical coverings that keep the per-level mutual information terms finite and contractive. The manuscript provides no explicit construction or bound computation for depth >2; the MNIST experiment is restricted to a two-layer net. Without this, the bounds risk becoming vacuous even if the chain-rule derivation is formally correct.
Authors: We agree that the manuscript provides no explicit construction of generated hierarchical coverings (or associated bound computations) for depth greater than 2, and that the MNIST experiment is restricted to a two-layer network. The framework defines the coverings recursively via the multilevel structure, but the question of efficient, non-vacuous realizations for deeper nets is left open. We will revise the manuscript to state this limitation explicitly and to outline directions for future constructions. revision: yes
-
Referee: [Abstract] The uniqueness proof for the multi-scale Gibbs posterior as minimizer of the multilevel entropic objective is stated as following from the chain rule of relative entropy, but the load-bearing step is whether the regularization parameters remain independent of the fitted model or reduce to a tautology once the coverings are chosen; this needs explicit verification in the derivation.
Authors: The regularization parameters are chosen independently of any fitted model; they are determined by the (fixed) hierarchical covering before the optimization begins. Uniqueness follows from the strict convexity of relative entropy together with the additive chain-rule decomposition of the multilevel objective. We will expand the relevant derivation section to include an explicit verification of parameter independence and uniqueness. revision: yes
- Explicit construction of generated hierarchical coverings for neural nets of depth greater than 2 that keep per-level mutual information terms finite and contractive.
Circularity Check
Multilevel entropic regularization and Gibbs posterior defined interdependently by construction
specific steps
-
self definitional
[Abstract]
"This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum."
The multilevel entropic regularization is constructed directly from the generalization bounds; the multi-scale Gibbs is then defined as the distribution minimizing that regularization (standard property of Gibbs posteriors), so the 'proof' of unique minimum reduces to the definition by construction rather than an independent result.
-
self citation load bearing
[Abstract]
"The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18."
The bounds (and therefore the regularization they induce) depend on the chaining mutual information technique and the existence of generated hierarchical coverings, both justified only by citation to the authors' own prior NeurIPS'18 paper; no independent construction or verification is supplied here for the multilevel case.
full rationale
The derivation introduces generated hierarchical coverings and applies the chaining MI technique (cited from authors' prior work) to obtain generalization bounds; these bounds directly define the multilevel entropic regularization term in the ERM objective. The multi-scale Gibbs posterior is then introduced and 'proven' to achieve the unique minimum of that objective. This minimum property holds by the standard definition of the Gibbs distribution for any relative-entropy regularizer, rendering the central claim tautological. The chaining technique and coverings are load-bearing but rest on self-citation without new independent verification for depth >2. This produces partial circularity in the claimed training procedure and guarantees.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The chaining mutual information technique from Asadi et al. NeurIPS'18 extends to generated hierarchical coverings of neural nets at multiple levels.
invented entities (1)
-
multi-scale generalization of the Gibbs posterior distribution
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning
OlivierCatoni. PAC-Bayesiansupervisedclassification: thethermodynamicsofstatistical learning. arXiv preprint arXiv:0712.0248, 2007
work page internal anchor Pith review Pith/arXiv arXiv 2007
-
[2]
Controllingbiasinadaptivedataanalysisusinginformation theory
DanielRussoandJamesZou. Controllingbiasinadaptivedataanalysisusinginformation theory. InArtificial Intelligence and Statistics, pages 1232–1240, 2016
work page 2016
-
[3]
Information-theoreticanalysisofgeneralizationcapability of learning algorithms
AolinXuandMaximRaginsky. Information-theoreticanalysisofgeneralizationcapability of learning algorithms. InAdvances in Neural Information Processing Systems, pages 2524–2533, 2017
work page 2017
-
[4]
Asadi, Emmanuel Abbe, and Sergio Verdú
Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. InAdvances in Neural Information Processing Systems, pages 7234–7243, 2018
work page 2018
-
[5]
Identity Matters in Deep Learning
Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[6]
PeterL.Bartlett, StevenN.Evans, andPhilipM.Long. Representingsmoothfunctionsas compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[7]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[8]
On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999
Nicoló Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999
work page 1999
-
[9]
A chaining algorithm for online nonpara- metric regression
Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonpara- metric regression. InConference on Learning Theory, pages 764–796, 2015
work page 2015
-
[10]
Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning
Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[11]
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. InReadings in computer vision, pages 564–584. Elsevier, 1987
work page 1987
-
[12]
Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011
work page 2011
-
[13]
A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019
Benjamin Guedj. A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019
-
[14]
Jean-Yves Audibert and Olivier Bousquet. PAC-Bayesian generic chaining. InAdvances in neural information processing systems, pages 1125–1132, 2004
work page 2004
-
[15]
Information-theoretic analysis of stability and bias of learning algorithms
Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016. 13
work page 2016
-
[16]
Dependence measures bounding the exploration bias for general measurements
Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Dependence measures bounding the exploration bias for general measurements. In2017 IEEE International Symposium on Information Theory (ISIT), pages 1475–1479. IEEE, 2017
work page 2017
-
[17]
Generalization error bounds for noisy, iterative algorithms
Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018
work page 2018
-
[18]
Learners that Use Little Information
Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information.arXiv preprint arXiv:1710.05233, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[19]
Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error.arXiv preprint arXiv:1901.04609, 2019
-
[20]
Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.arXiv preprint arXiv:1707.09564, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[22]
Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach
Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a PAC-Bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[23]
Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. InAdvances in Neural Information Processing Systems, pages 8430–8441, 2018
work page 2018
- [24]
-
[25]
Theoretical analysis of a class of randomized regularization methods
Tong Zhang. Theoretical analysis of a class of randomized regularization methods. In Proceedings of the twelfth annual conference on Computational learning theory, pages 156–163. ACM, 1999
work page 1999
-
[26]
Tong Zhang. Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006
work page 2006
-
[27]
Information-theoretic upper and lower bounds for statistical estimation
Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006
work page 2006
-
[28]
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys.arXiv preprint arXiv:1611.01838, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[29]
Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
Gintare Karolina Dziugaite and Daniel M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of entropy-SGD and data-dependent priors. arXiv preprint arXiv:1712.09376, 2017. 14
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[31]
Asadi, Emmanuel Abbe, and Sergio Verdú
Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Compressing data on graphs with clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1583–1587. IEEE, 2017
work page 2017
-
[32]
Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017
work page 2017
-
[33]
Ramon van Handel. Probability in high dimension. [Online]. Available: https: // www. princeton. edu/ ~rvan/ APC550. pdf, Dec. 21 2016
work page 2016
-
[34]
Cambridge Series in Statistical and Probabilistic Mathematics
Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018
work page 2018
-
[35]
Springer Science & Business Media, 2014
Michel Talagrand.Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014
work page 2014
-
[36]
Evaluations de processus Gaussiens composes
Xavier Fernique. Evaluations de processus Gaussiens composes. InProbability in Banach Spaces, pages 67–83. Springer, 1976
work page 1976
-
[37]
Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998
Sergio Verdú. Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998
work page 2057
-
[38]
Springer Science & Business Media, 2011
Erhan Çınlar.Probability and stochastics, volume 261. Springer Science & Business Media, 2011
work page 2011
-
[39]
Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. InAdvances in Neural Information Processing Systems, pages 6240–6249, 2017
work page 2017
-
[40]
Bartlett.Neural network learning: Theoretical foundations
Martin Anthony and Peter L. Bartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009
work page 2009
-
[41]
Distribution-Dependent Analysis of Gibbs-ERM Principle
Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-dependent analysis of Gibbs-ERM principle.arXiv preprint arXiv:1902.01846, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[42]
Jean-Francois Bercher. A simple probabilistic construction yielding generalized entropies and divergences, escort distributions and q-gaussians.Physica A: Statistical Mechanics and its Applications, 391(19):4460–4469, 2012
work page 2012
-
[43]
Rényi divergence and Kullback-Leibler divergence
Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014
work page 2014
-
[44]
Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992
work page 1992
-
[45]
Stochastic gradient Hamiltonian Monte Carlo
Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014
work page 2014
-
[46]
Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016. 15
work page 2016
-
[47]
Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. John Wiley & Sons, 2012
work page 2012
-
[48]
Sergio Verdú. α-mutual information. In 2015 Information Theory and Applications Workshop (ITA), pages 1–6. IEEE, 2015
work page 2015
-
[49]
On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning
Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.arXiv preprint arXiv:1704.00805, 2017. 16 A Information-theoretic tools Definition 3 (Relative information). Given probability measuresP and Q defined on a measurable space(A, F ), such thatP ≪Q, the relative information betweenP a...
work page internal anchor Pith review Pith/arXiv arXiv 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.