pith. sign in

arxiv: 1906.11148 · v1 · pith:ULW3FWFXnew · submitted 2019-06-26 · 💻 cs.LG · cs.IT· math.IT· stat.ML

Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets

Pith reviewed 2026-05-25 15:43 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.ITstat.ML
keywords neural networksgeneralization boundsentropic regularizationGibbs posteriorchaining mutual informationmultilevel structureMetropolis algorithmempirical risk minimization
0
0 comments X

The pith

A multi-scale generalization of the Gibbs posterior uniquely minimizes the multilevel entropically regularized empirical risk for neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to obtain generalization bounds for neural nets by measuring complexity through multilevel relative entropy, using hierarchical coverings and chaining of mutual information. This leads to an empirical risk minimization problem augmented with multilevel entropic regularization. The authors prove that a multi-scale Gibbs posterior distribution is the unique solution to this minimization, which in turn defines a new training procedure that relies on the chain rule of relative entropy. Such an approach matters because it respects the layered structure of neural nets and supplies performance guarantees without relying on derivatives as in standard backpropagation.

Core claim

The multi-scale generalization of the Gibbs posterior distribution achieves the unique minimum of the empirical risk minimization problem with multilevel entropic regularization. This is derived from generalization and excess risk bounds obtained via generated hierarchical coverings of neural nets and the chaining mutual information technique, resulting in a training procedure that exploits the chain rule of relative entropy rather than the chain rule of derivatives.

What carries the argument

The multi-scale generalization of the Gibbs posterior distribution, which serves as the unique minimizer of the multilevel entropically regularized empirical risk.

If this is right

  • The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets.
  • A multilevel Metropolis algorithm can simulate the multi-scale Gibbs distribution for practical training.
  • Performance guarantees are provided for the new training procedure on neural nets.
  • The method applies to a two-layer neural net on the MNIST dataset as demonstrated experimentally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If hierarchical coverings can be generated for deeper networks, the bounds and training method could scale accordingly.
  • The replacement of backpropagation with entropy chain rule might offer advantages in settings where gradients are hard to compute or noisy.
  • Connections to other information-theoretic regularization techniques could be explored for hybrid methods.

Load-bearing premise

Generated hierarchical coverings of neural nets exist and allow the chaining mutual information technique to produce non-vacuous, algorithm-dependent bounds when applied level by level.

What would settle it

If experiments with the multilevel Metropolis algorithm on MNIST fail to achieve competitive performance or if the derived bounds do not hold for the trained models, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.11148 by Amir R. Asadi, Emmanuel Abbe.

Figure 1
Figure 1. Figure 1: Dyadic sequence of partitions of T Can T and the sequence {Pk}∞ k=1 be related to the hypothesis set of a multi￾level architecture and its generated cov￾erings? For all integers i ≥ 1, let Wi , cos θ − sin θ sin θ cos θ  [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example 2 Tuning the temperature parameter for simulating the Gibbs distribution is usually done with cross-validation [1, 13]. We leave for future work the problem of tuning the temperature vector for achieving low test error while having low mixing time. To simulate P ? W|S for more than two layers, similar to line 4 of Algorithm 2, one can compute Monte Carlo approximations to the acceptance ratio of ea… view at source ↗
read the original abstract

We derive generalization and excess risk bounds for neural nets using a family of complexity measures based on a multilevel relative entropy. The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18. The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets. This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum. This leads to a new training procedure for neural nets with performance guarantees, which exploits the chain rule of relative entropy rather than the chain rule of derivatives (as in backpropagation). To obtain an efficient implementation of the latter, we further develop a multilevel Metropolis algorithm simulating the multi-scale Gibbs distribution, with an experiment for a two-layer neural net on the MNIST data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives generalization and excess risk bounds for neural nets via multilevel relative entropy, using generated hierarchical coverings and the chaining mutual information technique. It formulates an empirical risk minimization problem with multilevel entropic regularization whose unique minimizer is a multi-scale generalization of the Gibbs posterior; this yields a training procedure implemented via a multilevel Metropolis algorithm that exploits the chain rule of relative entropy. An experiment is provided for a two-layer net on MNIST.

Significance. If the hierarchical covering construction can be realized efficiently for general depths and the resulting bounds remain non-vacuous, the work supplies algorithm-dependent guarantees that exploit net structure and offers a theoretically grounded alternative training method. The formal derivation that the multi-scale Gibbs distribution uniquely minimizes the multilevel regularized objective is a clear technical contribution when the supporting constructions hold.

major comments (2)
  1. [Abstract] Abstract and experiment description: the central claim that the multilevel entropic ERM yields performance guarantees via chaining MI applied level-by-level rests on the existence of generated hierarchical coverings that keep the per-level mutual information terms finite and contractive. The manuscript provides no explicit construction or bound computation for depth >2; the MNIST experiment is restricted to a two-layer net. Without this, the bounds risk becoming vacuous even if the chain-rule derivation is formally correct.
  2. [Abstract] The uniqueness proof for the multi-scale Gibbs posterior as minimizer of the multilevel entropic objective is stated as following from the chain rule of relative entropy, but the load-bearing step is whether the regularization parameters remain independent of the fitted model or reduce to a tautology once the coverings are chosen; this needs explicit verification in the derivation.
minor comments (2)
  1. The abstract introduces 'generated hierarchical coverings' and 'multi-scale generalization of the Gibbs posterior' without a one-sentence definition or pointer to the section containing the formal definition.
  2. Notation for the multilevel relative entropy and the chaining MI terms should be introduced with a short table or diagram to clarify the hierarchy levels.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experiment description: the central claim that the multilevel entropic ERM yields performance guarantees via chaining MI applied level-by-level rests on the existence of generated hierarchical coverings that keep the per-level mutual information terms finite and contractive. The manuscript provides no explicit construction or bound computation for depth >2; the MNIST experiment is restricted to a two-layer net. Without this, the bounds risk becoming vacuous even if the chain-rule derivation is formally correct.

    Authors: We agree that the manuscript provides no explicit construction of generated hierarchical coverings (or associated bound computations) for depth greater than 2, and that the MNIST experiment is restricted to a two-layer network. The framework defines the coverings recursively via the multilevel structure, but the question of efficient, non-vacuous realizations for deeper nets is left open. We will revise the manuscript to state this limitation explicitly and to outline directions for future constructions. revision: yes

  2. Referee: [Abstract] The uniqueness proof for the multi-scale Gibbs posterior as minimizer of the multilevel entropic objective is stated as following from the chain rule of relative entropy, but the load-bearing step is whether the regularization parameters remain independent of the fitted model or reduce to a tautology once the coverings are chosen; this needs explicit verification in the derivation.

    Authors: The regularization parameters are chosen independently of any fitted model; they are determined by the (fixed) hierarchical covering before the optimization begins. Uniqueness follows from the strict convexity of relative entropy together with the additive chain-rule decomposition of the multilevel objective. We will expand the relevant derivation section to include an explicit verification of parameter independence and uniqueness. revision: yes

standing simulated objections not resolved
  • Explicit construction of generated hierarchical coverings for neural nets of depth greater than 2 that keep per-level mutual information terms finite and contractive.

Circularity Check

2 steps flagged

Multilevel entropic regularization and Gibbs posterior defined interdependently by construction

specific steps
  1. self definitional [Abstract]
    "This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum."

    The multilevel entropic regularization is constructed directly from the generalization bounds; the multi-scale Gibbs is then defined as the distribution minimizing that regularization (standard property of Gibbs posteriors), so the 'proof' of unique minimum reduces to the definition by construction rather than an independent result.

  2. self citation load bearing [Abstract]
    "The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18."

    The bounds (and therefore the regularization they induce) depend on the chaining mutual information technique and the existence of generated hierarchical coverings, both justified only by citation to the authors' own prior NeurIPS'18 paper; no independent construction or verification is supplied here for the multilevel case.

full rationale

The derivation introduces generated hierarchical coverings and applies the chaining MI technique (cited from authors' prior work) to obtain generalization bounds; these bounds directly define the multilevel entropic regularization term in the ERM objective. The multi-scale Gibbs posterior is then introduced and 'proven' to achieve the unique minimum of that objective. This minimum property holds by the standard definition of the Gibbs distribution for any relative-entropy regularizer, rendering the central claim tautological. The chaining technique and coverings are load-bearing but rest on self-citation without new independent verification for depth >2. This produces partial circularity in the claimed training procedure and guarantees.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger populated from abstract only; the central claims rest on an extension of prior chaining mutual information work and on the existence of suitable hierarchical coverings whose properties are not independently verified here.

axioms (1)
  • domain assumption The chaining mutual information technique from Asadi et al. NeurIPS'18 extends to generated hierarchical coverings of neural nets at multiple levels.
    Invoked to obtain the multilevel bounds and the corresponding regularization.
invented entities (1)
  • multi-scale generalization of the Gibbs posterior distribution no independent evidence
    purpose: Unique minimizer of the multilevel entropic regularized empirical risk minimization problem
    Introduced to resolve the minimization and to define the new training procedure; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5707 in / 1417 out tokens · 26428 ms · 2026-05-25T15:43:37.990855+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 13 internal anchors

  1. [1]

    Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

    OlivierCatoni. PAC-Bayesiansupervisedclassification: thethermodynamicsofstatistical learning. arXiv preprint arXiv:0712.0248, 2007

  2. [2]

    Controllingbiasinadaptivedataanalysisusinginformation theory

    DanielRussoandJamesZou. Controllingbiasinadaptivedataanalysisusinginformation theory. InArtificial Intelligence and Statistics, pages 1232–1240, 2016

  3. [3]

    Information-theoreticanalysisofgeneralizationcapability of learning algorithms

    AolinXuandMaximRaginsky. Information-theoreticanalysisofgeneralizationcapability of learning algorithms. InAdvances in Neural Information Processing Systems, pages 2524–2533, 2017

  4. [4]

    Asadi, Emmanuel Abbe, and Sergio Verdú

    Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. InAdvances in Neural Information Processing Systems, pages 7234–7243, 2018

  5. [5]

    Identity Matters in Deep Learning

    Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016

  6. [6]

    Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

    PeterL.Bartlett, StevenN.Evans, andPhilipM.Long. Representingsmoothfunctionsas compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018

  7. [7]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  8. [8]

    On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999

    Nicoló Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999

  9. [9]

    A chaining algorithm for online nonpara- metric regression

    Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonpara- metric regression. InConference on Learning Theory, pages 764–796, 2015

  10. [10]

    Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

    Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017

  11. [11]

    Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

    Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. InReadings in computer vision, pages 564–584. Elsevier, 1987

  12. [12]

    Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011

  13. [13]

    A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019

    Benjamin Guedj. A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019

  14. [14]

    PAC-Bayesian generic chaining

    Jean-Yves Audibert and Olivier Bousquet. PAC-Bayesian generic chaining. InAdvances in neural information processing systems, pages 1125–1132, 2004

  15. [15]

    Information-theoretic analysis of stability and bias of learning algorithms

    Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016. 13

  16. [16]

    Dependence measures bounding the exploration bias for general measurements

    Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Dependence measures bounding the exploration bias for general measurements. In2017 IEEE International Symposium on Information Theory (ISIT), pages 1475–1479. IEEE, 2017

  17. [17]

    Generalization error bounds for noisy, iterative algorithms

    Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018

  18. [18]

    Learners that Use Little Information

    Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information.arXiv preprint arXiv:1710.05233, 2017

  19. [19]

    Veeravalli

    Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error.arXiv preprint arXiv:1901.04609, 2019

  20. [20]

    Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017

  21. [21]

    A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

    Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.arXiv preprint arXiv:1707.09564, 2017

  22. [22]

    Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach

    Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a PAC-Bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018

  23. [23]

    Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. InAdvances in Neural Information Processing Systems, pages 8430–8441, 2018

  24. [24]

    Tsybakov

    Philippe Rigollet and Alexandre B. Tsybakov. Sparse estimation by exponential weight- ing. Statistical Science, 27(4):558–575, 2012

  25. [25]

    Theoretical analysis of a class of randomized regularization methods

    Tong Zhang. Theoretical analysis of a class of randomized regularization methods. In Proceedings of the twelfth annual conference on Computational learning theory, pages 156–163. ACM, 1999

  26. [26]

    Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006

    Tong Zhang. Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006

  27. [27]

    Information-theoretic upper and lower bounds for statistical estimation

    Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006

  28. [28]

    Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

    Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys.arXiv preprint arXiv:1611.01838, 2016

  29. [29]

    Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

    Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017

  30. [30]

    Gintare Karolina Dziugaite and Daniel M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of entropy-SGD and data-dependent priors. arXiv preprint arXiv:1712.09376, 2017. 14

  31. [31]

    Asadi, Emmanuel Abbe, and Sergio Verdú

    Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Compressing data on graphs with clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1583–1587. IEEE, 2017

  32. [32]

    On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017

    Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017

  33. [33]

    Probability in high dimension

    Ramon van Handel. Probability in high dimension. [Online]. Available: https: // www. princeton. edu/ ~rvan/ APC550. pdf, Dec. 21 2016

  34. [34]

    Cambridge Series in Statistical and Probabilistic Mathematics

    Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018

  35. [35]

    Springer Science & Business Media, 2014

    Michel Talagrand.Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014

  36. [36]

    Evaluations de processus Gaussiens composes

    Xavier Fernique. Evaluations de processus Gaussiens composes. InProbability in Banach Spaces, pages 67–83. Springer, 1976

  37. [37]

    Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998

    Sergio Verdú. Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998

  38. [38]

    Springer Science & Business Media, 2011

    Erhan Çınlar.Probability and stochastics, volume 261. Springer Science & Business Media, 2011

  39. [39]

    Bartlett, Dylan J

    Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. InAdvances in Neural Information Processing Systems, pages 6240–6249, 2017

  40. [40]

    Bartlett.Neural network learning: Theoretical foundations

    Martin Anthony and Peter L. Bartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009

  41. [41]

    Distribution-Dependent Analysis of Gibbs-ERM Principle

    Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-dependent analysis of Gibbs-ERM principle.arXiv preprint arXiv:1902.01846, 2019

  42. [42]

    Jean-Francois Bercher. A simple probabilistic construction yielding generalized entropies and divergences, escort distributions and q-gaussians.Physica A: Statistical Mechanics and its Applications, 391(19):4460–4469, 2012

  43. [43]

    Rényi divergence and Kullback-Leibler divergence

    Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014

  44. [44]

    Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992

  45. [45]

    Stochastic gradient Hamiltonian Monte Carlo

    Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014

  46. [46]

    On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016

    Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016. 15

  47. [47]

    Cover and Joy A

    Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. John Wiley & Sons, 2012

  48. [48]

    α-mutual information

    Sergio Verdú. α-mutual information. In 2015 Information Theory and Applications Workshop (ITA), pages 1–6. IEEE, 2015

  49. [49]

    On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

    Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.arXiv preprint arXiv:1704.00805, 2017. 16 A Information-theoretic tools Definition 3 (Relative information). Given probability measuresP and Q defined on a measurable space(A, F ), such thatP ≪Q, the relative information betweenP a...