Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets

Amir R. Asadi; Emmanuel Abbe

arxiv: 1906.11148 · v1 · pith:ULW3FWFXnew · submitted 2019-06-26 · 💻 cs.LG · cs.IT· math.IT· stat.ML

Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets

Amir R. Asadi , Emmanuel Abbe This is my paper

Pith reviewed 2026-05-25 15:43 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.ITstat.ML

keywords neural networksgeneralization boundsentropic regularizationGibbs posteriorchaining mutual informationmultilevel structureMetropolis algorithmempirical risk minimization

0 comments

The pith

A multi-scale generalization of the Gibbs posterior uniquely minimizes the multilevel entropically regularized empirical risk for neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to obtain generalization bounds for neural nets by measuring complexity through multilevel relative entropy, using hierarchical coverings and chaining of mutual information. This leads to an empirical risk minimization problem augmented with multilevel entropic regularization. The authors prove that a multi-scale Gibbs posterior distribution is the unique solution to this minimization, which in turn defines a new training procedure that relies on the chain rule of relative entropy. Such an approach matters because it respects the layered structure of neural nets and supplies performance guarantees without relying on derivatives as in standard backpropagation.

Core claim

The multi-scale generalization of the Gibbs posterior distribution achieves the unique minimum of the empirical risk minimization problem with multilevel entropic regularization. This is derived from generalization and excess risk bounds obtained via generated hierarchical coverings of neural nets and the chaining mutual information technique, resulting in a training procedure that exploits the chain rule of relative entropy rather than the chain rule of derivatives.

What carries the argument

The multi-scale generalization of the Gibbs posterior distribution, which serves as the unique minimizer of the multilevel entropically regularized empirical risk.

If this is right

The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets.
A multilevel Metropolis algorithm can simulate the multi-scale Gibbs distribution for practical training.
Performance guarantees are provided for the new training procedure on neural nets.
The method applies to a two-layer neural net on the MNIST dataset as demonstrated experimentally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If hierarchical coverings can be generated for deeper networks, the bounds and training method could scale accordingly.
The replacement of backpropagation with entropy chain rule might offer advantages in settings where gradients are hard to compute or noisy.
Connections to other information-theoretic regularization techniques could be explored for hybrid methods.

Load-bearing premise

Generated hierarchical coverings of neural nets exist and allow the chaining mutual information technique to produce non-vacuous, algorithm-dependent bounds when applied level by level.

What would settle it

If experiments with the multilevel Metropolis algorithm on MNIST fail to achieve competitive performance or if the derived bounds do not hold for the trained models, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.11148 by Amir R. Asadi, Emmanuel Abbe.

**Figure 1.** Figure 1: Dyadic sequence of partitions of T Can T and the sequence {Pk}∞ k=1 be related to the hypothesis set of a multilevel architecture and its generated coverings? For all integers i ≥ 1, let Wi , cos θ − sin θ sin θ cos θ [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 3.** Figure 3: Example 2 Tuning the temperature parameter for simulating the Gibbs distribution is usually done with cross-validation [1, 13]. We leave for future work the problem of tuning the temperature vector for achieving low test error while having low mixing time. To simulate P ? W|S for more than two layers, similar to line 4 of Algorithm 2, one can compute Monte Carlo approximations to the acceptance ratio of ea… view at source ↗

read the original abstract

We derive generalization and excess risk bounds for neural nets using a family of complexity measures based on a multilevel relative entropy. The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18. The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets. This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum. This leads to a new training procedure for neural nets with performance guarantees, which exploits the chain rule of relative entropy rather than the chain rule of derivatives (as in backpropagation). To obtain an efficient implementation of the latter, we further develop a multilevel Metropolis algorithm simulating the multi-scale Gibbs distribution, with an experiment for a two-layer neural net on the MNIST data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The multilevel Gibbs posterior and Metropolis sampler are the concrete new output, but the hierarchical coverings are only shown to work for two-layer nets.

read the letter

The paper takes the chaining mutual information technique from the authors' earlier NeurIPS work and lifts it to multilevel relative entropy bounds by defining generated hierarchical coverings of neural nets. From those bounds it defines a multilevel entropic regularizer whose unique minimizer is a multi-scale Gibbs distribution, then gives a Metropolis algorithm to sample from it. The derivation uses the chain rule on relative entropy to get the uniqueness result, which is formally neat and gives an explicit training procedure with built-in risk certificates. The two-layer MNIST experiment at least confirms the sampler can be implemented and run on a real dataset. The main limitation is that the whole construction depends on being able to generate the hierarchical coverings so that the per-level mutual information terms stay finite and useful. The paper only demonstrates this for depth two; there is no explicit covering construction or bound computation shown for deeper nets. If the coverings become expensive or the bounds turn vacuous at greater depth, the performance guarantees lose practical value even though the math itself holds. Readers already working on information-theoretic bounds or alternative training schemes for nets will get the most out of it. The formal derivation and the sampler are grounded enough to justify sending the paper to peer review rather than desk rejection.

Referee Report

2 major / 2 minor

Summary. The paper derives generalization and excess risk bounds for neural nets via multilevel relative entropy, using generated hierarchical coverings and the chaining mutual information technique. It formulates an empirical risk minimization problem with multilevel entropic regularization whose unique minimizer is a multi-scale generalization of the Gibbs posterior; this yields a training procedure implemented via a multilevel Metropolis algorithm that exploits the chain rule of relative entropy. An experiment is provided for a two-layer net on MNIST.

Significance. If the hierarchical covering construction can be realized efficiently for general depths and the resulting bounds remain non-vacuous, the work supplies algorithm-dependent guarantees that exploit net structure and offers a theoretically grounded alternative training method. The formal derivation that the multi-scale Gibbs distribution uniquely minimizes the multilevel regularized objective is a clear technical contribution when the supporting constructions hold.

major comments (2)

[Abstract] Abstract and experiment description: the central claim that the multilevel entropic ERM yields performance guarantees via chaining MI applied level-by-level rests on the existence of generated hierarchical coverings that keep the per-level mutual information terms finite and contractive. The manuscript provides no explicit construction or bound computation for depth >2; the MNIST experiment is restricted to a two-layer net. Without this, the bounds risk becoming vacuous even if the chain-rule derivation is formally correct.
[Abstract] The uniqueness proof for the multi-scale Gibbs posterior as minimizer of the multilevel entropic objective is stated as following from the chain rule of relative entropy, but the load-bearing step is whether the regularization parameters remain independent of the fitted model or reduce to a tautology once the coverings are chosen; this needs explicit verification in the derivation.

minor comments (2)

The abstract introduces 'generated hierarchical coverings' and 'multi-scale generalization of the Gibbs posterior' without a one-sentence definition or pointer to the section containing the formal definition.
Notation for the multilevel relative entropy and the chaining MI terms should be introduced with a short table or diagram to clarify the hierarchy levels.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract and experiment description: the central claim that the multilevel entropic ERM yields performance guarantees via chaining MI applied level-by-level rests on the existence of generated hierarchical coverings that keep the per-level mutual information terms finite and contractive. The manuscript provides no explicit construction or bound computation for depth >2; the MNIST experiment is restricted to a two-layer net. Without this, the bounds risk becoming vacuous even if the chain-rule derivation is formally correct.

Authors: We agree that the manuscript provides no explicit construction of generated hierarchical coverings (or associated bound computations) for depth greater than 2, and that the MNIST experiment is restricted to a two-layer network. The framework defines the coverings recursively via the multilevel structure, but the question of efficient, non-vacuous realizations for deeper nets is left open. We will revise the manuscript to state this limitation explicitly and to outline directions for future constructions. revision: yes
Referee: [Abstract] The uniqueness proof for the multi-scale Gibbs posterior as minimizer of the multilevel entropic objective is stated as following from the chain rule of relative entropy, but the load-bearing step is whether the regularization parameters remain independent of the fitted model or reduce to a tautology once the coverings are chosen; this needs explicit verification in the derivation.

Authors: The regularization parameters are chosen independently of any fitted model; they are determined by the (fixed) hierarchical covering before the optimization begins. Uniqueness follows from the strict convexity of relative entropy together with the additive chain-rule decomposition of the multilevel objective. We will expand the relevant derivation section to include an explicit verification of parameter independence and uniqueness. revision: yes

standing simulated objections not resolved

Explicit construction of generated hierarchical coverings for neural nets of depth greater than 2 that keep per-level mutual information terms finite and contractive.

Circularity Check

2 steps flagged

Multilevel entropic regularization and Gibbs posterior defined interdependently by construction

specific steps

self definitional [Abstract]
"This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum."

The multilevel entropic regularization is constructed directly from the generalization bounds; the multi-scale Gibbs is then defined as the distribution minimizing that regularization (standard property of Gibbs posteriors), so the 'proof' of unique minimum reduces to the definition by construction rather than an independent result.
self citation load bearing [Abstract]
"The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS'18."

The bounds (and therefore the regularization they induce) depend on the chaining mutual information technique and the existence of generated hierarchical coverings, both justified only by citation to the authors' own prior NeurIPS'18 paper; no independent construction or verification is supplied here for the multilevel case.

full rationale

The derivation introduces generated hierarchical coverings and applies the chaining MI technique (cited from authors' prior work) to obtain generalization bounds; these bounds directly define the multilevel entropic regularization term in the ERM objective. The multi-scale Gibbs posterior is then introduced and 'proven' to achieve the unique minimum of that objective. This minimum property holds by the standard definition of the Gibbs distribution for any relative-entropy regularizer, rendering the central claim tautological. The chaining technique and coverings are load-bearing but rest on self-citation without new independent verification for depth >2. This produces partial circularity in the claimed training procedure and guarantees.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger populated from abstract only; the central claims rest on an extension of prior chaining mutual information work and on the existence of suitable hierarchical coverings whose properties are not independently verified here.

axioms (1)

domain assumption The chaining mutual information technique from Asadi et al. NeurIPS'18 extends to generated hierarchical coverings of neural nets at multiple levels.
Invoked to obtain the multilevel bounds and the corresponding regularization.

invented entities (1)

multi-scale generalization of the Gibbs posterior distribution no independent evidence
purpose: Unique minimizer of the multilevel entropic regularized empirical risk minimization problem
Introduced to resolve the minimization and to define the new training procedure; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5707 in / 1417 out tokens · 26428 ms · 2026-05-25T15:43:37.990855+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 13 internal anchors

[1]

Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

OlivierCatoni. PAC-Bayesiansupervisedclassiﬁcation: thethermodynamicsofstatistical learning. arXiv preprint arXiv:0712.0248, 2007

work page internal anchor Pith review Pith/arXiv arXiv 2007
[2]

Controllingbiasinadaptivedataanalysisusinginformation theory

DanielRussoandJamesZou. Controllingbiasinadaptivedataanalysisusinginformation theory. InArtiﬁcial Intelligence and Statistics, pages 1232–1240, 2016

work page 2016
[3]

Information-theoreticanalysisofgeneralizationcapability of learning algorithms

AolinXuandMaximRaginsky. Information-theoreticanalysisofgeneralizationcapability of learning algorithms. InAdvances in Neural Information Processing Systems, pages 2524–2533, 2017

work page 2017
[4]

Asadi, Emmanuel Abbe, and Sergio Verdú

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. InAdvances in Neural Information Processing Systems, pages 7234–7243, 2018

work page 2018
[5]

Identity Matters in Deep Learning

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

PeterL.Bartlett, StevenN.Evans, andPhilipM.Long. Representingsmoothfunctionsas compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[8]

On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999

Nicoló Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999

work page 1999
[9]

A chaining algorithm for online nonpara- metric regression

Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonpara- metric regression. InConference on Learning Theory, pages 764–796, 2015

work page 2015
[10]

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[11]

Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. InReadings in computer vision, pages 564–584. Elsevier, 1987

work page 1987
[12]

Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011

work page 2011
[13]

A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019

Benjamin Guedj. A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019

work page arXiv 1901
[14]

PAC-Bayesian generic chaining

Jean-Yves Audibert and Olivier Bousquet. PAC-Bayesian generic chaining. InAdvances in neural information processing systems, pages 1125–1132, 2004

work page 2004
[15]

Information-theoretic analysis of stability and bias of learning algorithms

Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016. 13

work page 2016
[16]

Dependence measures bounding the exploration bias for general measurements

Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Dependence measures bounding the exploration bias for general measurements. In2017 IEEE International Symposium on Information Theory (ISIT), pages 1475–1479. IEEE, 2017

work page 2017
[17]

Generalization error bounds for noisy, iterative algorithms

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018

work page 2018
[18]

Learners that Use Little Information

Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoﬀ. Learners that use little information.arXiv preprint arXiv:1710.05233, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Veeravalli

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error.arXiv preprint arXiv:1901.04609, 2019

work page arXiv 1901
[20]

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.arXiv preprint arXiv:1707.09564, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a PAC-Bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via diﬀerential privacy. InAdvances in Neural Information Processing Systems, pages 8430–8441, 2018

work page 2018
[24]

Tsybakov

Philippe Rigollet and Alexandre B. Tsybakov. Sparse estimation by exponential weight- ing. Statistical Science, 27(4):558–575, 2012

work page 2012
[25]

Theoretical analysis of a class of randomized regularization methods

Tong Zhang. Theoretical analysis of a class of randomized regularization methods. In Proceedings of the twelfth annual conference on Computational learning theory, pages 156–163. ACM, 1999

work page 1999
[26]

Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006

Tong Zhang. Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006

work page 2006
[27]

Information-theoretic upper and lower bounds for statistical estimation

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006

work page 2006
[28]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys.arXiv preprint arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[29]

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Gintare Karolina Dziugaite and Daniel M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of entropy-SGD and data-dependent priors. arXiv preprint arXiv:1712.09376, 2017. 14

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Asadi, Emmanuel Abbe, and Sergio Verdú

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Compressing data on graphs with clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1583–1587. IEEE, 2017

work page 2017
[32]

On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017

Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017

work page 2017
[33]

Probability in high dimension

Ramon van Handel. Probability in high dimension. [Online]. Available: https: // www. princeton. edu/ ~rvan/ APC550. pdf, Dec. 21 2016

work page 2016
[34]

Cambridge Series in Statistical and Probabilistic Mathematics

Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018

work page 2018
[35]

Springer Science & Business Media, 2014

Michel Talagrand.Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014

work page 2014
[36]

Evaluations de processus Gaussiens composes

Xavier Fernique. Evaluations de processus Gaussiens composes. InProbability in Banach Spaces, pages 67–83. Springer, 1976

work page 1976
[37]

Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998

Sergio Verdú. Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998

work page 2057
[38]

Springer Science & Business Media, 2011

Erhan Çınlar.Probability and stochastics, volume 261. Springer Science & Business Media, 2011

work page 2011
[39]

Bartlett, Dylan J

Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. InAdvances in Neural Information Processing Systems, pages 6240–6249, 2017

work page 2017
[40]

Bartlett.Neural network learning: Theoretical foundations

Martin Anthony and Peter L. Bartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009

work page 2009
[41]

Distribution-Dependent Analysis of Gibbs-ERM Principle

Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-dependent analysis of Gibbs-ERM principle.arXiv preprint arXiv:1902.01846, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[42]

Jean-Francois Bercher. A simple probabilistic construction yielding generalized entropies and divergences, escort distributions and q-gaussians.Physica A: Statistical Mechanics and its Applications, 391(19):4460–4469, 2012

work page 2012
[43]

Rényi divergence and Kullback-Leibler divergence

Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014

work page 2014
[44]

Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992

work page 1992
[45]

Stochastic gradient Hamiltonian Monte Carlo

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014

work page 2014
[46]

On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016

Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016. 15

work page 2016
[47]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. John Wiley & Sons, 2012

work page 2012
[48]

α-mutual information

Sergio Verdú. α-mutual information. In 2015 Information Theory and Applications Workshop (ITA), pages 1–6. IEEE, 2015

work page 2015
[49]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.arXiv preprint arXiv:1704.00805, 2017. 16 A Information-theoretic tools Deﬁnition 3 (Relative information). Given probability measuresP and Q deﬁned on a measurable space(A, F ), such thatP ≪Q, the relative information betweenP a...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[1] [1]

Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

OlivierCatoni. PAC-Bayesiansupervisedclassiﬁcation: thethermodynamicsofstatistical learning. arXiv preprint arXiv:0712.0248, 2007

work page internal anchor Pith review Pith/arXiv arXiv 2007

[2] [2]

Controllingbiasinadaptivedataanalysisusinginformation theory

DanielRussoandJamesZou. Controllingbiasinadaptivedataanalysisusinginformation theory. InArtiﬁcial Intelligence and Statistics, pages 1232–1240, 2016

work page 2016

[3] [3]

Information-theoreticanalysisofgeneralizationcapability of learning algorithms

AolinXuandMaximRaginsky. Information-theoreticanalysisofgeneralizationcapability of learning algorithms. InAdvances in Neural Information Processing Systems, pages 2524–2533, 2017

work page 2017

[4] [4]

Asadi, Emmanuel Abbe, and Sergio Verdú

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. InAdvances in Neural Information Processing Systems, pages 7234–7243, 2018

work page 2018

[5] [5]

Identity Matters in Deep Learning

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[6] [6]

Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

PeterL.Bartlett, StevenN.Evans, andPhilipM.Long. Representingsmoothfunctionsas compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[8] [8]

On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999

Nicoló Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences.The Annals of Statistics, 27(6):1865–1895, 1999

work page 1999

[9] [9]

A chaining algorithm for online nonpara- metric regression

Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonpara- metric regression. InConference on Learning Theory, pages 764–796, 2015

work page 2015

[10] [10]

Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning

Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [11]

Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. InReadings in computer vision, pages 564–584. Elsevier, 1987

work page 1987

[12] [12]

Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011

work page 2011

[13] [13]

A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019

Benjamin Guedj. A primer on PAC-bayesian learning.arXiv preprint arXiv:1901.05353, 2019

work page arXiv 1901

[14] [14]

PAC-Bayesian generic chaining

Jean-Yves Audibert and Olivier Bousquet. PAC-Bayesian generic chaining. InAdvances in neural information processing systems, pages 1125–1132, 2004

work page 2004

[15] [15]

Information-theoretic analysis of stability and bias of learning algorithms

Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016. 13

work page 2016

[16] [16]

Dependence measures bounding the exploration bias for general measurements

Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Dependence measures bounding the exploration bias for general measurements. In2017 IEEE International Symposium on Information Theory (ISIT), pages 1475–1479. IEEE, 2017

work page 2017

[17] [17]

Generalization error bounds for noisy, iterative algorithms

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018

work page 2018

[18] [18]

Learners that Use Little Information

Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoﬀ. Learners that use little information.arXiv preprint arXiv:1710.05233, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Veeravalli

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error.arXiv preprint arXiv:1901.04609, 2019

work page arXiv 1901

[20] [20]

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[21] [21]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.arXiv preprint arXiv:1707.09564, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a PAC-Bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via diﬀerential privacy. InAdvances in Neural Information Processing Systems, pages 8430–8441, 2018

work page 2018

[24] [24]

Tsybakov

Philippe Rigollet and Alexandre B. Tsybakov. Sparse estimation by exponential weight- ing. Statistical Science, 27(4):558–575, 2012

work page 2012

[25] [25]

Theoretical analysis of a class of randomized regularization methods

Tong Zhang. Theoretical analysis of a class of randomized regularization methods. In Proceedings of the twelfth annual conference on Computational learning theory, pages 156–163. ACM, 1999

work page 1999

[26] [26]

Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006

Tong Zhang. Fromϵ-entropy to KL-entropy: Analysis of minimum information com- plexity density estimation.The Annals of Statistics, 34(5):2180–2210, 2006

work page 2006

[27] [27]

Information-theoretic upper and lower bounds for statistical estimation

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006

work page 2006

[28] [28]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys.arXiv preprint arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[29] [29]

Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Gintare Karolina Dziugaite and Daniel M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of entropy-SGD and data-dependent priors. arXiv preprint arXiv:1712.09376, 2017. 14

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

Asadi, Emmanuel Abbe, and Sergio Verdú

Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Compressing data on graphs with clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1583–1587. IEEE, 2017

work page 2017

[32] [32]

On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017

Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data.The Journal of Machine Learning Research, 18(1):1515–1557, 2017

work page 2017

[33] [33]

Probability in high dimension

Ramon van Handel. Probability in high dimension. [Online]. Available: https: // www. princeton. edu/ ~rvan/ APC550. pdf, Dec. 21 2016

work page 2016

[34] [34]

Cambridge Series in Statistical and Probabilistic Mathematics

Roman Vershynin.High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018

work page 2018

[35] [35]

Springer Science & Business Media, 2014

Michel Talagrand.Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014

work page 2014

[36] [36]

Evaluations de processus Gaussiens composes

Xavier Fernique. Evaluations de processus Gaussiens composes. InProbability in Banach Spaces, pages 67–83. Springer, 1976

work page 1976

[37] [37]

Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998

Sergio Verdú. Fifty years of Shannon theory.IEEE Transactions on information theory, 44(6):2057–2078, 1998

work page 2057

[38] [38]

Springer Science & Business Media, 2011

Erhan Çınlar.Probability and stochastics, volume 261. Springer Science & Business Media, 2011

work page 2011

[39] [39]

Bartlett, Dylan J

Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. InAdvances in Neural Information Processing Systems, pages 6240–6249, 2017

work page 2017

[40] [40]

Bartlett.Neural network learning: Theoretical foundations

Martin Anthony and Peter L. Bartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 2009

work page 2009

[41] [41]

Distribution-Dependent Analysis of Gibbs-ERM Principle

Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-dependent analysis of Gibbs-ERM principle.arXiv preprint arXiv:1902.01846, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[42] [42]

Jean-Francois Bercher. A simple probabilistic construction yielding generalized entropies and divergences, escort distributions and q-gaussians.Physica A: Statistical Mechanics and its Applications, 391(19):4460–4469, 2012

work page 2012

[43] [43]

Rényi divergence and Kullback-Leibler divergence

Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014

work page 2014

[44] [44]

Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992

work page 1992

[45] [45]

Stochastic gradient Hamiltonian Monte Carlo

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014

work page 2014

[46] [46]

On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016

Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of varia- tional approximations of Gibbs posteriors.The Journal of Machine Learning Research, 17(1):8374–8414, 2016. 15

work page 2016

[47] [47]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. John Wiley & Sons, 2012

work page 2012

[48] [48]

α-mutual information

Sergio Verdú. α-mutual information. In 2015 Information Theory and Applications Workshop (ITA), pages 1–6. IEEE, 2015

work page 2015

[49] [49]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.arXiv preprint arXiv:1704.00805, 2017. 16 A Information-theoretic tools Deﬁnition 3 (Relative information). Given probability measuresP and Q deﬁned on a measurable space(A, F ), such thatP ≪Q, the relative information betweenP a...

work page internal anchor Pith review Pith/arXiv arXiv 2017