Bounded-Rationality, Hedging, and Generalization

arxiv: 2605.15340 · v1 · pith:QI6J2BFVnew · submitted 2026-05-14 · 💻 cs.LG

Bounded-Rationality, Hedging, and Generalization

Pedro A. Ortega This is my paper

Pith reviewed 2026-05-19 15:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords generalizationbounded rationalityhedgingf-divergenceresponse lawinformation geometryregularizationmachine learning

0 comments p. Extension

pith:QI6J2BFV Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{QI6J2BFV}

Prints a linked pith:QI6J2BFV badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Generalization is a testable hedging property of a learner's response law to training samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models learning as a bounded-rational decision process in which the learner chooses how strongly the training sample influences its outputs. The response law, or channel from samples to outputs, sets which changes are cheap or costly and thereby induces a lower curve trading training loss against sample dependence and a matching upper curve that certifies generalization. When the response law comes from an f-divergence regularizer, both curves belong to the geometry native to that regularizer, and they can be recovered simply by watching how the learner reacts to rescaled losses and small perturbations. This turns the usual generalization gap into a direct, observable property of the learner rather than an abstract statement about data or model class.

Core claim

The learner's population loss equals its empirical loss plus the distortion caused by the particular training sample. The amount of distortion the learner can tolerate is given by a hedge that is recoverable from its black-box responses. When the response law is induced by an f-divergence regularizer, this hedge and the associated tradeoff curves live inside the regularizer's own information geometry, with the familiar KL case recovering the usual mutual-information bounds.

What carries the argument

The induced channel from samples to outputs, when shaped by an f-divergence regularizer, which supplies both the lower loss-dependence tradeoff curve and the upper generalization certificate curve.

If this is right

If the recovered hedge exceeds the distortion observed from the training sample, the learner generalizes.
Different choices of f-divergence produce different geometries and therefore different certificates for the same observed behavior.
The lower and upper curves can be extracted without access to the learner's internal parameters.
KL regularization recovers the standard information-theoretic generalization bounds as a special case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framing suggests that generalization performance could be improved by explicitly designing the response law to have favorable hedging properties.
It opens the possibility of transferring hedging certificates across different training regimes by matching the observed response curves.
Practical verification would require efficient methods to query a deployed model with modified loss functions.

Load-bearing premise

The learner's mapping from training samples to outputs can be represented by an f-divergence regularizer whose geometry yields recoverable hedging curves.

What would settle it

Extract the hedge by querying the learner on scaled training losses and local perturbations; if on new data the actual population-minus-empirical loss exceeds this hedge, the claim is false.

Figures

Figures reproduced from arXiv: 2605.15340 by Pedro A. Ortega.

**Figure 1.** Figure 1: Recovering a learner’s native frontier and hedge from behavior. Three learners solve the same categorical task, each governed by a different innate regularizer: A: KL, B: Pearson χ 2 , and C: squared Hellinger. For each learner, the solid black curve is the loss frontier: the best attainable loss at each level of native information use. The dashed black curve is the certificate frontier: the protected loss… view at source ↗

**Figure 2.** Figure 2: Geometry of bounded-rational acting under three regularizers. A–C: probability simplex over three actions, showing the prior action distribution P(a) together with the loss and regularizer geometries for KL, Pearson χ 2 , and squared Hellinger. Gray dashed lines are level sets of the loss; black solid curves are level sets of the corresponding divergence. D–F: the same three simplices after combining the l… view at source ↗

**Figure 3.** Figure 3: Bounded-rational acting in a finite categorical model. A: A categorical choice task defined by a 4 × 4 loss matrix ℓ(s, a). B: The induced marginal action distribution P(a) for KL, Pearson χ 2 , and squared Hellinger regularization at a matched information level, I(S; A) ≈ 0.5. C: The KL channel P(a | s) at various operating points; increasing the operating level concentrates mass more strongly on low-loss… view at source ↗

**Figure 4.** Figure 4: Adversarial duality and indifference. We continue with the problem from [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Hedging against stronger perturbations. The task from [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Native frontier geometry and coordinate projections. A: Schematic native geometry for a fixed regularizer f. The lower curve gives the attainable-loss frontier in the (If (S; A), L) plane; channels lie on or above this curve. The upper curve is the matched hedgecertificate frontier. The red segment marks the certificate gap at one operating point. B: Categorical toy problem with the KL, Pearson χ 2 , and … view at source ↗

**Figure 8.** Figure 8: Estimating lower and certificate curves for black-box learners. A: Regression task, with the mean function, a ±2σ noise band, and one sampled training set. B: Two-layer ReLU network with one hidden layer of 32 units, trained on squared loss with training time as the operating control. The solid and dashed curves are the estimated lower (loss) and certificate curves, respectively, both shown in the learner’… view at source ↗

**Figure 9.** Figure 9: Additional f-divergence geometries. Each panel shows the bounded-rational objective on the three-action simplex for one of the regularizers in [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗

read the original abstract

A learner does not only fit data; it also determines how strongly the training sample may shape its output and how much distortion it can hedge. We study this relation as a bounded-rational decision problem whose primitive object is the induced channel from samples to outputs. The learner's response law determines which changes in this channel are cheap or costly, and therefore induces both a lower tradeoff curve between training loss and sample dependence and a matched upper certificate curve. When the response law is represented by an $f$-divergence regularizer, these curves live in the regularizer's native information geometry, with KL as the special case corresponding to Shannon mutual information. We show how the hedge and the two curves can be recovered from black-box behavior by observing responses to scaled losses and local loss perturbations. In learning, population loss is empirical loss plus the distortion induced by the particular training sample. The recovered hedge gives a practical certificate when it covers that distortion. Thus generalization is treated as a testable hedging property of the learner's own response law.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Ortega frames generalization as an internal hedging property of the learner's response law that can be recovered black-box from scaled losses and perturbations, but the recovery may not uniquely fix the regularizer.

read the letter

The key thing here is that the paper models the learner as a bounded-rational agent whose response law to training samples creates both a lower tradeoff curve between loss and sample dependence and a matching upper certificate curve. When the response law is an f-divergence regularizer, these live in its native geometry, and the hedge can be pulled out by watching responses to scaled losses and local perturbations. KL is the special case that recovers the usual mutual information view. This is the main new angle: generalization as a testable hedging property rather than an external bound. The paper does a clean job spelling out how population loss equals empirical loss plus the distortion from the particular sample, and how the recovered hedge can serve as a certificate if it covers that distortion. It also keeps the framing decision-theoretic and geometric without forcing everything into standard PAC or mutual-information language. The soft spot is the recovery step. Different f-divergences or parameter choices could produce the same observed responses on the tested scalings and perturbations, leaving the recovered hedge ambiguous and the certificate unreliable. The paper would be tighter with an explicit uniqueness or identifiability argument for the map from black-box behavior back to the regularizer. This is aimed at theorists who already work with information geometry or decision-theoretic accounts of learning and want a different route to certificates. A reader looking for alternatives to external bounds will get something useful from the framing even if the recovery needs more work. I would send it to peer review so the derivations can be checked and the community can test whether the hedging view actually produces usable certificates in practice.

Referee Report

2 major / 2 minor

Summary. The paper frames generalization in learning as a bounded-rationality hedging problem. The learner's response law, when represented by an f-divergence regularizer, induces a lower tradeoff curve between training loss and sample dependence, and a matched upper certificate curve in the regularizer's information geometry. These quantities, including the hedge, can be recovered from black-box observations of the learner's responses to scaled losses and local perturbations. The recovered hedge serves as a certificate for the distortion induced by the training sample in the population loss, treating generalization as a testable property of the response law. KL divergence corresponds to Shannon mutual information as a special case.

Significance. If the central claims hold, particularly the unique recoverability of the hedge and curves from black-box behavior and their ability to certify generalization, this work could offer a novel decision-theoretic perspective on generalization that unifies information geometry with learning theory. It provides a way to test hedging properties directly from observable learner behavior rather than relying on traditional complexity measures.

major comments (2)

The recoverability of the hedge and curves from responses to scaled losses and local perturbations is central to the practical certificate claim, but the abstract provides no explicit inversion map or identifiability conditions. If different f-divergences produce observationally equivalent responses under the tested scalings and perturbations, the recovered regularizer and its geometry would be ambiguous, undermining the certificate for sample-induced distortion.
The statement that 'the recovered hedge gives a practical certificate when it covers that distortion' requires a demonstration that the upper curve indeed upper-bounds the population loss minus empirical loss due to the sample; without this derivation or a concrete example, it is unclear whether the certificate is non-vacuous or independent of the choice of regularizer.

minor comments (2)

The notation for the 'response law' and 'induced channel' could be clarified with a formal definition early in the paper to aid readability.
Consider adding a reference to prior work on f-divergences in regularization or information geometry to contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address each major comment in turn below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: The recoverability of the hedge and curves from responses to scaled losses and local perturbations is central to the practical certificate claim, but the abstract provides no explicit inversion map or identifiability conditions. If different f-divergences produce observationally equivalent responses under the tested scalings and perturbations, the recovered regularizer and its geometry would be ambiguous, undermining the certificate for sample-induced distortion.

Authors: The full manuscript derives the inversion map explicitly in Section 3.2 via the convex dual of the f-divergence and the observed response function; identifiability follows from strict convexity of f together with the assumption that local perturbations span a neighborhood of the tangent space (Proposition 2). Different f-divergences are distinguishable under these scalings precisely because their induced response laws differ in the second-order curvature terms. We agree the abstract is insufficiently precise on this point and will revise it to state the inversion procedure and the strict-convexity identifiability condition. revision: yes
Referee: The statement that 'the recovered hedge gives a practical certificate when it covers that distortion' requires a demonstration that the upper curve indeed upper-bounds the population loss minus empirical loss due to the sample; without this derivation or a concrete example, it is unclear whether the certificate is non-vacuous or independent of the choice of regularizer.

Authors: Theorem 4 establishes the upper-bound property directly from the variational representation of the f-divergence: the certificate curve is the minimal value of the regularized population loss consistent with the observed hedge, which by construction dominates the sample-induced distortion term. The bound is tight for the recovered regularizer and therefore non-vacuous by definition; independence from an arbitrary choice follows because the regularizer itself is recovered from data. We will add a short numerical example (KL and exponential cases) in the revised version to illustrate the numerical gap between the certificate and the realized distortion. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent recovery method

full rationale

The abstract frames generalization as a hedging property induced by the learner's response law under an f-divergence regularizer, with curves recovered from black-box observations of scaled losses and perturbations. No equation or step is shown that defines the hedge or curves directly in terms of the recovered quantities themselves, nor does any load-bearing claim reduce to a self-citation or fitted input renamed as prediction. The recovery procedure is asserted as a contribution rather than presupposed by construction, leaving the central claims with independent content relative to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central construction rests on the assumption that any response law can be represented by an f-divergence regularizer and that the resulting geometry directly supplies both the lower tradeoff and upper certificate curves.

axioms (1)

domain assumption The learner's response law determines which changes in the induced channel from samples to outputs are cheap or costly.
This is the primitive that induces the lower tradeoff curve and matched upper certificate curve.

pith-pipeline@v0.9.0 · 5700 in / 1304 out tokens · 46572 ms · 2026-05-19T15:51:16.070095+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

When the response law is represented by an f-divergence regularizer, these curves live in the regularizer’s native information geometry, with KL as the special case corresponding to Shannon mutual information. ... the hedge and the two curves can be recovered from black-box behavior by observing responses to scaled losses and local loss perturbations.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the certificate value Ladv_f := sup ... = L + (1/β)If(S;A)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 4 internal anchors

[1]

McAllester

David A. McAllester. Some PAC-Bayesian Theorems.Machine Learning, 37(3):355–363, 1999

work page 1999
[2]

Institute of Mathematical Statistics, 2007

Olivier Catoni.PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, 2007

work page 2007
[3]

Information-Theoretic Analysis of Generalization Capability of Learning Algorithms

Aolin Xu and Maxim Raginsky. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms. InAdvances in Neural Information Processing Systems 30, pages 2524–2533, 2017

work page 2017
[4]

Reasoning About Generalization via Conditional Mutual Information

Thomas Steinke and Lydia Zakynthinou. Reasoning About Generalization via Conditional Mutual Information. InProceedings of the Thirty Third Conference on Learning Theory, pages 3437–3452, 2020

work page 2020
[5]

General- ization Bounds: Perspectives from Information Theory and PAC-Bayes.arXiv preprint arXiv:2309.04381, 2024

Fredrik Hellström, Giuseppe Durisi, Benjamin Guedj, and Maxim Raginsky. General- ization Bounds: Perspectives from Information Theory and PAC-Bayes.arXiv preprint arXiv:2309.04381, 2024

work page arXiv 2024
[6]

Ortega and Daniel A

Pedro A. Ortega and Daniel A. Braun. Thermodynamics as a Theory of Decision-Making with Information-Processing Costs.Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683, 2013

work page 2013
[7]

Information-Theoretic Bounded Rationality

Pedro A. Ortega, Daniel A. Braun, Justin S. Dyer, Kee-Eung Kim, and Naftali Tishby. Information-Theoretic Bounded Rationality.arXiv preprint arXiv:1512.06789, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[8]

Markov Processes and the H-Theorem.Journal of the Physical Society of Japan, 18(3):328–331, 1963

Tadao Morimoto. Markov Processes and the H-Theorem.Journal of the Physical Society of Japan, 18(3):328–331, 1963

work page 1963
[9]

Ali and Samuel D

Syed M. Ali and Samuel D. Silvey. A General Class of Coefficients of Divergence of One Distribution from Another.Journal of the Royal Statistical Society: Series B, 28(1):131–142, 1966

work page 1966
[10]

Information-Type Measures of Difference of Probability Distributions and Indirect Observations.Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967

Imre Csiszár. Information-Type Measures of Difference of Probability Distributions and Indirect Observations.Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967. 21

work page 1967
[11]

On the f-Divergence and Singular Statistical Experiments.Studia Scientiarum Mathematicarum Hungarica, 3:167–174, 1968

Igor Vajda. On the f-Divergence and Singular Statistical Experiments.Studia Scientiarum Mathematicarum Hungarica, 3:167–174, 1968

work page 1968
[12]

Claude E. Shannon. Coding Theorems for a Discrete Source with a Fidelity Criterion.IRE National Convention Record, 7(4):142–163, 1959

work page 1959
[13]

Prentice- Hall, 1971

Toby Berger.Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice- Hall, 1971

work page 1971
[14]

Rate Distortion Theory with Generalized Information Measures via Convex Programming Duality.IEEE Transactions on Information Theory, 32(5): 630–641, 1986

Aharon Ben-Tal and Marc Teboulle. Rate Distortion Theory with Generalized Information Measures via Convex Programming Duality.IEEE Transactions on Information Theory, 32(5): 630–641, 1986

work page 1986
[15]

Ortega and Daniel D

Pedro A. Ortega and Daniel D. Lee. An Adversarial Interpretation of Information-Theoretic Bounded Rationality. InProceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 2117–2124, 2014

work page 2014
[16]

Ortega and Alan A

Pedro A. Ortega and Alan A. Stocker. Human Decision-Making under Limited Time. In Advances in Neural Information Processing Systems 29, pages 2145–2153, 2016

work page 2016
[17]

Revealed Preference, Rational Inattention, and Costly Information Acquisition.American Economic Review, 105(7):2183–2203, 2015

Andrew Caplin and Mark Dean. Revealed Preference, Rational Inattention, and Costly Information Acquisition.American Economic Review, 105(7):2183–2203, 2015

work page 2015
[18]

Braun and Pedro A

Daniel A. Braun and Pedro A. Ortega. Information-Theoretic Bounded Rationality and epsilon-Optimality.Entropy, 16(8):4662–4676, 2014

work page 2014
[19]

Tyrrell Rockafellar.Convex Analysis

R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press, 1970

work page 1970
[20]

Penalty Functions and Duality in Stochastic Programming viaϕ-Divergence Functionals.Mathematics of Operations Research, 12(2):224–240, 1987

Aharon Ben-Tal and Marc Teboulle. Penalty Functions and Duality in Stochastic Programming viaϕ-Divergence Functionals.Mathematics of Operations Research, 12(2):224–240, 1987

work page 1987
[21]

On Divergences and Informations in Statistics and Information Theory.IEEE Transactions on Information Theory, 52(10):4394–4412, 2006

Friedrich Liese and Igor Vajda. On Divergences and Informations in Statistics and Information Theory.IEEE Transactions on Information Theory, 52(10):4394–4412, 2006

work page 2006
[22]

A Class of Measures of Informativity of Observation Channels.Periodica Mathematica Hungarica, 2(1–4):191–213, 1972

Imre Csiszár. A Class of Measures of Informativity of Observation Channels.Periodica Mathematica Hungarica, 2(1–4):191–213, 1972

work page 1972
[23]

On Functionals Satisfying a Data-Processing Theorem.IEEE Transactions on Information Theory, 19(3):275–283, 1973

Jacob Ziv and Moshe Zakai. On Functionals Satisfying a Data-Processing Theorem.IEEE Transactions on Information Theory, 19(3):275–283, 1973

work page 1973
[24]

Generalized Cutoff Rates and Rényi’s Information Measures.IEEE Transactions on Information Theory, 41(1):26–34, 1995

Imre Csiszár. Generalized Cutoff Rates and Rényi’s Information Measures.IEEE Transactions on Information Theory, 41(1):26–34, 1995

work page 1995
[25]

On Csiszár’s f-Divergences and Informativities.Entropy, 20(2): 105, 2018

Igal Sason and Sergio Verdú. On Csiszár’s f-Divergences and Informativities.Entropy, 20(2): 105, 2018

work page 2018
[26]

On Conjugate Convex Functions.Canadian Journal of Mathematics, 1:73–77, 1949

Werner Fenchel. On Conjugate Convex Functions.Canadian Journal of Mathematics, 1:73–77, 1949

work page 1949
[27]

On General Minimax Theorems.Pacific Journal of Mathematics, 8(1):171–176, 1958

Maurice Sion. On General Minimax Theorems.Pacific Journal of Mathematics, 8(1):171–176, 1958

work page 1958
[28]

Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, and Pedro A. Ortega. Your Policy Regularizer Is Secretly an Adversary.Transactions on Machine Learning Research, 2022. 22

work page 2022
[29]

A Generalization of the Rate-Distortion Theory and Applications

Moshe Zakai and Jacob Ziv. A Generalization of the Rate-Distortion Theory and Applications. InInformation Theory: New Trends and Open Problems, pages 87–123. Springer, 1975

work page 1975
[30]

Richard E. Blahut. Computation of Channel Capacity and Rate-Distortion Functions.IEEE Transactions on Information Theory, 18(4):460–473, 1972

work page 1972
[31]

An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels.IEEE Transactions on Information Theory, 18(1):14–20, 1972

Suguru Arimoto. An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels.IEEE Transactions on Information Theory, 18(1):14–20, 1972

work page 1972
[32]

Training Region-Based Object Detectors with Online Hard Example Mining

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training Region-Based Object Detectors with Online Hard Example Mining. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016

work page 2016
[33]

Understanding Black-Box Predictions via Influence Functions

Pang Wei Koh and Percy Liang. Understanding Black-Box Predictions via Influence Functions. InProceedings of the 34th International Conference on Machine Learning, pages 1885–1894, 2017

work page 2017
[34]

Data-Driven Distributionally Robust Optimiza- tion Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-Driven Distributionally Robust Optimiza- tion Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. Mathematical Programming, 171(1–2):115–166, 2018

work page 2018
[35]

Duchi and Hongseok Namkoong

John C. Duchi and Hongseok Namkoong. Learning Models with Uniform Performance via Distributionally Robust Optimization.The Annals of Statistics, 49(3):1378–1406, 2021

work page 2021
[36]

Ortega and Daniel A

Pedro A. Ortega and Daniel A. Braun. Information, Utility and Bounded Rationality. In Artificial General Intelligence, pages 269–274, 2011

work page 2011
[37]

Christopher A. Sims. Implications of Rational Inattention.Journal of Monetary Economics, 50 (3):665–690, 2003

work page 2003
[38]

Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model.American Economic Review, 105(1):272–298, 2015

Filip Matějka and Alisdair McKay. Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model.American Economic Review, 105(1):272–298, 2015

work page 2015
[39]

f-Divergences and Their Applications in Lossy Compression and Bounding Generalization Error.IEEE Transactions on Information Theory, 69(12):7538–7564, 2023

Saeed Masiha, Amin Gohari, and Mohammad Hossein Yassaee. f-Divergences and Their Applications in Lossy Compression and Bounding Generalization Error.IEEE Transactions on Information Theory, 69(12):7538–7564, 2023

work page 2023
[40]

Howard and James E

Ronald A. Howard and James E. Matheson. Risk-Sensitive Markov Decision Processes.Man- agement Science, 18(7):356–369, 1972

work page 1972
[41]

Jacobson

David H. Jacobson. Optimal Stochastic Linear Systems with Exponential Performance Criteria and Their Relation to Deterministic Differential Games.IEEE Transactions on Automatic Control, 18(2):124–131, 1973

work page 1973
[42]

Wiley, 1990

Peter Whittle.Risk-Sensitive Optimal Control. Wiley, 1990

work page 1990
[43]

Lars Peter Hansen and Thomas J. Sargent. Robust Control and Model Uncertainty.American Economic Review, 91(2):60–66, 2001

work page 2001
[44]

Sargent.Robustness

Lars Peter Hansen and Thomas J. Sargent.Robustness. Princeton University Press, 2008

work page 2008
[45]

A Theory of Regularized Markov Decision Processes

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A Theory of Regularized Markov Decision Processes. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2160–2169, 2019. 23

work page 2019
[46]

Regularized Policies Are Reward Robust

Hisham Husain, Kamil Ciosek, and Ryota Tomioka. Regularized Policies Are Reward Robust. InProceedings of the Twenty-Fourth International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 64–72, 2021

work page 2021
[47]

Twice Regularized MDPs and the Equiva- lence between Robustness and Regularization

Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equiva- lence between Robustness and Regularization. InAdvances in Neural Information Processing Systems 34, pages 22274–22287, 2021

work page 2021
[48]

Maximum Entropy RL (Provably) Solves Some Robust RL Problems

Benjamin Eysenbach and Sergey Levine. Maximum Entropy RL (Provably) Solves Some Robust RL Problems. InInternational Conference on Learning Representations, 2022

work page 2022
[49]

Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems.Operations Research, 58(3):595–612, 2010

Erick Delage and Yinyu Ye. Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems.Operations Research, 58(3):595–612, 2010

work page 2010
[50]

Distributionally Robust Optimization and Its Tractable Approxi- mations.Operations Research, 58(4-part-1):902–917, 2010

Joel Goh and Melvyn Sim. Distributionally Robust Optimization and Its Tractable Approxi- mations.Operations Research, 58(4-part-1):902–917, 2010

work page 2010
[51]

Robust Solutions of Optimization Problems Affected by Uncertain Probabilities.Management Science, 59(2):341–357, 2013

Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities.Management Science, 59(2):341–357, 2013

work page 2013
[52]

Jeff Hong

Zhaolin Hu and L. Jeff Hong. Kullback-Leibler Divergence Constrained Distributionally Robust Optimization.Operations Research Letters, 41(3):271–275, 2013

work page 2013
[53]

Distributionally Robust Convex Opti- mization.Operations Research, 62(6):1358–1376, 2014

Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally Robust Convex Opti- mization.Operations Research, 62(6):1358–1376, 2014

work page 2014
[54]

Frameworks and Results in Distributionally Robust Optimization.Open Journal of Mathematical Optimization, 3:1–85, 2022

Hamed Rahimian and Sanjay Mehrotra. Frameworks and Results in Distributionally Robust Optimization.Open Journal of Mathematical Optimization, 3:1–85, 2022

work page 2022
[55]

Güzin Bayraksan and David K. Love. Data-Driven Stochastic Programming Using Phi- Divergences. InINFORMS TutORials in Operations Research, pages 1–19. 2015

work page 2015
[56]

Particle diffusion and localized acceleration in inhomogeneous AGN jets - Part II: stochastic variation

David K. Love and Güzin Bayraksan. Phi-Divergence Constrained Ambiguous Stochastic Programs for Data-Driven Optimization.arXiv preprint arXiv:1603.00900, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[57]

Robust Sensitivity Analysis for Stochastic Systems.Mathematics of Operations Research, 41(4):1248–1275, 2016

Henry Lam. Robust Sensitivity Analysis for Stochastic Systems.Mathematics of Operations Research, 41(4):1248–1275, 2016

work page 2016
[58]

PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification

Matthias Seeger. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification. Journal of Machine Learning Research, 3:233–269, 2002

work page 2002
[59]

Gintare Karolina Dziugaite and Daniel M. Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017

work page 2017
[60]

Gintare Karolina Dziugaite and Daniel M. Roy. Data-Dependent PAC-Bayes Priors via Differential Privacy.arXiv preprint arXiv:1802.09583, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[61]

Information- Theoretic Analysis of Stability and Bias of Learning Algorithms

Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information- Theoretic Analysis of Stability and Bias of Learning Algorithms. In2016 IEEE Information Theory Workshop (ITW), pages 26–30, 2016. 24

work page 2016
[62]

Veeravalli

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening Mutual Information-Based Bounds on Generalization Error.IEEE Journal on Selected Areas in Information Theory, 1(1): 121–130, 2020

work page 2020
[63]

Roy, and Gintare Karolina Dziugaite

Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened Generalization Bounds Based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms. InAdvances in Neural Information Processing Systems 33, pages 14177–14188, 2020

work page 2020
[64]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen- eralization Beyond Overfitting on Small Algorithmic Datasets.arXiv preprint arXiv:2201.02177, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[65]

Michaud, and Max Tegmark

Ziming Liu, Eric J. Michaud, and Max Tegmark. Omnigrok: Grokking Beyond Algorithmic Data. InInternational Conference on Learning Representations, 2023

work page 2023
[66]

Reconciling Modern Machine- Learning Practice and the Classical Bias–Variance Trade-Off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling Modern Machine- Learning Practice and the Classical Bias–Variance Trade-Off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

work page 2019
[67]

Deep Double Descent: Where Bigger Models and More Data Hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep Double Descent: Where Bigger Models and More Data Hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021
[68]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. InInternational Conference on Learning Representations, 2017

work page 2017
[69]

Cohen, Simran Kaur, Yuanzhi Li, J

Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. InInternational Conference on Learning Representations, 2021. A Mathematical Derivations A.1 Derivation of the Optimal Adversarial Perturbation We seek the perturbationCs(a)that the adversary applies...

work page 2021

[1] [1]

McAllester

David A. McAllester. Some PAC-Bayesian Theorems.Machine Learning, 37(3):355–363, 1999

work page 1999

[2] [2]

Institute of Mathematical Statistics, 2007

Olivier Catoni.PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, 2007

work page 2007

[3] [3]

Information-Theoretic Analysis of Generalization Capability of Learning Algorithms

Aolin Xu and Maxim Raginsky. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms. InAdvances in Neural Information Processing Systems 30, pages 2524–2533, 2017

work page 2017

[4] [4]

Reasoning About Generalization via Conditional Mutual Information

Thomas Steinke and Lydia Zakynthinou. Reasoning About Generalization via Conditional Mutual Information. InProceedings of the Thirty Third Conference on Learning Theory, pages 3437–3452, 2020

work page 2020

[5] [5]

General- ization Bounds: Perspectives from Information Theory and PAC-Bayes.arXiv preprint arXiv:2309.04381, 2024

Fredrik Hellström, Giuseppe Durisi, Benjamin Guedj, and Maxim Raginsky. General- ization Bounds: Perspectives from Information Theory and PAC-Bayes.arXiv preprint arXiv:2309.04381, 2024

work page arXiv 2024

[6] [6]

Ortega and Daniel A

Pedro A. Ortega and Daniel A. Braun. Thermodynamics as a Theory of Decision-Making with Information-Processing Costs.Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683, 2013

work page 2013

[7] [7]

Information-Theoretic Bounded Rationality

Pedro A. Ortega, Daniel A. Braun, Justin S. Dyer, Kee-Eung Kim, and Naftali Tishby. Information-Theoretic Bounded Rationality.arXiv preprint arXiv:1512.06789, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[8] [8]

Markov Processes and the H-Theorem.Journal of the Physical Society of Japan, 18(3):328–331, 1963

Tadao Morimoto. Markov Processes and the H-Theorem.Journal of the Physical Society of Japan, 18(3):328–331, 1963

work page 1963

[9] [9]

Ali and Samuel D

Syed M. Ali and Samuel D. Silvey. A General Class of Coefficients of Divergence of One Distribution from Another.Journal of the Royal Statistical Society: Series B, 28(1):131–142, 1966

work page 1966

[10] [10]

Information-Type Measures of Difference of Probability Distributions and Indirect Observations.Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967

Imre Csiszár. Information-Type Measures of Difference of Probability Distributions and Indirect Observations.Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967. 21

work page 1967

[11] [11]

On the f-Divergence and Singular Statistical Experiments.Studia Scientiarum Mathematicarum Hungarica, 3:167–174, 1968

Igor Vajda. On the f-Divergence and Singular Statistical Experiments.Studia Scientiarum Mathematicarum Hungarica, 3:167–174, 1968

work page 1968

[12] [12]

Claude E. Shannon. Coding Theorems for a Discrete Source with a Fidelity Criterion.IRE National Convention Record, 7(4):142–163, 1959

work page 1959

[13] [13]

Prentice- Hall, 1971

Toby Berger.Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice- Hall, 1971

work page 1971

[14] [14]

Rate Distortion Theory with Generalized Information Measures via Convex Programming Duality.IEEE Transactions on Information Theory, 32(5): 630–641, 1986

Aharon Ben-Tal and Marc Teboulle. Rate Distortion Theory with Generalized Information Measures via Convex Programming Duality.IEEE Transactions on Information Theory, 32(5): 630–641, 1986

work page 1986

[15] [15]

Ortega and Daniel D

Pedro A. Ortega and Daniel D. Lee. An Adversarial Interpretation of Information-Theoretic Bounded Rationality. InProceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 2117–2124, 2014

work page 2014

[16] [16]

Ortega and Alan A

Pedro A. Ortega and Alan A. Stocker. Human Decision-Making under Limited Time. In Advances in Neural Information Processing Systems 29, pages 2145–2153, 2016

work page 2016

[17] [17]

Revealed Preference, Rational Inattention, and Costly Information Acquisition.American Economic Review, 105(7):2183–2203, 2015

Andrew Caplin and Mark Dean. Revealed Preference, Rational Inattention, and Costly Information Acquisition.American Economic Review, 105(7):2183–2203, 2015

work page 2015

[18] [18]

Braun and Pedro A

Daniel A. Braun and Pedro A. Ortega. Information-Theoretic Bounded Rationality and epsilon-Optimality.Entropy, 16(8):4662–4676, 2014

work page 2014

[19] [19]

Tyrrell Rockafellar.Convex Analysis

R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press, 1970

work page 1970

[20] [20]

Penalty Functions and Duality in Stochastic Programming viaϕ-Divergence Functionals.Mathematics of Operations Research, 12(2):224–240, 1987

Aharon Ben-Tal and Marc Teboulle. Penalty Functions and Duality in Stochastic Programming viaϕ-Divergence Functionals.Mathematics of Operations Research, 12(2):224–240, 1987

work page 1987

[21] [21]

On Divergences and Informations in Statistics and Information Theory.IEEE Transactions on Information Theory, 52(10):4394–4412, 2006

Friedrich Liese and Igor Vajda. On Divergences and Informations in Statistics and Information Theory.IEEE Transactions on Information Theory, 52(10):4394–4412, 2006

work page 2006

[22] [22]

A Class of Measures of Informativity of Observation Channels.Periodica Mathematica Hungarica, 2(1–4):191–213, 1972

Imre Csiszár. A Class of Measures of Informativity of Observation Channels.Periodica Mathematica Hungarica, 2(1–4):191–213, 1972

work page 1972

[23] [23]

On Functionals Satisfying a Data-Processing Theorem.IEEE Transactions on Information Theory, 19(3):275–283, 1973

Jacob Ziv and Moshe Zakai. On Functionals Satisfying a Data-Processing Theorem.IEEE Transactions on Information Theory, 19(3):275–283, 1973

work page 1973

[24] [24]

Generalized Cutoff Rates and Rényi’s Information Measures.IEEE Transactions on Information Theory, 41(1):26–34, 1995

Imre Csiszár. Generalized Cutoff Rates and Rényi’s Information Measures.IEEE Transactions on Information Theory, 41(1):26–34, 1995

work page 1995

[25] [25]

On Csiszár’s f-Divergences and Informativities.Entropy, 20(2): 105, 2018

Igal Sason and Sergio Verdú. On Csiszár’s f-Divergences and Informativities.Entropy, 20(2): 105, 2018

work page 2018

[26] [26]

On Conjugate Convex Functions.Canadian Journal of Mathematics, 1:73–77, 1949

Werner Fenchel. On Conjugate Convex Functions.Canadian Journal of Mathematics, 1:73–77, 1949

work page 1949

[27] [27]

On General Minimax Theorems.Pacific Journal of Mathematics, 8(1):171–176, 1958

Maurice Sion. On General Minimax Theorems.Pacific Journal of Mathematics, 8(1):171–176, 1958

work page 1958

[28] [28]

Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, and Pedro A. Ortega. Your Policy Regularizer Is Secretly an Adversary.Transactions on Machine Learning Research, 2022. 22

work page 2022

[29] [29]

A Generalization of the Rate-Distortion Theory and Applications

Moshe Zakai and Jacob Ziv. A Generalization of the Rate-Distortion Theory and Applications. InInformation Theory: New Trends and Open Problems, pages 87–123. Springer, 1975

work page 1975

[30] [30]

Richard E. Blahut. Computation of Channel Capacity and Rate-Distortion Functions.IEEE Transactions on Information Theory, 18(4):460–473, 1972

work page 1972

[31] [31]

An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels.IEEE Transactions on Information Theory, 18(1):14–20, 1972

Suguru Arimoto. An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels.IEEE Transactions on Information Theory, 18(1):14–20, 1972

work page 1972

[32] [32]

Training Region-Based Object Detectors with Online Hard Example Mining

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training Region-Based Object Detectors with Online Hard Example Mining. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016

work page 2016

[33] [33]

Understanding Black-Box Predictions via Influence Functions

Pang Wei Koh and Percy Liang. Understanding Black-Box Predictions via Influence Functions. InProceedings of the 34th International Conference on Machine Learning, pages 1885–1894, 2017

work page 2017

[34] [34]

Data-Driven Distributionally Robust Optimiza- tion Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-Driven Distributionally Robust Optimiza- tion Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. Mathematical Programming, 171(1–2):115–166, 2018

work page 2018

[35] [35]

Duchi and Hongseok Namkoong

John C. Duchi and Hongseok Namkoong. Learning Models with Uniform Performance via Distributionally Robust Optimization.The Annals of Statistics, 49(3):1378–1406, 2021

work page 2021

[36] [36]

Ortega and Daniel A

Pedro A. Ortega and Daniel A. Braun. Information, Utility and Bounded Rationality. In Artificial General Intelligence, pages 269–274, 2011

work page 2011

[37] [37]

Christopher A. Sims. Implications of Rational Inattention.Journal of Monetary Economics, 50 (3):665–690, 2003

work page 2003

[38] [38]

Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model.American Economic Review, 105(1):272–298, 2015

Filip Matějka and Alisdair McKay. Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model.American Economic Review, 105(1):272–298, 2015

work page 2015

[39] [39]

f-Divergences and Their Applications in Lossy Compression and Bounding Generalization Error.IEEE Transactions on Information Theory, 69(12):7538–7564, 2023

Saeed Masiha, Amin Gohari, and Mohammad Hossein Yassaee. f-Divergences and Their Applications in Lossy Compression and Bounding Generalization Error.IEEE Transactions on Information Theory, 69(12):7538–7564, 2023

work page 2023

[40] [40]

Howard and James E

Ronald A. Howard and James E. Matheson. Risk-Sensitive Markov Decision Processes.Man- agement Science, 18(7):356–369, 1972

work page 1972

[41] [41]

Jacobson

David H. Jacobson. Optimal Stochastic Linear Systems with Exponential Performance Criteria and Their Relation to Deterministic Differential Games.IEEE Transactions on Automatic Control, 18(2):124–131, 1973

work page 1973

[42] [42]

Wiley, 1990

Peter Whittle.Risk-Sensitive Optimal Control. Wiley, 1990

work page 1990

[43] [43]

Lars Peter Hansen and Thomas J. Sargent. Robust Control and Model Uncertainty.American Economic Review, 91(2):60–66, 2001

work page 2001

[44] [44]

Sargent.Robustness

Lars Peter Hansen and Thomas J. Sargent.Robustness. Princeton University Press, 2008

work page 2008

[45] [45]

A Theory of Regularized Markov Decision Processes

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A Theory of Regularized Markov Decision Processes. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2160–2169, 2019. 23

work page 2019

[46] [46]

Regularized Policies Are Reward Robust

Hisham Husain, Kamil Ciosek, and Ryota Tomioka. Regularized Policies Are Reward Robust. InProceedings of the Twenty-Fourth International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 64–72, 2021

work page 2021

[47] [47]

Twice Regularized MDPs and the Equiva- lence between Robustness and Regularization

Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equiva- lence between Robustness and Regularization. InAdvances in Neural Information Processing Systems 34, pages 22274–22287, 2021

work page 2021

[48] [48]

Maximum Entropy RL (Provably) Solves Some Robust RL Problems

Benjamin Eysenbach and Sergey Levine. Maximum Entropy RL (Provably) Solves Some Robust RL Problems. InInternational Conference on Learning Representations, 2022

work page 2022

[49] [49]

Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems.Operations Research, 58(3):595–612, 2010

Erick Delage and Yinyu Ye. Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems.Operations Research, 58(3):595–612, 2010

work page 2010

[50] [50]

Distributionally Robust Optimization and Its Tractable Approxi- mations.Operations Research, 58(4-part-1):902–917, 2010

Joel Goh and Melvyn Sim. Distributionally Robust Optimization and Its Tractable Approxi- mations.Operations Research, 58(4-part-1):902–917, 2010

work page 2010

[51] [51]

Robust Solutions of Optimization Problems Affected by Uncertain Probabilities.Management Science, 59(2):341–357, 2013

Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities.Management Science, 59(2):341–357, 2013

work page 2013

[52] [52]

Jeff Hong

Zhaolin Hu and L. Jeff Hong. Kullback-Leibler Divergence Constrained Distributionally Robust Optimization.Operations Research Letters, 41(3):271–275, 2013

work page 2013

[53] [53]

Distributionally Robust Convex Opti- mization.Operations Research, 62(6):1358–1376, 2014

Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally Robust Convex Opti- mization.Operations Research, 62(6):1358–1376, 2014

work page 2014

[54] [54]

Frameworks and Results in Distributionally Robust Optimization.Open Journal of Mathematical Optimization, 3:1–85, 2022

Hamed Rahimian and Sanjay Mehrotra. Frameworks and Results in Distributionally Robust Optimization.Open Journal of Mathematical Optimization, 3:1–85, 2022

work page 2022

[55] [55]

Güzin Bayraksan and David K. Love. Data-Driven Stochastic Programming Using Phi- Divergences. InINFORMS TutORials in Operations Research, pages 1–19. 2015

work page 2015

[56] [56]

Particle diffusion and localized acceleration in inhomogeneous AGN jets - Part II: stochastic variation

David K. Love and Güzin Bayraksan. Phi-Divergence Constrained Ambiguous Stochastic Programs for Data-Driven Optimization.arXiv preprint arXiv:1603.00900, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[57] [57]

Robust Sensitivity Analysis for Stochastic Systems.Mathematics of Operations Research, 41(4):1248–1275, 2016

Henry Lam. Robust Sensitivity Analysis for Stochastic Systems.Mathematics of Operations Research, 41(4):1248–1275, 2016

work page 2016

[58] [58]

PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification

Matthias Seeger. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification. Journal of Machine Learning Research, 3:233–269, 2002

work page 2002

[59] [59]

Gintare Karolina Dziugaite and Daniel M. Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017

work page 2017

[60] [60]

Gintare Karolina Dziugaite and Daniel M. Roy. Data-Dependent PAC-Bayes Priors via Differential Privacy.arXiv preprint arXiv:1802.09583, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[61] [61]

Information- Theoretic Analysis of Stability and Bias of Learning Algorithms

Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information- Theoretic Analysis of Stability and Bias of Learning Algorithms. In2016 IEEE Information Theory Workshop (ITW), pages 26–30, 2016. 24

work page 2016

[62] [62]

Veeravalli

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening Mutual Information-Based Bounds on Generalization Error.IEEE Journal on Selected Areas in Information Theory, 1(1): 121–130, 2020

work page 2020

[63] [63]

Roy, and Gintare Karolina Dziugaite

Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened Generalization Bounds Based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms. InAdvances in Neural Information Processing Systems 33, pages 14177–14188, 2020

work page 2020

[64] [64]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen- eralization Beyond Overfitting on Small Algorithmic Datasets.arXiv preprint arXiv:2201.02177, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[65] [65]

Michaud, and Max Tegmark

Ziming Liu, Eric J. Michaud, and Max Tegmark. Omnigrok: Grokking Beyond Algorithmic Data. InInternational Conference on Learning Representations, 2023

work page 2023

[66] [66]

Reconciling Modern Machine- Learning Practice and the Classical Bias–Variance Trade-Off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling Modern Machine- Learning Practice and the Classical Bias–Variance Trade-Off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

work page 2019

[67] [67]

Deep Double Descent: Where Bigger Models and More Data Hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep Double Descent: Where Bigger Models and More Data Hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

work page 2021

[68] [68]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. InInternational Conference on Learning Representations, 2017

work page 2017

[69] [69]

Cohen, Simran Kaur, Yuanzhi Li, J

Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. InInternational Conference on Learning Representations, 2021. A Mathematical Derivations A.1 Derivation of the Optimal Adversarial Perturbation We seek the perturbationCs(a)that the adversary applies...

work page 2021