Information-Theoretic Generalization Bounds for Stochastic Gradient Descent with Predictable Virtual Noise

Mohammad Partohaghighi

arxiv: 2605.00064 · v1 · submitted 2026-04-30 · 💻 cs.LG

Information-Theoretic Generalization Bounds for Stochastic Gradient Descent with Predictable Virtual Noise

Mohammad Partohaghighi This is my paper

Pith reviewed 2026-05-09 20:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords generalization boundsstochastic gradient descentinformation theoryvirtual perturbationsmutual informationadaptive noiserelative entropymachine learning

0 comments

The pith

Predictable history-adaptive virtual perturbations enable information-theoretic generalization bounds for SGD that incorporate path-dependent noise geometry without changing the algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that virtual Gaussian perturbations added only in the proof can have covariances that depend on past SGD iterations but remain independent of current or future randomness. This predictability supports a conditional relative-entropy calculation that produces generalization bounds with adaptive terms for sensitivity, gradient deviation, and accumulated noise. The framework also introduces a separate covariance-comparison cost when the actual adaptive covariance must be compared against a fixed reference kernel, and it recovers existing fixed-noise bounds as special cases under synchronization rules such as deterministic or prefix-observable covariances.

Core claim

By restricting virtual perturbation covariances to depend solely on past real SGD history, a conditional Gaussian relative-entropy argument yields generalization bounds that replace fixed sensitivity and deviation terms with conditional adaptive counterparts, include an output-sensitivity penalty from accumulated perturbation covariance, and reduce the deviation term to conditional variance under conditional unbiasedness; the bounds hold after separating local smoothing from global reference-kernel comparison and recover fixed isotropic and geometry-aware bounds as special cases.

What carries the argument

Predictable history-adaptive virtual perturbations: auxiliary Gaussian noise whose covariance at each step is a function of past real SGD trajectory but independent of current and future randomness, enabling conditional mutual-information bounds.

If this is right

Bounds replace fixed sensitivity and gradient-deviation terms with conditional adaptive versions that track moving gradient statistics.
An explicit output-sensitivity penalty appears from the accumulated perturbation covariance across iterations.
The deviation term simplifies to conditional variance alone when the noise satisfies conditional unbiasedness.
Fixed-noise-style bounds are recovered exactly when the covariance rule is deterministic, public, or prefix-observable.
The same machinery extends virtual-perturbation analysis to history-dependent SGD variants without modifying the underlying algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same predictability condition could be applied to analyze preconditioned or momentum-based variants by letting the reference kernel encode curvature proxies from earlier steps.
Empirical checks on whether the covariance-comparison cost remains small for common adaptive optimizers would indicate how often the new bounds are meaningfully tighter than fixed-noise versions.
If the reference-kernel comparison cost grows with data dependence, the framework suggests choosing covariances that stay close to a stable public schedule to keep the bound practical.
The separation of local Gaussian smoothing from global kernel comparison may generalize to other information-theoretic proofs that mix data-dependent and data-independent quantities.

Load-bearing premise

The perturbation covariance at each iteration depends only on past real SGD history and is independent of current or future randomness, plus the existence of an admissible reference kernel when the actual covariance is data-dependent.

What would settle it

Construct an SGD run with a covariance rule that violates predictability by depending on the current gradient realization and check whether the derived generalization bound still holds or is violated while a non-predictable analysis remains valid.

read the original abstract

Information-theoretic generalization bounds analyze stochastic optimization by relating expected generalization error to the mutual information between learned parameters and training data. Virtual perturbation analyses of SGD add auxiliary Gaussian noise only in the proof, making mutual information tractable while leaving the actual SGD trajectory unchanged. Existing bounds, however, typically require perturbation covariances to be fixed independently of the optimization history, limiting their ability to represent geometries induced by moving gradient statistics, preconditioners, curvature proxies, and other pathwise information. We introduce predictable history-adaptive virtual perturbations, where the perturbation covariance at each iteration may depend on the past real SGD history but not on current or future randomness. This predictability enables a conditional Gaussian relative-entropy argument and yields generalization bounds for SGD with adaptive virtual-noise geometry. The bounds replace fixed sensitivity and gradient-deviation terms with conditional adaptive counterparts, include an output-sensitivity penalty from accumulated perturbation covariance, and reduce the deviation term to a conditional variance only under conditional unbiasedness. Since adaptive covariances may be data-dependent, we separate local Gaussian smoothing from global reference-kernel comparison. The resulting bound includes a covariance-comparison cost measuring the KL price of using an admissible reference geometry different from the actual adaptive covariance. Fixed-noise-style bounds are recovered under admissible synchronization, such as deterministic, public, or prefix-observable covariance rules. The framework recovers fixed isotropic and geometry-aware bounds as special cases while extending virtual perturbation analysis to history-dependent SGD without modifying the algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper lets virtual noise covariance in SGD generalization bounds depend on past history via a predictability condition and a KL comparison cost, recovering fixed-noise cases cleanly.

read the letter

The main point is that the authors extend virtual perturbation analysis so the added Gaussian noise can have a covariance that depends on the SGD history up to the current step, as long as it ignores current and future randomness. Predictability makes a conditional relative-entropy argument work and produces bounds with adaptive terms instead of fixed sensitivity and deviation quantities. They also add an output-sensitivity penalty tied to accumulated covariance and reduce the deviation term to conditional variance when gradients are unbiased conditionally. The split between local smoothing and a global reference-kernel comparison handles data-dependent covariances by charging an explicit KL cost for the mismatch. Fixed isotropic and geometry-aware bounds drop out as special cases when the covariance rule is deterministic, public, or prefix-observable. The argument avoids circularity because predictability is an explicit modeling assumption rather than something derived from the data or parameters. The reference kernel is chosen as an admissible comparator, not fitted post hoc. One limitation is that the comparison cost can grow large if the reference geometry is a poor match for the actual adaptive covariance, and the paper gives little guidance on picking a useful reference for common adaptive methods. There are also no experiments or tightness checks against prior bounds, so the practical improvement is not yet visible. This is aimed at people already working on information-theoretic generalization for stochastic optimizers. If that is your area, the new condition and the separation into local and global pieces are worth seeing. I would send it to referees; the core construction is explicit and the derivations appear consistent on their own terms.

Referee Report

0 major / 3 minor

Summary. The paper claims to extend information-theoretic generalization bounds for SGD by introducing predictable history-adaptive virtual perturbations. The perturbation covariance at each step may depend on the past real SGD history (but not current or future randomness), enabling a conditional Gaussian relative-entropy argument. This produces bounds that replace fixed sensitivity and gradient-deviation terms with conditional adaptive counterparts, incorporate an output-sensitivity penalty from accumulated covariance, and reduce the deviation term to conditional variance under unbiasedness. Data-dependent covariances are handled by separating local Gaussian smoothing from a global reference-kernel comparison that incurs a KL covariance-comparison cost; fixed-noise bounds are recovered under deterministic, public, or prefix-observable covariance rules. The framework recovers isotropic and geometry-aware fixed-noise bounds as special cases.

Significance. If the derivation holds, the result meaningfully broadens virtual-perturbation analyses to adaptive, history-dependent SGD geometries (preconditioners, curvature proxies, moving gradient statistics) without modifying the underlying algorithm. The explicit separation of local smoothing from global reference comparison, together with the recovery of standard bounds under admissible synchronization rules, addresses a concrete limitation of prior fixed-covariance approaches. The construction appears parameter-free in the sense noted by the axiom ledger and relies on standard mutual-information and conditional-KL properties rather than circular or fitted quantities.

minor comments (3)

The predictability condition (covariance measurable w.r.t. past SGD history only) is load-bearing for the conditional relative-entropy step; a short explicit definition in terms of the underlying filtration would improve readability even if it appears in the full text.
The abstract states that the deviation term reduces to conditional variance 'under conditional unbiasedness'; a brief remark or example clarifying when this holds for common SGD variants would help readers assess applicability.
Notation for the reference kernel and the covariance-comparison cost (KL price) should be introduced with a short table or diagram contrasting the actual adaptive covariance, the reference geometry, and the resulting penalty term.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, the assessment of its significance, and the recommendation for minor revision. The referee's description correctly reflects our introduction of predictable history-adaptive virtual perturbations and the resulting conditional information-theoretic bounds for SGD.

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit modeling assumptions and standard information-theoretic identities

full rationale

The paper defines predictable history-adaptive virtual perturbations as an explicit modeling choice (covariance measurable w.r.t. past SGD history only, independent of current/future randomness). This condition is used to invoke conditional Gaussian relative-entropy and KL comparison costs, which are standard properties of mutual information and conditional divergence rather than self-referential quantities. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or ansatz smuggled from prior work by the same authors. Fixed-noise bounds are recovered as special cases under deterministic/public rules, confirming the framework is self-contained against external benchmarks. The central claims rest on the predictability assumption and admissible reference kernel, both stated as inputs rather than derived outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the ledger is populated from explicit statements in the abstract. The framework relies on standard information-theoretic tools and introduces the predictability modeling choice as the main new assumption.

axioms (2)

standard math Conditional Gaussian relative-entropy inequality holds when covariance is predictable (depends only on past history)
Invoked to obtain the generalization bound after introducing predictable virtual perturbations.
domain assumption Mutual information between parameters and data can be bounded via virtual perturbations without changing the SGD trajectory
Core premise of the virtual perturbation analysis approach stated in the abstract.

pith-pipeline@v0.9.0 · 5555 in / 1430 out tokens · 34025 ms · 2026-05-09T20:46:30.481568+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. InInternational Conference on Machine Learning, 2016

work page 2016
[2]

Fine-grained analysis of stability and generalization for stochastic gradient descent

Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. InInternational Conference on Machine Learning, 2020

work page 2020
[3]

Stability of stochastic gradient descent on nonsmooth convex losses

Raef Bassily, Vitaly Feldman, Crist´obal Guzm´an, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[4]

Sharper bounds for uniformly stable algorithms

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. InConference on Learning Theory, 2020

work page 2020
[5]

Controlling bias in adaptive data analysis using information theory

Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, 2016

work page 2016
[6]

Information-theoretic analysis of generalization capability of learning algorithms

Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. InAdvances in Neural Information Processing Systems, 2017

work page 2017
[7]

Generalization error bounds for noisy, iterative algorithms

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. InIEEE International Symposium on Information Theory, 2018

work page 2018
[8]

Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M. Roy. Information-theoretic generalization bounds for SGLD via data-dependent estimates. InAdvances in Neural Information Processing Systems, 2019

work page 2019
[9]

Roy, and Gintare Karolina Dziugaite

Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[10]

Reasoning about generalization via conditional mutual information

Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. InConference on Learning Theory, 2020

work page 2020
[11]

Gergely Neu, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. InProceedings of Machine Learning Research, 2021

work page 2021
[12]

Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 2011

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 2011. 60

work page 2011
[13]

Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012

work page 2012
[14]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations, 2015. 61 Algorithm 1Proxy estimation for adaptive virtual perturbation diagnostics Require: SGD trajectory {Wt, Gt, Bt}T t=1, learning rates {ηt}T−1 t=1 , actual covariance rule Σt = Φ t(Ht−1), optional admissible reference ...

work page 2015

[1] [1]

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. InInternational Conference on Machine Learning, 2016

work page 2016

[2] [2]

Fine-grained analysis of stability and generalization for stochastic gradient descent

Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. InInternational Conference on Machine Learning, 2020

work page 2020

[3] [3]

Stability of stochastic gradient descent on nonsmooth convex losses

Raef Bassily, Vitaly Feldman, Crist´obal Guzm´an, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. InAdvances in Neural Information Processing Systems, 2020

work page 2020

[4] [4]

Sharper bounds for uniformly stable algorithms

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. InConference on Learning Theory, 2020

work page 2020

[5] [5]

Controlling bias in adaptive data analysis using information theory

Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, 2016

work page 2016

[6] [6]

Information-theoretic analysis of generalization capability of learning algorithms

Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. InAdvances in Neural Information Processing Systems, 2017

work page 2017

[7] [7]

Generalization error bounds for noisy, iterative algorithms

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. InIEEE International Symposium on Information Theory, 2018

work page 2018

[8] [8]

Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M. Roy. Information-theoretic generalization bounds for SGLD via data-dependent estimates. InAdvances in Neural Information Processing Systems, 2019

work page 2019

[9] [9]

Roy, and Gintare Karolina Dziugaite

Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. InAdvances in Neural Information Processing Systems, 2020

work page 2020

[10] [10]

Reasoning about generalization via conditional mutual information

Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. InConference on Learning Theory, 2020

work page 2020

[11] [11]

Gergely Neu, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. InProceedings of Machine Learning Research, 2021

work page 2021

[12] [12]

Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 2011

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 2011. 60

work page 2011

[13] [13]

Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012

work page 2012

[14] [14]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations, 2015. 61 Algorithm 1Proxy estimation for adaptive virtual perturbation diagnostics Require: SGD trajectory {Wt, Gt, Bt}T t=1, learning rates {ηt}T−1 t=1 , actual covariance rule Σt = Φ t(Ht−1), optional admissible reference ...

work page 2015