Optimal Stability of KL Divergence under Gaussian Perturbations
Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3
The pith
If two Gaussians N1 and N2 differ by at most ε in KL, then for any distribution P with finite second moment, KL(P || N2) ≥ KL(P || N1) − O(√ε).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Let P be any distribution with finite second moment and let N1, N2 be multivariate Gaussians. If KL(P || N1) is large and KL(N1 || N2) ≤ ε, then KL(P || N2) ≥ KL(P || N1) − O(√ε). The paper also proves that this √ε rate cannot be improved in general, even when P itself is Gaussian.
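As a quick numerical sanity check on the claim (not the paper's proof), one can take P itself to be Gaussian so that every KL term has the standard closed form; the construction below is an illustrative mean shift of N1, calibrated so that KL(N1 || N2) equals ε exactly, and the specific means and dimension are arbitrary choices:

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL(N(mu0, cov0) || N(mu1, cov1)) for multivariate Gaussians."""
    d = len(mu0)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - d + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

d = 3
I = np.eye(d)
mu_p = np.full(d, 5.0)            # P = N(mu_p, I), far from N1 so KL(P||N1) is large
mu1 = np.zeros(d)                 # N1 = N(0, I)
eps = 1e-3
mu2 = mu1 + np.sqrt(2 * eps / d)  # mean shift chosen so that KL(N1||N2) = eps

kl_n1_n2 = gaussian_kl(mu1, I, mu2, I)
kl_p_n1 = gaussian_kl(mu_p, I, mu1, I)
kl_p_n2 = gaussian_kl(mu_p, I, mu2, I)
drop = kl_p_n1 - kl_p_n2  # the theorem says this is at most O(sqrt(eps))
```

In this closed-form case the drop is of order √ε, with a constant that grows with how far P sits from N1, consistent with the claimed one-sided rate.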
What carries the argument
Relaxed triangle inequality for KL divergence under Gaussian perturbations, derived from moment bounds and the specific geometry of Gaussians.
If this is right
- KL-based out-of-distribution scoring becomes rigorous for flow-based generative models that are not purely Gaussian.
- KL reasoning can be applied directly in reinforcement learning and deep learning pipelines without forcing Gaussian assumptions on the target distribution.
- The tightness of the √ε rate limits how large a Gaussian perturbation can be tolerated before the stability guarantee breaks.
- Classical Gaussian-only stability results are now special cases of a more general statement.
Where Pith is reading between the lines
- The same moment condition might yield analogous stability results for other f-divergences.
- Numerical checks on simple non-Gaussian mixtures could verify whether the O(√ε) constant is sharp in practice.
- In variational inference the bound suggests that small perturbations to an approximate posterior Gaussian incur only modest extra KL cost when the true posterior has finite variance.
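The mixture check suggested above can be sketched in one dimension (illustrative choices throughout: the bimodal mixture, the deliberately offset N1, and the sample size are all assumptions, not the paper's experiment); N2 is a pure mean shift of N1 toward P with KL(N1 || N2) = ε, so that KL(P || N2) genuinely drops:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# P: an equal-weight two-component Gaussian mixture (non-Gaussian, Var P = 5).
def log_p(x):
    return np.logaddexp(norm_logpdf(x, -2.0, 1.0),
                        norm_logpdf(x, 2.0, 1.0)) + np.log(0.5)

def sample_p(n):
    centers = rng.choice([-2.0, 2.0], size=n)
    return rng.normal(centers, 1.0)

def mc_kl(x, mu, sigma):
    """Monte Carlo estimate of KL(P || N(mu, sigma^2)) from samples x ~ P."""
    return np.mean(log_p(x) - norm_logpdf(x, mu, sigma))

eps = 1e-3
mu1, sigma = 1.0, np.sqrt(5.0)        # N1 offset from P's mean (which is 0)
mu2 = mu1 - sigma * np.sqrt(2 * eps)  # shift toward P; KL(N1||N2) = eps exactly

x = sample_p(200_000)
kl1, kl2 = mc_kl(x, mu1, sigma), mc_kl(x, mu2, sigma)
drop = kl1 - kl2  # bounded by O(sqrt(eps)) under the paper's claim
```

For this symmetric P, the population drop works out to (μ1² − μ2²)/(2σ²) ≈ 0.019, comfortably inside the √ε ≈ 0.032 envelope; probing whether the constant is sharp would require less benign perturbations (e.g. covariance changes).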
Load-bearing premise
P must have finite second moment; without it the stated bound can fail.
What would settle it
A concrete distribution P with infinite second moment together with Gaussians N1, N2 where KL(P||N2) drops by more than any fixed multiple of √ε below KL(P||N1) for arbitrarily small ε.
Original abstract
We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $\epsilon$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\epsilon})$. Moreover, we prove that this $\sqrt{\epsilon}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish a sharp stability result for KL divergence under Gaussian perturbations: for any distribution P with finite second moment and multivariate Gaussians N1, N2, if KL(P || N1) is large and KL(N1 || N2) ≤ ε then KL(P || N2) ≥ KL(P || N1) − O(√ε). The √ε rate is shown to be optimal even when restricting to the Gaussian family. The result removes the all-Gaussian assumption from prior relaxed triangle inequalities and is applied to justify KL-based OOD detection in flow-based generative models.
Significance. If the central derivation holds, the result is significant because it supplies the first explicit, optimal one-sided stability bound that applies to non-Gaussian P under only a second-moment condition. The explicit Gaussian constructions establishing optimality and the careful treatment of KL asymmetry are strengths that directly support applications in deep generative models and reinforcement learning where strong Gaussian assumptions are unrealistic.
minor comments (4)
- Abstract and §1: the qualifier “KL(P||N1) is large” is used without a precise threshold; the main theorem statement should clarify whether the O(√ε) bound holds for all finite-second-moment P or only when KL(P||N1) exceeds a quantity depending on ε and the second moments of P.
- §3 (proof of the lower bound): the argument that the difference of Gaussian log-densities is a quadratic polynomial controlled by KL(N1||N2) ≤ ε is sketched but the explicit constant in the O(√ε) term is not displayed; adding the dependence on dimension d and the second-moment bound of P would make the result more usable.
- §4 (optimality construction): the mean-shift example is convincing, yet it would help to state the exact scaling of the mean displacement with √ε and to confirm that the resulting KL(P||N2) drop is exactly Θ(√ε) rather than o(√ε) for the chosen sequence of ε.
- Notation: the symbols n1 and n2 for the Gaussian densities are introduced without a global definition; a short notation table or consistent use of N1(x), N2(x) would improve readability.
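The exact scaling requested in the third comment can be written out in the simplest identity-covariance case (an illustrative reconstruction, not necessarily the paper's exact construction):

```latex
% KL between equal-covariance Gaussians reduces to a squared mean distance:
\[
  \mathrm{KL}\!\left(\mathcal{N}(\mu_1, I)\,\middle\|\,\mathcal{N}(\mu_2, I)\right)
  = \tfrac{1}{2}\,\lVert \mu_1 - \mu_2 \rVert^2,
\]
% so KL(N_1 || N_2) = eps forces a mean displacement of exactly sqrt(2 eps):
\[
  \lVert \mu_1 - \mu_2 \rVert = \sqrt{2\epsilon}.
\]
% For P = N(mu, I), the drop is linear in that displacement (Cauchy--Schwarz):
\[
  \mathrm{KL}(P \,\|\, \mathcal{N}_1) - \mathrm{KL}(P \,\|\, \mathcal{N}_2)
  = \Bigl\langle \mu_2 - \mu_1,\; \mu - \tfrac{\mu_1 + \mu_2}{2} \Bigr\rangle
  \le \sqrt{2\epsilon}\,\lVert \mu - \mu_1 \rVert,
\]
% and choosing mu_2 - mu_1 parallel to mu - mu_1 attains a Theta(sqrt(eps)) drop,
% so the rate cannot be improved even within the Gaussian family.
```

This also makes explicit that the constant in front of √ε scales with ‖μ − μ1‖, i.e., with how far P sits from N1, which connects to the first comment's request for a precise threshold on KL(P || N1).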
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our manuscript and for recommending minor revision. The referee's summary accurately captures the central result: a sharp one-sided stability bound on KL(P || N2) in terms of KL(P || N1) when N1 and N2 are Gaussian and P has finite second moment, with the √ε rate optimal even within the Gaussian family. The report raises only minor comments; in the revision we will state the precise threshold on KL(P || N1) under which the bound holds, display the explicit constants in the O(√ε) term (including their dependence on the dimension and on the second moment of P), confirm that the mean-shift construction achieves a drop of exactly Θ(√ε), and unify the notation for the Gaussian densities.
Circularity Check
No significant circularity; self-contained mathematical derivation
full rationale
The paper derives the claimed stability bound KL(P||N2) ≥ KL(P||N1) - O(√ε) directly from the explicit form of Gaussian log-densities and the finite-second-moment assumption on P, which guarantees that the expectation of the quadratic difference exists and can be controlled by the parameter distance induced by KL(N1||N2) ≤ ε. The one-sided lower bound is obtained by discarding the positive part of the difference; optimality follows from explicit mean-shift constructions within the Gaussian family that achieve a matching Θ(√ε) drop. No fitted parameters are renamed as predictions, no self-citations serve as load-bearing premises for the core inequality, and the result does not reduce to a redefinition or ansatz imported from prior author work. The derivation is therefore independent of its own outputs.
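The rationale's central step can be written as a one-line identity (a sketch consistent with the description above; n1, n2 denote the Gaussian densities):

```latex
% The differential-entropy term of P cancels in the difference of the two KLs:
\[
  \mathrm{KL}(P \,\|\, \mathcal{N}_2) - \mathrm{KL}(P \,\|\, \mathcal{N}_1)
  = \mathbb{E}_{X \sim P}\!\left[ \log \frac{n_1(X)}{n_2(X)} \right].
\]
% Since log n_1 - log n_2 is a quadratic polynomial in x whose coefficients are
% controlled by the parameter distance induced by KL(N_1 || N_2) <= eps, the
% finite-second-moment assumption on P is exactly what makes this expectation
% well defined and controllable.
```

This is why the result needs no structure on P beyond moments: all distribution-specific entropy terms cancel, and only the expectation of a Gaussian log-density ratio remains.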
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: P has finite second moment.