Characterisations of Kullback--Leibler approximation by finite Gaussian mixtures

Hien Duy Nguyen

arxiv: 2604.10899 · v1 · submitted 2026-04-13 · 🧮 math.ST · stat.TH

Pith reviewed 2026-05-10 16:27 UTC · model grok-4.3

classification 🧮 math.ST stat.TH

keywords Kullback-Leibler divergence · Gaussian mixture models · finite mixtures · approximation theory · second moments · uniform integrability · log-moment classes

The pith

Finite second moments are necessary for any density to be approximable in Kullback-Leibler divergence by finite Gaussian mixtures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes necessary and sufficient conditions for a probability density to be approximable in Kullback-Leibler divergence by finite Gaussian mixture models. Necessity holds universally: any approximable density must have finite second moment. Sufficiency is shown by constructing sequences of finite mixtures whose likelihood ratios converge pointwise and whose log-ratios are uniformly integrable. This works for two classes of target densities: continuous strictly positive densities with finite log-moments, and countable-scale support-aware densities that may have regions of zero density. Counterexamples confirm that the two classes are distinct and that their union does not cover every possible density.

Core claim

For a density in the finite log-moment class of continuous strictly positive densities, or in the countable-scale support-aware class, approximability in Kullback-Leibler divergence by finite Gaussian mixtures is equivalent to having a finite second moment. The proof reduces sufficiency to the explicit construction of approximating mixtures whose likelihood ratios converge pointwise and whose finite log-ratios are uniformly integrable.
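
For orientation, the objects in the claim can be written out as follows. This is a sketch in assumed notation, not the paper's own: $p$ is the target density, $\phi(\cdot;\mu,\Sigma)$ a Gaussian density, $\mathcal{M}$ the set of all finite Gaussian mixtures with any number of components $K$, and the divergence is taken in the direction $\mathrm{KL}(p\,\|\,q)$, which is inferred from the referee's references to $\log(p/q)$ and $E_p$.

```latex
% Assumed notation for illustration; the paper's symbols may differ.
\[
  q(x) = \sum_{k=1}^{K} \pi_k \, \phi(x; \mu_k, \Sigma_k),
  \qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\]
\[
  \mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx ,
\]
\[
  p \ \text{is KL-approximable by finite GMMs}
  \iff
  \inf_{q \in \mathcal{M}} \mathrm{KL}(p \,\|\, q) = 0 .
\]
```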

What carries the argument

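An abstract two-part mechanism. Necessity: any approximable density must have a finite second moment. Sufficiency: approximability follows once approximating mixtures are constructed whose likelihood ratios converge pointwise and whose finite log-ratios are uniformly integrable.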
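
The sufficiency half is, in spirit, an application of the Vitali convergence theorem. The display below is a hedged sketch in assumed notation ($q_n$ a sequence of finite Gaussian mixtures, $E_p$ expectation under the target $p$); the paper's "finite log-ratios" may be defined with extra care on regions where $p$ vanishes, which this sketch does not attempt to reproduce.

```latex
% Sketch only: pointwise convergence plus uniform integrability gives
% convergence of the expectation (Vitali convergence theorem).
\[
  \frac{p(x)}{q_n(x)} \to 1 \ \text{ for } p\text{-a.e. } x
  \quad\text{and}\quad
  \lim_{M \to \infty} \sup_{n}
  E_p\!\Big[ \Big|\log\frac{p}{q_n}\Big| \,
  \mathbf{1}\Big\{ \Big|\log\frac{p}{q_n}\Big| > M \Big\} \Big] = 0
\]
\[
  \Longrightarrow \quad
  \mathrm{KL}(p \,\|\, q_n) = E_p\!\Big[ \log\frac{p}{q_n} \Big]
  \;\longrightarrow\; 0 .
\]
```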

If this is right

Any density with infinite second moment cannot be approximated in Kullback-Leibler divergence by finite Gaussian mixtures.
Every density in the finite log-moment class of continuous strictly positive densities that also has a finite second moment admits finite Gaussian mixture approximations in Kullback-Leibler divergence.
Every countable-scale support-aware density with a finite second moment, including those with regions of zero density, also admits such approximations.
The finite log-moment class and the countable-scale class are incomparable, and densities exist outside their union.
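
The first consequence above can be illustrated numerically. The sketch below is not from the paper: it integrates p log(p/q) over [-T, T] for a fixed two-component mixture q and two example targets chosen here for illustration, a Student t density with 2 degrees of freedom (infinite second moment) and a Laplace density (finite second moment). The helper names `log_gmm` and `truncated_kl`, the grid step, and the mixture parameters are all assumptions.

```python
import numpy as np

def log_gmm(x, weights, means, sds):
    """Log-density of a univariate finite Gaussian mixture (log-sum-exp)."""
    x = np.asarray(x, dtype=float)[:, None]
    w, m, s = (np.asarray(a, dtype=float) for a in (weights, means, sds))
    log_comp = (np.log(w) - np.log(s) - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * ((x - m) / s) ** 2)
    top = log_comp.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(log_comp - top).sum(axis=1, keepdims=True)))[:, 0]

def truncated_kl(log_p, log_q, T, dx=0.01):
    """Trapezoid-rule value of the integral of p*log(p/q) over [-T, T]."""
    n = int(round(2.0 * T / dx)) + 1
    x = np.linspace(-T, T, n)
    lp, lq = log_p(x), log_q(x)
    f = np.exp(lp) * (lp - lq)
    h = x[1] - x[0]
    return float(h * (f.sum() - 0.5 * (f[0] + f[-1])))

def log_t2(x):
    # Student t with 2 degrees of freedom: p(x) = (2 + x^2)^(-3/2).
    return -1.5 * np.log(2.0 + x ** 2)

def log_laplace(x):
    # Laplace with unit scale: p(x) = exp(-|x|) / 2.
    return -np.log(2.0) - np.abs(x)

def log_q(x):
    # Illustrative mixture; any finite GMM has the same quadratic tail behaviour.
    return log_gmm(x, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])

# Heavy-tailed target: the truncated integral keeps growing with T.
heavy = [truncated_kl(log_t2, log_q, T) for T in (10.0, 100.0, 1000.0)]
# Light-tailed target: the truncated integral has already stabilised.
light = [truncated_kl(log_laplace, log_q, T) for T in (20.0, 40.0)]
print("t(2) target:   ", heavy)
print("Laplace target:", light)
```

For the heavy-tailed target the truncated integral grows by roughly a constant for each tenfold increase in T, which is the logarithmic divergence one expects when p(x) x² decays like 1/|x|. For the Laplace target the two truncations agree closely.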

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The necessity of finite second moments may constrain which empirical distributions can be faithfully represented by finite Gaussian mixtures under information-theoretic criteria.
Similar ratio-convergence and integrability conditions could be used to characterize approximability for other mixture families or other divergences.
The separation between the two density classes suggests that moment-based and support-based restrictions play independent roles in mixture approximation.

Load-bearing premise

The target density belongs to either the finite log-moment class of continuous strictly positive densities or the countable-scale support-aware class that allows zero-density regions.

What would settle it

A concrete density possessing a finite second moment yet lying outside both the finite log-moment class and the countable-scale support-aware class for which no sequence of finite Gaussian mixtures converges in Kullback-Leibler divergence.

Original abstract

We study the Kullback--Leibler (KL) divergence approximation theory of Gaussian mixture models (GMMs) by isolating an abstract mechanism behind several necessary-and-sufficient statements. The necessity direction is universal: if a density is approximable in KL divergence by finite GMMs, then it must have finite second moment. The sufficient direction is reduced to the construction of approximating GMMs whose likelihood ratios converge pointwise and whose finite log-ratios form a uniformly integrable family. We verify this mechanism on a finite log-moment class of continuous strictly positive target densities, from which bounded, $\mathcal L^p$ $(p>1)$, and Orlicz-dominated subfamilies follow immediately. We also show that a countable-scale support-aware target density class, which allows zero density regions, satisfies the same equivalence. Finally, we give counterexamples showing that the countable-scale class strictly extends the fixed-scale class, that the finite log-moment and countable-scale support-aware classes do not contain one another, and that their union is not exhaustive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Finite second moments are necessary for any KL approximation by finite GMMs, with sufficiency reduced to pointwise convergence plus uniform integrability and verified on two scoped classes plus counterexamples.

The main thing to know is that this paper isolates a clean necessity result: if a density can be approximated in KL by finite Gaussian mixtures, it must have finite second moment, since heavier tails drive the divergence to infinity. Sufficiency boils down to constructing GMM sequences where likelihood ratios converge pointwise and the log-ratios stay uniformly integrable. The paper checks this explicitly for two classes of target densities and supplies counterexamples that separate the classes and show their union is incomplete.

Referee Report

0 major / 3 minor

Summary. The manuscript isolates a general mechanism for characterizing when a target density p can be approximated in Kullback-Leibler divergence by finite Gaussian mixture models q. Necessity is shown to be universal: approximability implies that p must have finite second moment. Sufficiency is reduced to the existence of a sequence of finite GMMs whose likelihood ratios converge pointwise to 1 and whose log-ratios are uniformly integrable; this mechanism is verified explicitly for the finite log-moment class of continuous strictly positive densities (yielding corollaries for bounded, L^p, and Orlicz-dominated subfamilies) and for a countable-scale support-aware class that permits regions of zero density. Counterexamples establish that the two classes are incomparable and that their union is not exhaustive.
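
The sufficiency direction can be illustrated with a toy approximating sequence. The construction below is not the paper's: it relies on the standard fact that a unit-scale Laplace density is a scale mixture of centred Gaussians whose variance is exponentially distributed with mean 2, and it replaces that mixing distribution by K equally weighted quantile midpoints. The function names, grid, and truncation range are assumptions made for this sketch.

```python
import numpy as np

def log_scale_mixture(x, variances):
    """Log-density of an equal-weight mixture of N(0, v_k) components."""
    x = np.asarray(x, dtype=float)[:, None]
    v = np.asarray(variances, dtype=float)
    log_comp = -np.log(v.size) - 0.5 * np.log(2.0 * np.pi * v) - 0.5 * x ** 2 / v
    top = log_comp.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(log_comp - top).sum(axis=1, keepdims=True)))[:, 0]

def quantile_variances(K):
    """Quantile midpoints of an exponential distribution with mean 2."""
    u = (np.arange(K) + 0.5) / K
    return -2.0 * np.log1p(-u)

def kl_laplace_vs_mixture(K, T=40.0, dx=0.005):
    """Trapezoid-rule KL(p || q_K) on [-T, T] for a unit-scale Laplace p."""
    n = int(round(2.0 * T / dx)) + 1
    x = np.linspace(-T, T, n)
    lp = -np.log(2.0) - np.abs(x)               # log of exp(-|x|) / 2
    lq = log_scale_mixture(x, quantile_variances(K))
    f = np.exp(lp) * (lp - lq)
    h = x[1] - x[0]
    return float(h * (f.sum() - 0.5 * (f[0] + f[-1])))

kl_by_K = {K: kl_laplace_vs_mixture(K) for K in (1, 4, 16, 64)}
for K, value in kl_by_K.items():
    print(f"K = {K:3d}   KL(p || q_K) ~ {value:.5f}")
```

The Laplace target is continuous, strictly positive, and has finite second moment, so it is the kind of density for which the summary says approximating mixtures exist. Whether this particular quantile-based sequence matches the paper's construction is not claimed.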

Significance. If the derivations hold, the paper makes a useful contribution to approximation theory for divergences by providing a clean necessary-and-sufficient framework scoped to explicit, verifiable classes of densities. The universal necessity result, the reduction to pointwise convergence plus uniform integrability, and the sharpness counterexamples are all strengths that clarify the boundary of GMM approximability in KL divergence. This has direct relevance for statistical modeling and theoretical machine learning.

minor comments (3)

[Introduction / Main results] The statement of the general mechanism (pointwise convergence of likelihood ratios together with uniform integrability of the finite log-ratios) would benefit from being isolated as a formal lemma or proposition early in the paper, with explicit references to the relevant sections where it is applied to each class.
[Counterexamples section] In the counterexample constructions showing that the finite log-moment and countable-scale classes are incomparable, the explicit verification that the constructed densities lie outside the other class could be expanded with one or two additional lines of calculation to make the incomparability immediate.
[Section 3 / Section 4] Notation for the finite log-moment class and the countable-scale support-aware class should be introduced with a single displayed definition each, rather than being described only in prose, to improve readability for readers who wish to apply the results.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment, clear summary of our contributions, and recommendation for minor revision. We appreciate the recognition of the universal necessity result, the reduction to pointwise convergence plus uniform integrability, and the sharpness of the counterexamples.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The necessity claim follows from standard tail integrability properties of the KL divergence against any finite GMM: for a finite GMM q, -log q(x) grows quadratically in |x|, so the integral of p log(p/q) can be finite only if E_p[X²] < ∞. The sufficiency direction is established by explicit construction of GMM sequences q_n that achieve pointwise convergence of likelihood ratios and uniform integrability of the finite log-ratios, verified directly on the stated density classes without relying on fitted parameters or self-referential prior results. Counterexamples are supplied to delimit the classes. All steps rely on external analytic facts about KL divergence, Gaussian densities, and uniform integrability rather than any of the enumerated circular patterns.
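
The tail argument for necessity can be made concrete in one dimension. The following is a sketch under assumptions not taken from the paper: a univariate target $p$, a mixture $q$ with weights $\pi_k$, means $\mu_k$ and standard deviations $\sigma_k$, $\sigma_{\max} = \max_k \sigma_k$, and an auxiliary Gaussian $g = N(0, 4\sigma_{\max}^2)$ introduced only so that no infinite quantities are subtracted. The paper's own proof may proceed differently.

```latex
% Step 1: a centred Gaussian envelope for q, using (x-\mu_k)^2 \ge x^2/2 - \mu_k^2.
\[
  q(x) = \sum_{k=1}^{K} \frac{\pi_k}{\sqrt{2\pi}\,\sigma_k}
         \exp\!\Big( -\frac{(x-\mu_k)^2}{2\sigma_k^2} \Big)
  \;\le\; C \exp\!\Big( -\frac{x^2}{4\sigma_{\max}^2} \Big),
  \qquad
  C = \sum_{k=1}^{K} \frac{\pi_k}{\sqrt{2\pi}\,\sigma_k}
      \exp\!\Big( \frac{\mu_k^2}{2\sigma_k^2} \Big).
\]
% Step 2: compare with g(x) = (8\pi\sigma_{\max}^2)^{-1/2} \exp(-x^2 / (8\sigma_{\max}^2)).
\[
  \log\frac{g(x)}{q(x)}
  \;\ge\; \frac{x^2}{8\sigma_{\max}^2} - \log C - \tfrac12 \log\big( 8\pi\sigma_{\max}^2 \big).
\]
% Step 3: split the divergence and use KL(p || g) \ge 0.
\[
  \mathrm{KL}(p \,\|\, q)
  = \mathrm{KL}(p \,\|\, g) + \int p(x) \log\frac{g(x)}{q(x)} \, dx
  \;\ge\; \frac{E_p[X^2]}{8\sigma_{\max}^2} - \log C - \tfrac12 \log\big( 8\pi\sigma_{\max}^2 \big).
\]
% Hence KL(p || q) < \infty for some finite GMM q forces E_p[X^2] < \infty.
```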

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard mathematical properties of KL divergence, Gaussian densities, and integrability without introducing free parameters, new entities, or ad-hoc axioms beyond domain assumptions in probability theory.

axioms (2)

domain assumption: KL divergence is well-defined and finite for the densities under consideration.
Invoked throughout the necessity and sufficiency arguments as a basic property of the divergence.
standard math: Finite Gaussian mixtures are valid probability densities with the standard form.
Used as the approximating class in all statements.

pith-pipeline@v0.9.0 · 5468 in / 1384 out tokens · 36717 ms · 2026-05-10T16:27:07.259084+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] Bacharoglou, A. G. (2010). Approximation of probability distributions by convex mixtures of Gaussian measures. Proceedings of the American Mathematical Society, 138(7), 2619--2628
[2] Billingsley, P. (1995). Probability and Measure. Wiley, New York
[3] Chen, J. (2023). Statistical Inference Under Mixture Models. Springer, Singapore
[4] Cheney, E. W. and Light, W. A. (2009). A Course in Approximation Theory. American Mathematical Society, Providence, RI
[5] Dupuis, P. and Ellis, R. S. (1997). A Weak Convergence Approach to the Theory of Large Deviations. Wiley, New York
[6] Gelbaum, B. R. and Olmsted, J. M. H. (1964). Counterexamples in Analysis. Holden-Day, San Francisco
[7] Ghosh, S., Guntuboyina, A., Mukherjee, S., and Tran, H.-S. (2026). Gaussian mixtures and non-parametric likelihoods through the lens of statistical mechanics. arXiv preprint arXiv:2603.23196
[8] Kim, A. K. H. and Guntuboyina, A. (2022). Minimax bounds for estimating multivariate Gaussian location mixtures. Electronic Journal of Statistics, 16, 1461--1484
[9] Kruijer, W., Rousseau, J., and van der Vaart, A. (2010). Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4, 1225--1257
[10] Li, J. Q. and Barron, A. R. (2000). Mixture density estimation. In S. A. Solla, T. K. Leen, and K.-R. Müller (eds.), Advances in Neural Information Processing Systems 12, pp. 279--285. MIT Press, Cambridge, MA
[11] Maugis, C. and Michel, B. (2011). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15, 41--68
[12] Maugis-Rabusseau, C. and Michel, B. (2013). Adaptive density estimation for clustering with Gaussian mixtures. ESAIM: Probability and Statistics, 17, 698--724
[13] McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. Wiley, New York
[14] Nguyen, H. D. and McLachlan, G. J. (2019). On approximations via convolution-defined mixture models. Communications in Statistics---Theory and Methods, 48(16), 3945--3955
[15] Nguyen, T. T., Nguyen, H. D., Chamroukhi, F., and McLachlan, G. J. (2020). Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7, 1750861
[16] Nguyen, T. T., Chamroukhi, F., Nguyen, H. D., and McLachlan, G. J. (2022). Approximation of probability density functions via location-scale finite mixtures in Lebesgue spaces. Communications in Statistics---Theory and Methods, 52, 5048--5059 (2023)
[17] Nguyen, H. D., Chamroukhi, F., and Forbes, F. (2019). Approximation results regarding the multiple-output Gaussian gated mixture of linear experts model. Neurocomputing, 366, 208--214
[18] Nguyen, H. D., Nguyen, T. T., Chamroukhi, F., and McLachlan, G. J. (2021). Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models. Journal of Statistical Distributions and Applications, 8, 13
[19] Norets, A. and Pelenis, J. (2012). Bayesian modeling of joint and conditional distributions. Journal of Econometrics, 168(2), 332--346
[20] Norets, A. and Pelenis, J. (2014). Posterior consistency in conditional density estimation by covariate dependent mixtures. Econometric Theory, 30(3), 606--646
[21] Park, J. and Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246--257
[22] Park, J. and Sandberg, I. W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5, 305--316
[23] Rakhlin, A., Panchenko, D., and Mukherjee, S. (2005). Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9, 220--229
[24] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1--25
[25] Wiener, N. (1932). Tauberian theorems. Annals of Mathematics, 33(1), 1--100
[26] Zeevi, A. J. and Meir, R. (1997). Density estimation through convex combinations of densities: Approximation and estimation bounds. Neural Networks, 10(1), 99--109
