Statistical estimation of the Kullback-Leibler divergence
Pith reviewed 2026-05-25 12:55 UTC · model grok-4.3
The pith
k-nearest neighbor statistics from independent samples yield asymptotically unbiased and L2-consistent estimates of Kullback-Leibler divergence under wide conditions on densities in R^d.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Wide conditions are provided to guarantee asymptotic unbiasedness and L2-consistency of the introduced estimates of the Kullback-Leibler divergence for probability measures in R^d having densities with respect to the Lebesgue measure. These estimates are constructed by means of two independent collections of i.i.d. observations and involve the specified k-nearest neighbor statistics. In particular, the established results are valid for estimates of the Kullback-Leibler divergence between any two Gaussian measures in R^d with nondegenerate covariance matrices. As a byproduct new statements are obtained concerning the Kozachenko-Leonenko estimators of the Shannon differential entropy.
What carries the argument
k-nearest neighbor statistics built from two independent collections of i.i.d. observations
If this is right
- The k-nearest neighbor estimates become asymptotically unbiased for the Kullback-Leibler divergence.
- The estimates converge in L2 to the true divergence value.
- The consistency statements apply to the divergence between any two nondegenerate Gaussian measures on R^d.
- New asymptotic unbiasedness and consistency results hold for the Kozachenko-Leonenko estimators of differential entropy.
Where Pith is reading between the lines
- The same nearest-neighbor construction may extend to estimation of other f-divergences provided the requisite density assumptions hold.
- Practical performance in moderate dimensions could be checked by direct Monte Carlo comparison against closed-form KL values for Gaussians.
- The independence requirement between the two samples could be relaxed in future work while preserving the consistency claims.
Load-bearing premise
The probability measures possess densities with respect to Lebesgue measure on R^d and the observations come from two independent i.i.d. collections.
What would settle it
A pair of densities on R^d together with explicit sequences of sample sizes for which the k-nearest neighbor KL estimator fails to converge to the true value in L2 or fails to be asymptotically unbiased.
read the original abstract
Wide conditions are provided to guarantee asymptotic unbiasedness and L^2-consistency of the introduced estimates of the Kullback-Leibler divergence for probability measures in R^d having densities w.r.t. the Lebesgue measure. These estimates are constructed by means of two independent collections of i.i.d. observations and involve the specified k-nearest neighbor statistics. In particular, the established results are valid for estimates of the Kullback-Leibler divergence between any two Gaussian measures in R^d with nondegenerate covariance matrices. As a byproduct we obtain new statements concerning the Kozachenko-Leonenko estimators of the Shannon differential entropy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs k-nearest-neighbor estimators of the Kullback-Leibler divergence between two probability measures on R^d that admit densities with respect to Lebesgue measure, using two independent i.i.d. samples. It claims to supply wide conditions guaranteeing asymptotic unbiasedness and L^2-consistency of these estimators; the results are asserted to cover any pair of non-degenerate Gaussians in arbitrary dimension. As a byproduct, new consistency statements are derived for the Kozachenko-Leonenko entropy estimator.
Significance. If the stated conditions are indeed broad and the proofs are correct, the work supplies rigorous justification for a class of computationally attractive nonparametric estimators that are already used in practice. The explicit inclusion of the Gaussian case in every dimension and the new entropy results constitute concrete, usable advances in the theory of nearest-neighbor divergence estimation.
minor comments (3)
- The abstract asserts the existence of 'wide conditions' without naming them or sketching the proof strategy; while the full manuscript presumably supplies both, a brief indication in the abstract would improve readability.
- Notation for the two sample sizes (n and m) and the neighbor order k should be introduced once at the beginning of Section 2 and used consistently thereafter.
- The statement that the results hold for 'any two Gaussian measures with nondegenerate covariance matrices' would benefit from an explicit reference to the relevant theorem number.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the paper, the recognition of its significance for nearest-neighbor divergence estimation, and the recommendation of minor revision. No major comments appear in the report.
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper establishes asymptotic unbiasedness and L^2-consistency for k-NN KL divergence estimators directly from standard assumptions (i.i.d. samples, densities w.r.t. Lebesgue measure on R^d) via probabilistic analysis, including explicit coverage of nondegenerate Gaussians. No parameters are fitted to data and then relabeled as predictions, no self-citations bear the central load, and no ansatz or uniqueness claim reduces the result to its inputs by construction. The byproduct consistency statements for Kozachenko-Leonenko entropy estimators follow from the same direct arguments without circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Probability measures on R^d possess densities with respect to Lebesgue measure
- domain assumption Two independent collections of i.i.d. observations are available
Reference graph
Works this paper leans on
-
[1]
Alonso-Ruiz, P., Spodarev, E. (2016). Entropy-based inhomogeneity detection in fiber materials . Methodol. Comput. Appl. Probab. Published online: 27 November 2017, doi.org/10.1007/s11009-017- 9603-2
-
[2]
Berrett, T.B., Samworth R.J. and Yuan M. (2019). Efficient multivariate entropy estimation via k-nearest neighbour distances. Ann. of Statist . 47, 288–318
work page 2019
-
[3]
Biau G. and Devroye L. (2015). Lectures on the Nearest Neighbor Method . Springer, Cham
work page 2015
-
[4]
Billingsley, P. (1999). Convergence of Probability Measures , 2nd edn. John Wiley, New York
work page 1999
-
[5]
(2006) Pattern Recognition and Machine Learning
Bishop, C.M. (2006) Pattern Recognition and Machine Learning . Springer, Singapore
work page 2006
-
[6]
Borkar, V.S. (1995). Probability Theory. An Advanced Course . Springer, New York
work page 1995
-
[7]
Bulinski, A., Dimitrov, D. (2019). Statistical estimation of the Shannon entropy. Acta Mathematica Sinica. English series . 35, 17–46
work page 2019
-
[8]
Bulinski, A. and Kozhevin, A. (2018). Statistical estimation of conditional Shannon entropy. ESAIM: Probability and Statistics . Published online: November 28, 1–35
work page 2018
-
[9]
Charzy´nska, A., Gambin, A. (2016). Improvement of of the k-NN entropy estimator with applica- tions in systems biology. Entropy, 18(1), 13
work page 2016
-
[10]
Coelho F., Braga A.P., Verleysen M. (2016). A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems, International Journal of Computational Intelligence Systems , 9, 726–733
work page 2016
-
[11]
Cui, S., Luo, C. (2016). Feature-based non-parametric estimation of Kullback–L eibler divergence for SAR image change detection. Remote Sensing Letters , 11, 1102–1111
work page 2016
-
[12]
Delattre, S., Fournier, N. (2017). On the Kozachenko-Leonenko entropy estimator. Journal of Statistical Planning and Inference , DOI: http://dx.doi.org/10.1016/j.jspi.2017.01.004 (accepted manuscript)
-
[13]
Deledalle, C-A. (2017). Estimation of Kullback-Leibler losses for noisy recovery pr oblems within the exponential family. Electronic Journal of Statistics 11, 3141–3164
work page 2017
-
[14]
Evans, D. (2008). A computationally efficient estimator for mutual informatio n. Proc. Royal Soc. A , 464, 1203–1215. 32
work page 2008
-
[15]
Evans, D., Jones, A.J. and Schmidt, W.M. (2002). Asymptotic moments of near-neighbour dis- tance distributions. Proc. Royal Soc. A , 458, 2839–2849
work page 2002
-
[16]
Gao, S., Steeg, G.V. and Galstyan A. (2015). Proc. of 31st Conference on Uncertainty in Arti- ficial Intelligence, Amsterdam, Netherlands, July 12 - 16, 2 015, 278–287
work page 2015
-
[17]
Granero-Belinch´on, C., Roux, S.G. and Garnier, N.B. (2018). Kullback-Leibler divergence measure of intermittency: Application to turbulence. Physical Review E . 97, 013107, 1–10
work page 2018
-
[18]
Kallenberg, O. (1997). Foundations of Modern Probability . Springer, New York
work page 1997
-
[19]
Kozachenko, L.F., Leonenko, N.N. (1987). Sample estimate of the entropy of a random vector. Problems of Information Transmission , 23, 9–16
work page 1987
-
[20]
Kraskov, A., St ¨ogbauer, H., Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E, 69:066138
work page 2004
-
[21]
Leonenko, N.N., Pronzato, L., Savani V. (2008). A class of R´ enyi information estimations for multidimensional densities. The Annals of Statistics , 36, 2153–2182. Correction: The Annals of Statis- tics (2010). 38, 3837-3838
work page 2008
-
[22]
Li, J., Cheng, K., W ang, S., Morstatter, F., Trevino, R.P., T ang, J. and Liu, H. (2017). Feature Selection: A Data Perspective. ACM Comput. Surv. . 50, Article 94 (December 2017), 1–45
work page 2017
-
[23]
Ma, T., W ang, F., Cheng, J., Yu, Y. and Chen, X. (2016). A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detectionin s ensor networks. Sensors 16, 1701, doi:10.3390/s1610170, 1-23
-
[24]
Moon, K.R., Sricharan, K., Greenewald, K. and Hero, A.O.III (2014). Ensemble estimation of information divergence. Entropy, 20, 560; doi:10.3390/e20080560, 1–39
-
[25]
Moulin, P. and Veeravalli, V.V. (2019). Statistical Inference for Engineers and Data Scientists . Cambridge University Press
work page 2019
-
[26]
Noh, Y.K., Sugiyama, M., Liu, S., du Plessis, M.C., Park, F.C . and Lee, D.D. (2018). Bias reduction and metric learning for nearest-neighbor estimation of K ullback-Leibler divergence. Neural Computation. 30, 1930–1960
work page 2018
-
[27]
P´al, D., P ´oczos, B., Szepesv ´ari C. (2010). Estimation of R´ enyi entropy and mutual information based on generalized nearest-neighbor graphs. In: NIPS’10 Proceedings of the 23rd International Con- ference on Neural Information Processing Systems, Vancouv er, British Columbia, Canada (December 06 - 09, 2010) , 1849–1857
work page 2010
-
[28]
Pardo, L. (2006). Statistical Inference Based on Divergence Measures. C hapman and Hall/CRC, Boca Raton
work page 2006
-
[29]
Peng, H., Long, F., Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27, 1226–1238
work page 2005
-
[30]
(2013) Limit theory for point processes in manifolds
Penrose M.D., Yukich J.E. (2013) Limit theory for point processes in manifolds. Annals of Applied Probability, 6, 2160–2211
work page 2013
-
[31]
P´ erez-Cruz, F. (2009). Estimation of information theoretic measures for continu ous random variables. Advances in Neural Information Processing Systems , 1257–1264
work page 2009
-
[32]
P´oczos, B, Xiong, L., Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions. Proceedings of the Twenty-Seventh Conference on Uncertain ty in Artificial Intelligence, Barcelona, Spain July 14 - 17, 2011 . AUAI Press, Arlington, 599–608
work page 2011
-
[33]
Sasaki, H., Noh, Y-K., Niu, G. and Sugiyama, M. (2016). Direct density derivative estimation. Neural Computation. 28, 1101–1140. 33
work page 2016
-
[34]
(2016) F-difergence inequalities
Sason I., Verd ´u S. (2016) F-difergence inequalities. IEEE Transactions on Information Theory . 62, 5973 - 6006
work page 2016
-
[35]
Shannon, C.E. (1948). A mathematical theory of communication. Bell Systems Technical Journal , 27, July and October, 379–423 and 623–656
work page 1948
-
[36]
Shiryaev, A.N. (2016). Probability - 1 . 3rd edn. Springer, New York
work page 2016
-
[37]
Singh, S., P ´oszoc, B. (2016). Analysis of k-nearest neighbor distances with application to entropy estimation, arXiv preptint , arXiv: 1603.08578v2
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[38]
Sricharan, K., Wei, D., Hero, A.O. (2013). Ensemble estimators for multivariate entropy estima- tion. IEEE Transactions on Information Theory , 59, 4374–4388
work page 2013
-
[39]
Stowell, D., Plumbley, M.D. (2009). Fast multidimensional entropy estimation by k-d partitioning. IEEE Signal Processing Letters , 16, NO. 6, JUNE (2009), preprint
work page 2009
-
[40]
Tsybakov A.B., V an der Meulen, E. C. (1996). Root- n consistent estimators of entropy for densities with unbounded support. Scand. J. Stat. 23, 75–83
work page 1996
-
[41]
Vergara J.R., Est ´ evez P.A. (2014). A review of feature selection methods based on mutual inf or- mation. Neural Comput. and Applic. 24, 175–186
work page 2014
-
[42]
W ang, Q., Kulkarni, S.R., Verd ´u, S. (2009). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory 55, 2392–2405
work page 2009
-
[43]
Yeh Yeh, J. (2014). Real Analysis: Theory of Measure and Integration . 3rd edn. World Scientific, Singapore
work page 2014
-
[44]
Yu, X-P, Chen, S-X. and Peng, M-L. (2017). Application of partial least squares algorithm based on Kullback - Leibler divergence in intrusion detection. In: Cai N. (Ed .) Proc. of the Int. conference Computer Science and Technology (CST2016), Shenzhen, Chin a, 8 10 January 2016 , World Scientific, Singapore, 256–263
work page 2017
-
[45]
Zhou, R., Cai, R. and Tong, G. (2013). Applications of entropy in finance: a review. Entropy. 15, 4909–4931. 34
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.