Density estimation for Hellinger via minimum-distance estimators: mixtures of Gaussians, log-concave, and more
Pith reviewed 2026-06-27 10:56 UTC · model grok-4.3
The pith
Minimum-distance estimators extend to Hellinger distance via reverse data processing inequalities, giving near-linear time algorithms for Gaussian mixture and log-concave density estimation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing the total-variation Yatracos class with a related concept class whose VC dimension is bounded, and invoking reverse data processing inequalities, the minimum-distance estimator yields an estimator whose Hellinger error is controlled by the same sample size and runtime that previously controlled total-variation error.
What carries the argument
The modified minimum-distance estimator that uses VC-dimension bounds on a related concept class together with reverse data processing inequalities to obtain Hellinger guarantees while preserving the original near-linear runtime.
If this is right
- Near-linear time density estimation in Hellinger distance for univariate Gaussian mixtures with arbitrary variances.
- Near-linear time density estimation in Hellinger distance for univariate mixtures of log-concave densities.
- Near-optimal sample complexity for both classes.
- Any future total-variation algorithm whose analysis rests only on a VC-dimension bound can be reused for Hellinger with only the addition of the reverse-inequality step.
Where Pith is reading between the lines
- The same recipe may apply to other distances that admit reverse data processing inequalities, such as certain f-divergences.
- The approach could be tested on multivariate log-concave mixtures once suitable VC bounds are derived.
- If the related concept class admits efficient empirical risk minimization, the overall procedure remains practical beyond the univariate setting.
Load-bearing premise
That a bound on the VC dimension of the related concept class, once fed through reverse data processing inequalities, is enough to control Hellinger error and to keep the algorithm near-linear.
What would settle it
A concrete mixture of two univariate Gaussians with arbitrary variances for which the VC dimension of the associated concept class grows faster than linear in the number of components, or for which the modified Acharya et al. procedure requires super-linear time.
read the original abstract
We study the task of density estimation, where we hope to accurately estimate a probability density from $n$ samples. A textbook method for density estimation in total variation distance is the minimum-distance estimator approach, where we conclude both the algorithm and the analysis merely from bounding the VC dimension of a particular concept class (the so-called Yatracos class). While this technique has originally yielded sharp guarantees primarily for total variation distance, in this work we extend the minimum-distance estimator approach for learning within Hellinger distance. Our main observation is that we may produce an analogous recipe for Hellinger (where we only require bounding the VC dimension of a related concept class) by drawing connections to recent results yielding reverse data processing inequalities. This recipe is flexible enough to accommodate fast algorithms originally designed for total variation distance; by modifying the approach of Acharya et al. (2017) we conclude the first near-linear time algorithm for learning classes including univariate mixtures of log-concave densities and mixtures of Gaussians (with arbitrary variances), with near-optimal sample complexity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the minimum-distance estimator technique from total variation (TV) distance—where guarantees follow from VC-dimension bounds on the Yatracos class—to Hellinger distance. The key step invokes recent reverse data-processing inequalities to obtain an analogous recipe that again requires only a VC-dimension bound on a related concept class. By adapting the algorithmic framework of Acharya et al. (2017), the authors claim the first near-linear-time algorithms for Hellinger density estimation of univariate mixtures of log-concave densities and of Gaussians with arbitrary variances, attaining near-optimal sample complexity.
Significance. If the central claims hold, the work supplies a modular template for converting existing TV-efficient estimators into Hellinger estimators while preserving both statistical optimality and computational efficiency. This is valuable for the listed nonparametric classes, where Hellinger distance is often the more natural loss and where near-linear runtime had previously been unavailable.
major comments (2)
- [Abstract / §3] Abstract and §3 (algorithmic modification): the claim that the modified Acharya et al. (2017) procedure retains near-linear runtime rests on the assertion that the Hellinger analogue of the Yatracos class admits the same efficient optimization or enumeration steps as the TV version. Because the original runtime exploits specific structural properties of the TV class, an explicit argument (or pseudocode) showing that the Hellinger version inherits an equivalent implementation is required; bounded VC dimension alone does not automatically guarantee this.
- [Theorem 1] Theorem 1 (or main sample-complexity statement): the paper must state the precise sample complexity obtained for the listed classes and verify that it matches the known Hellinger lower bounds up to logarithmic factors. Without an explicit comparison or derivation of the constant factors arising from the reverse DPI, the “near-optimal” claim cannot be assessed.
minor comments (2)
- [§2] Notation for the Hellinger Yatracos class should be introduced with an explicit definition parallel to the TV version, including the precise functional form used in the minimum-distance program.
- [§2.2] The dependence on the external reverse-DPI results should be stated as a black-box lemma with a clear citation and a one-sentence reminder of the exact inequality applied.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We appreciate the recognition of the work's potential as a modular template for Hellinger estimators. We address each major comment below and will revise the manuscript accordingly to address the points raised.
read point-by-point responses
-
Referee: [Abstract / §3] Abstract and §3 (algorithmic modification): the claim that the modified Acharya et al. (2017) procedure retains near-linear runtime rests on the assertion that the Hellinger analogue of the Yatracos class admits the same efficient optimization or enumeration steps as the TV version. Because the original runtime exploits specific structural properties of the TV class, an explicit argument (or pseudocode) showing that the Hellinger version inherits an equivalent implementation is required; bounded VC dimension alone does not automatically guarantee this.
Authors: We agree that the manuscript would benefit from an explicit argument. In the revision we will add a dedicated paragraph and pseudocode in §3 that directly compares the optimization oracle for the Hellinger concept class (induced by the reverse DPI) to the TV Yatracos class, showing that the same enumeration and linear-programming steps from Acharya et al. (2017) apply verbatim because the class retains the same combinatorial structure. revision: yes
-
Referee: [Theorem 1] Theorem 1 (or main sample-complexity statement): the paper must state the precise sample complexity obtained for the listed classes and verify that it matches the known Hellinger lower bounds up to logarithmic factors. Without an explicit comparison or derivation of the constant factors arising from the reverse DPI, the “near-optimal” claim cannot be assessed.
Authors: We will expand Theorem 1 to state the exact sample complexities (including all logarithmic factors) for univariate Gaussian mixtures and log-concave mixtures. We will also add a short derivation of the multiplicative constants contributed by the reverse DPI and a direct comparison paragraph showing that the resulting bounds match the known Hellinger lower bounds up to logarithmic factors. revision: yes
Circularity Check
No circularity; derivation relies on external VC bounds and reverse DPI results
full rationale
The paper extends the minimum-distance estimator technique from total variation (originally from Acharya et al. 2017) to Hellinger distance. The key step is observing that bounding the VC dimension of a related concept class, combined with reverse data processing inequalities from recent external results, yields the Hellinger guarantees while preserving the algorithmic structure. No claim reduces by construction to a quantity defined inside the paper, no fitted parameter is relabeled as a prediction, and no load-bearing step collapses to a self-citation chain. The runtime preservation for mixtures of Gaussians and log-concave densities follows from modifying the external procedure once the VC bound is established. This is a standard non-circular extension relying on independent prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reverse data processing inequalities connect Hellinger distance to the minimum-distance estimator framework for the relevant concept classes
Reference graph
Works this paper leans on
-
[1]
Spencer Compton and Gregory Valiant. Attainability of two-point testing rates for finite-sample location estimation.arXiv preprint arXiv:2502.05730,
-
[2]
Ratio covers of convex sets and optimal mixture density estimation.arXiv preprint arXiv:2602.16142,
Spencer Compton, Gábor Lugosi, Jaouad Mourtada, Jian Qian, and Nikita Zhivotovskiy. Ratio covers of convex sets and optimal mixture density estimation.arXiv preprint arXiv:2602.16142,
-
[3]
Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions.Advances in Neural Information Processing Systems, 28, 2015a. Ilias Diakonikolas, Daniel M Kane, and Vladimir Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In2015 IEEE 56th Annual Sy...
-
[4]
Ilias Diakonikolas, Daniel M Kane, and Vladimir Nikishkin. Near-optimal closeness testing of discrete histogram distributions.arXiv preprint arXiv:1703.01913, 2017a. Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Learning multivariate log-concave distri- butions. InConference on Learning Theory, pages 711–727. PMLR, 2017b. Ilias Diakonikolas, Da...
-
[5]
URL https://doi.org/10.1214/16-AOS1480
doi: 10.1214/16-AOS1480. URL https://doi.org/10.1214/16-AOS1480. Gil Kur, Yuval Dagan, and Alexander Rakhlin. Optimality of maximum likelihood for log-concave density estimation and bounded convex regression.arXiv preprint arXiv:1903.05315,
-
[6]
Yury Polyanskiy and Mark Sellke. Nonparametric mle for gaussian location mixtures: certified computation and generic behavior.arXiv preprint arXiv:2503.20193,
-
[7]
Yury Polyanskiy and Yihong Wu. Self-regularizing property of nonparametric maximum likelihood estimator in mixture models.arXiv preprint arXiv:2008.08244,
arXiv 2008
-
[8]
URLhttps://doi.org/10.1214/ 18-STS666
doi: 10.1214/18-STS666. URLhttps://doi.org/10.1214/ 18-STS666. Richard J Samworth. Nonparametric inference under shape constraints: past, present and future. arXiv preprint arXiv:2509.26040,
-
[9]
Optimal estimation of Gaussian mixtures via denoised method of moments.The Annals of Statistics, 48(4):1981 – 2007,
Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments.The Annals of Statistics, 48(4):1981 – 2007,
1981
-
[10]
URL https://doi.org/10.1214/19-AOS1873
doi: 10.1214/19-AOS1873. URL https://doi.org/10.1214/19-AOS1873. Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of con- vergence.Annals of Statistics, pages 1564–1599,
-
[11]
Discussion In this work, we developed fast, sample-efficient algorithms for density estimation in Hellinger
Appendix A. Discussion In this work, we developed fast, sample-efficient algorithms for density estimation in Hellinger. Our primary applications focused on mixtures of log-concave or Gaussian densities; classical total variation guarantees for minimum-distance estimation seem ripe for adoption with our recipe, and we believe this can achieve nearly-optim...
2017
-
[12]
Nothing in this presen- tation is notably different from the technique of Pensia et al
ProofFor a self-contained presentation, we will present a proof in this text. Nothing in this presen- tation is notably different from the technique of Pensia et al. (2023), and we will restrict to discrete distributions only for ease of notation. Also, for ease of notation, letN≜ 1 H2(p,q) . The proof technique will be to decompose the Hellinger distance...
2023
-
[13]
H2(p, q)≥ 1 2·log(4/H 2(p, q)) H2(p, q) All that remains is to show that for any suchI ∗, there exists anf τ with approximately the Hellinger distance of the interval’s contribution. Case 1:I ∗ = [1 + 1 2j+1 ,1 + 1 2j ).First, we upper bound the value of the sum fromI ∗: X x: p(x) q(x) ∈[1+ 1 2j+1 ,1+ 1 2j ) p p(x)− p q(x) 2 ≤ X x: p(x) q(x) ∈[1+ 1 2j+1 ,...
2005
-
[14]
+ ln 8 δ . a−b√a ≤r=⇒∆≤r √a=⇒∆ 2 −r 2a≤0 =⇒∆ 2 −r 2(∆ +b)≤0 =⇒∆ 2 −r 2∆−r 2b≤0 =⇒∆≤ r2 + √ r4 + 4r2b 2 =⇒∆≤r 2 +r √ b(root of the quadratic equation) We will continue using this shorthand notation ofa, b,∆, r. The proof will follow quickly via the same logic as Claim 2.22 of Compton and Valiant (2025): √a− √ b ≤min 2|a−b|√ b , p |a−b| (fora, b≥0; first pa...
2025
-
[15]
IfH 2(fθ, ˆfn)≥1/n, then Theorem 16 impliesH 2(fθ, ˆfn)≲d D H(fθ, ˆfn)
Our modified estimator will choose D=⌈log(4n)⌉and estimate: ˆfn ≜arg min fθ∈F dD H(fθ, fn)(21) The desired error guarantee now follows by the same proof strategy as Theorem 7: H2(f, ˆfn)≲H 2(f, fθ) + H2(fθ, ˆfn) 23 COMPTONLI Consider two cases. IfH 2(fθ, ˆfn)≥1/n, then Theorem 16 impliesH 2(fθ, ˆfn)≲d D H(fθ, ˆfn). Otherwise, we can use the upper bound of...
2009
-
[16]
PIECEWISE-LINEAR APPROXIMATION We now present the required piecewise-linear approximation result for log-concave densities
G.1.1. PIECEWISE-LINEAR APPROXIMATION We now present the required piecewise-linear approximation result for log-concave densities. This result follows from the proof of Theorem 12 of Diakonikolas et al. (2016): 28 DENSITY ESTIMATION FORHELLINGER VIA MINIMUM-DISTANCE ESTIMATORS Lemma 20 (Piecewise-linear approximation for log-concave; follows from Diakonik...
2016
-
[17]
Proof Without loss of generality, supposep(x)≥q(x)in the support
Then, there exists some intervalI ⊆[0,1]where sZ I p(x)− sZ I q(x) !2 ≳H 2(p, q) Note how here we are proving a reverse data processing inequality that crucially only loses a constant fraction of squared Hellinger distance, instead of alog(2/H 2(p, q))term. Proof Without loss of generality, supposep(x)≥q(x)in the support. (Ifp, qintersect, we can break th...
2017
-
[18]
Any degree-dpolynomial can be defined by its coefficients inR d+1
We will do so by approximately determining whether there is a polynomial withA 1 distance at mostτ, and employing a binary search over the correct value ofτ. Any degree-dpolynomial can be defined by its coefficients inR d+1. LetC τ ⊂R d+1 denote the set of such polynomials. These polynomials must: (i) be non-negative overJ, and (ii) must satisfy an upper ...
1989
-
[19]
special points
Note that the analogous result in Acharya et al. (2017) forA 1 distance only has additive error, and this is why ourH1 oracle also incurs multiplicative error. Additionally, since the runtime of this subroutine isO(s)it is not a dominating term in the final runtime. Theorem 26Fix an intervalJ= [a, b]⊆[−1,1]. Suppose there aressamples in this rangea≤ x1 ≤ ...
2017
-
[20]
We will chooseFas the class of3k-piecewise degree≲log(n)polynomial densities. By Corollary 12: H2(f, ˆfn)≲inf fθ∈F t d log(n)·H 2(f, fθ) log(2k/H2(f, fθ)) + klog 2(n)·(log(n) + log(2/δ)) n We use existing results that imply any Gaussian density is approximated withinεsquared Hellinger distance by a 3-piecewise degree≲log(1/ε)polynomial (Lemma 36 of Chan e...
2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.