Estimating location parameters in entangled single-sample distributions

Ankit Pensia; Po-Ling Loh; Varun Jog

arxiv: 1907.03087 · v1 · pith:6MY2HQ7Inew · submitted 2019-07-06 · 🧮 math.ST · cs.IT · cs.LG · math.IT · stat.ML · stat.TH

Pith reviewed 2026-05-25 01:56 UTC · model grok-4.3

classification 🧮 math.ST · cs.IT · cs.LG · math.IT · stat.ML · stat.TH

keywords mean estimation · heterogeneous distributions · unimodal symmetric distributions · adaptive estimation · empirical processes · mixture models · linear regression

The pith

A hybrid estimator adapts to heterogeneity to estimate the common mean from non-identical symmetric unimodal distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies estimating a shared mean from independent samples drawn from symmetric unimodal distributions that may differ in their scales or shapes. It introduces an estimator blending modal interval, shorth, and median methods that automatically adjusts to the amount of heterogeneity present. This estimator achieves near-optimal performance both when all samples are identically distributed and when only a log n / n fraction of them are low-noise. The approach relies on new empirical process results for non-identical data and extends to multivariate mixtures and linear regression.

Core claim

We propose an estimator that adapts to the level of heterogeneity in the data, achieving near-optimality in both the i.i.d. setting and some heterogeneous settings, where the fraction of low-noise points is as small as log n / n. Our estimator is a hybrid of the modal interval, shorth, and median estimators from classical statistics; however, the key technical contributions rely on novel empirical process theory results that we derive for independent but non-i.i.d. data. In the multivariate setting, we generalize our theory to mean estimation for mixtures of radially symmetric distributions, and derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points.

What carries the argument

A hybrid of the modal interval, shorth, and median estimators from classical statistics, supported by novel empirical process theory results for independent but non-i.i.d. data.
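
For orientation, the three classical estimators named here can be sketched as follows. This is a minimal illustration based on their textbook definitions, not the paper's hybrid rule: the function names, the radius `r`, and the window size `k` are assumptions introduced for this example, and this summary does not say how the paper combines or tunes them.

```python
import numpy as np


def median_estimate(x):
    """Sample median."""
    return float(np.median(x))


def shorth_estimate(x, k):
    """Midpoint of the shortest interval containing k of the points."""
    xs = np.sort(np.asarray(x, dtype=float))
    if not 1 <= k <= xs.size:
        raise ValueError("k must lie between 1 and len(x)")
    # Width of every window of k consecutive order statistics.
    widths = xs[k - 1:] - xs[: xs.size - k + 1]
    i = int(np.argmin(widths))  # first shortest window
    return float((xs[i] + xs[i + k - 1]) / 2.0)


def modal_interval_estimate(x, r):
    """Center of a length-2r interval covering the most points.

    Only intervals whose left endpoint is a data point are searched,
    which is enough to attain the maximum count.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    right = np.searchsorted(xs, xs + 2.0 * r, side="right")
    counts = right - np.arange(xs.size)  # points in [xs[i], xs[i] + 2r]
    i = int(np.argmax(counts))
    return float(xs[i] + r)


# Hypothetical sample: a tight cluster near 0 plus a few wide outliers.
sample = [-50.0, -20.0, -0.1, 0.0, 0.1, 30.0, 80.0]
print(median_estimate(sample))
print(shorth_estimate(sample, k=3))
print(modal_interval_estimate(sample, r=0.5))
```

The shorth and modal interval both look for where the data are densest, which is why a small group of low-noise points can dominate them, while the median uses every point equally.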

If this is right

The estimator is near-optimal in the i.i.d. setting.
It remains near-optimal when the fraction of low-noise points is as small as log n / n.
In the multivariate case it applies to mean estimation for mixtures of radially symmetric distributions.
Minimax lower bounds hold for any estimator agnostic to individual data-point scales.
The method extends to linear regression with computationally feasible polynomial-time versions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The non-i.i.d. empirical process results may apply to other estimation tasks with independent but heterogeneous samples.
Adaptation to heterogeneity could be explored for estimating other parameters such as variance in similar single-sample settings.
The lower bounds suggest that access to per-point scale information would be needed to improve rates beyond the agnostic case.
Polynomial-time versions make the approach suitable for testing on large heterogeneous datasets.

Load-bearing premise

Samples are drawn independently from symmetric, unimodal distributions that share a common mean.

What would settle it

An experiment showing the estimator fails to achieve the claimed near-optimal rate when the fraction of low-noise points drops below log n / n while distributions remain symmetric and unimodal.
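
A small simulation in this spirit might look like the sketch below. Everything specific in it is an assumption made for illustration: Gaussian noise, roughly 5 log n unit-variance points with the remaining points at standard deviation n, a window of k = 20, and a fixed seed. It uses the classical shorth and median as stand-ins rather than the paper's hybrid estimator, so it illustrates the regime and does not test the paper's rates.

```python
import numpy as np


def shorth_estimate(x, k):
    """Midpoint of the shortest interval containing k of the points."""
    xs = np.sort(np.asarray(x, dtype=float))
    widths = xs[k - 1:] - xs[: xs.size - k + 1]
    i = int(np.argmin(widths))
    return float((xs[i] + xs[i + k - 1]) / 2.0)


def simulate(n=2000, low_noise_factor=5.0, k=20, seed=0):
    """Draw one heterogeneous sample with common mean 0 and report errors.

    All parameter choices are hypothetical and only meant to mimic a
    setting where about log n of the n points are low-noise.
    """
    rng = np.random.default_rng(seed)
    m = int(low_noise_factor * np.log(n))  # number of low-noise points
    sigmas = np.full(n, float(n))          # high-noise scale (assumed)
    sigmas[:m] = 1.0                       # low-noise scale (assumed)
    x = rng.normal(loc=0.0, scale=sigmas)
    return {
        "low_noise_points": m,
        "median_error": abs(float(np.median(x))),
        "shorth_error": abs(shorth_estimate(x, k)),
    }


result = simulate()
print(result)
```

Repeating the run over many seeds while shrinking `low_noise_factor` would show where a density-seeking estimator stops locating the low-noise cluster in this toy setting.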

Figures

Figures reproduced from arXiv: 1907.03087 by Ankit Pensia, Po-Ling Loh, Varun Jog.

**Figure 2.** Plot comparing average error of various estimator [PITH_FULL_IMAGE:figures/full_fig_p029_2.png]

**Figure 3.** Plot comparing average error of various estimator [PITH_FULL_IMAGE:figures/full_fig_p030_3.png]

**Figure 4.** Plots comparing average error of various estimato [PITH_FULL_IMAGE:figures/full_fig_p031_4.png]

Original abstract

We consider the problem of estimating the common mean of independently sampled data, where samples are drawn in a possibly non-identical manner from symmetric, unimodal distributions with a common mean. This generalizes the setting of Gaussian mixture modeling, since the number of distinct mixture components may diverge with the number of observations. We propose an estimator that adapts to the level of heterogeneity in the data, achieving near-optimality in both the i.i.d. setting and some heterogeneous settings, where the fraction of ``low-noise'' points is as small as $\frac{\log n}{n}$. Our estimator is a hybrid of the modal interval, shorth, and median estimators from classical statistics; however, the key technical contributions rely on novel empirical process theory results that we derive for independent but non-i.i.d. data. In the multivariate setting, we generalize our theory to mean estimation for mixtures of radially symmetric distributions, and derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we describe an extension of our estimators applicable to linear regression. In the multivariate mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a hybrid estimator for the common mean of heterogeneous symmetric unimodal samples that adapts to the heterogeneity and reaches near-optimal rates down to a log n / n fraction of low-noise points, using new non-i.i.d. empirical process bounds.

The main point is that the authors build an estimator that adapts to the amount of heterogeneity and still comes close to the best possible rate even when only a log n / n fraction of the points are low-noise. This comes from combining the modal interval, shorth, and median in a particular way, together with new bounds on empirical processes for independent but non-identical samples that share a mean and are symmetric and unimodal. The same ideas extend to multivariate radial mixtures and to linear regression, with polynomial-time versions in those cases. They also prove minimax lower bounds that apply to any estimator blind to the individual scales. These pieces are the actual new material.

The setting generalizes Gaussian mixtures to a diverging number of components, which is a natural move for heterogeneous data. The proofs rest on the stated symmetry and unimodality, and the stress test found no internal gaps in the argument as described. One soft spot is that the abstract stays high-level on the derivations, so the precise way the hybrid avoids breakdown under growing heterogeneity needs to be verified against the full write-up, but nothing indicates a load-bearing flaw.

This paper is for people working on robust location estimation or non-i.i.d. empirical processes. A reader who cares about adaptive rates in mixture-like models would get concrete value from the adaptation threshold and the new process bounds. It deserves a serious referee because the claims are specific, the model is cleanly stated, and the technical steps are reproducible in principle.

Referee Report

0 major / 3 minor

Summary. The manuscript considers estimating the common mean of n independent samples drawn from symmetric unimodal distributions that may be non-identical (generalizing Gaussian mixtures to diverging components). It proposes a hybrid estimator combining the modal interval, shorth, and median, which adapts to the level of heterogeneity and is claimed to achieve near-optimal rates both in the i.i.d. case and in heterogeneous regimes where the fraction of low-noise observations can be as small as log n / n. The central technical contributions are new empirical process bounds derived for the independent non-i.i.d. setting; the work also provides multivariate extensions to radially symmetric mixtures, minimax lower bounds for scale-agnostic estimators, an extension to linear regression, and polynomial-time computational versions of the estimators.

Significance. If the stated rates and bounds hold, the paper makes a meaningful contribution to robust location estimation by handling extreme heterogeneity with a single adaptive procedure. The novel non-i.i.d. empirical process results and the minimax lower bounds for scale-agnostic estimators are technically useful; the computational feasibility in the multivariate and regression settings strengthens the practical relevance.

minor comments (3)

The abstract and introduction state the near-optimality claims at a high level; adding a brief outline of the key empirical-process argument (even without full proofs) would improve readability for readers outside the immediate subfield.
Notation for the hybrid estimator and the heterogeneity parameter should be introduced consistently before the main theorems to avoid forward references.
In the multivariate section, clarify whether the radial symmetry assumption is used only for the lower bounds or also for the upper bounds on the estimator.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The assessment correctly identifies the main contributions, including the adaptive hybrid estimator, non-i.i.d. empirical process bounds, multivariate and regression extensions, and minimax lower bounds. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation relies on novel bounds

The manuscript derives new empirical process bounds for independent non-i.i.d. symmetric unimodal distributions and combines them with classical modal/shorth/median estimators to obtain an adaptive estimator. The abstract explicitly states that the key technical contributions are these novel results rather than reductions of prior fitted quantities or self-citations. No equations or steps are described that equate a prediction to its own input by construction, import uniqueness from the authors' prior work, or smuggle an ansatz via citation. The adaptation to heterogeneity down to log n / n fraction is presented as following from the new bounds and separately derived minimax lower bounds. This matches the default expectation for a self-contained theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the modeling assumption of symmetry and unimodality with shared mean; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: data points are independently sampled from symmetric, unimodal distributions sharing a common mean.
This is the core setting stated in the abstract that enables the hybrid estimator and theory.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 1 internal anchor

[1] C. Abraham, G. Biau, and B. Cadre. On the asymptotic properties of a simple estimate of the mode. ESAIM: Probability and Statistics, 8:1–11, 2004.
[2] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458–469. Springer, 2005.
[3] P. K. Agarwal and M. Sharir. Efficient algorithms for geometric optimization. ACM Computing Surveys, 30(4):412–458, 1998.
[4] D. F. Andrews, P. J. Bickel, F. R. Hampel, P. J. Huber, W. H. Rogers, and J. W. Tukey. Robust Estimates of Location: Survey and Advances. Princeton University Press, 1972.
[5] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pages 247–257, 2001.
[6] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 1st edition, 2016.
[7] H. Chernoff. Estimation of the mode. Annals of the Institute of Statistical Mathematics, 16(1):31–41, Dec 1964.
[8] F. Chierichetti, A. Dasgupta, R. Kumar, and S. Lattanzi. Learning entangled single-sample Gaussians. In Proceedings of the 25th Annual Symposium on Discrete Algorithms, SODA, pages 511–522, 2014.
[9] S. Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, pages 634–644. IEEE, 1999.
[10] S. Dasgupta and S. Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, pages 2555–2563, 2014.
[11] M. Dundar, B. Krishnapuram, J. Bi, and R. B. Rao. Learning classifiers when the training data is not iid. In IJCAI, pages 756–761, 2007.
[12] D. Eppstein and J. Erickson. Iterated nearest neighbors and finding minimal polytopes. Discrete & Computational Geometry, 11(3):321–350, 1994.
[13] S. R. Flaxman, D. B. Neill, and A. J. Smola. Gaussian processes for independence tests with non-iid data in causal inference. ACM Transactions on TIST, 7(2):22, 2016.
[14] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[15] H. Jiang. Uniform convergence rates for kernel density estimation. In Proceedings of the 34th International Conference on Machine Learning, pages 1694–1703, 2017.
[16] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In International Conference on Computational Learning Theory, pages 444–457. Springer, 2005.
[17] J. Kim and D. Pollard. Cube root asymptotics. Ann. Statist., 18(1):191–219, 03 1990.
[18] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674, Oct 2016.
[19] D.-T. Lee and F. P. Preparata. Computational geometry—a survey. IEEE Transactions on Computers, (12):1072–1101, 1984.
[20] O. V. Lepskii. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability & Its Applications, 35(3):454–466, 1991.
[21] Jiange Li, Arnaud Marsiglietti, and James Melbourne. Further investigations of Rényi entropy power inequalities and an entropic characterization of s-concave densities. arXiv preprint arXiv:1901.10616, 2019.
[22] B. G. Lindsay. Mixture models: Theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–163. JSTOR, 1995.
[23] R. Y. Liu. Bootstrap procedures under some non-iid models. The Annals of Statistics, 16(4):1696–1708, 1988.
[24] A. Pensia, V. Jog, and P. Loh. Mean estimation for entangled single-sample distributions. In 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019.
[25] A. Prasad, S. Balakrishnan, and P. Ravikumar. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927, 2019.
[26] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
[27] I. Steinwart and A. Christmann. Fast learning from non-iid observations. In Advances in NIPS, pages 1768–1776, 2009.
[28] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008.
[29] S. A. Van de Geer. Empirical Processes in M-Estimation, volume 6. Cambridge University Press, 2000.
[30] A. Van Der Vaart and J. A. Wellner. A note on bounds for VC dimensions. Institute of Mathematical Statistics Collections, 5:103, 2009.
[31] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
[32] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
[33] R. S. Wenocur and R. M. Dudley. Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33(3):313–318, 1981.
[34] T. Zhu, P. Xiong, G. Li, and W. Zhou. Correlated differential privacy: Hiding information in non-iid data set. IEEE Transactions on Information Forensics and Security, 10(2):229–242, 2015.

A Properties of symmetric distributions

In this Appendix, we derive the lemmas concerning properties of symmetric distributions (when d = 1) and radially symmetric...

[35] Note that $R(f_{x,r})$ can be written as the convolution of $P$ with the indicator function of $B_r$, both of which are unimodal and radially symmetric. The desired result then follows by Proposition 8 in Li et al. [21], which implies that $R(f_{x,r})$ is also unimodal and radially symmetric.
[36] This follows from the nonnegativity of the density.
[37] As $P$ is radially symmetric, let the density of $P$ at $x$ be given by $p(\|x\|)$. $R^*_r$ can be written as $R^*_r = C \int_0^r p(s) s^{d-1}\,ds$, where $C$ is a constant for a fixed dimension. Define $g(r) := \frac{R^*_r}{C r^d} = \frac{\int_0^r p(s) s^{d-1}\,ds}{r^d}$ for $r > 0$. Property (iii) is equivalent to showing that $\frac{d}{dr} g(r) < 0$. By unimodality of $p(\cdot)$, it follows that $g(r) > \frac{p(r)}{d}$. Differentiating $g(\cdot)$, ...
[38] Note that any $r_1$-packing of $B(0, r_2 - r_1)$ has the property that all balls in the packing must be entirely contained within the larger ball $B_{r_2}$. Furthermore, by Lemma 6(i) above, we know that $R(f_{x,r_1}) \ge R(f_{r_2,r_1})$ when $\|x\|_2 \le r_2$. Hence, by summing up the densities of all balls in the packing, we obtain $R(f_{0,r_2}) \ge P(B_{r_2 - r_1}, r_1)\, R(f_{r_2,r_1})$, from which the first...
[39] The proof of the first inequality is the same as the proof of the corresponding statement in Lemma 1. The second inequality follows by noting that $\mathbb{E}\|X_i - \mu\|_2^2 = \mathrm{Tr}(\Sigma_i) = d\sigma_i^2$. By Chebyshev's inequality, we have $\tilde{R}_i(f_{0, 2\sigma_i\sqrt{d}}) = P(\|X_i - \mu\|_2 \le 2\sqrt{d}\,\sigma_i) \ge \frac{3}{4}$ for each $i$. Thus, $B_{2\sigma_{(2k)}\sqrt{d}}$ covers at least $\frac{3}{4}$ of the mass of at least $2k$ distributions, imp...
[40] The lower bound follows by noting that the density at 0 is $\frac{1}{\sqrt{2\pi}\,\sigma}$. The upper bound follows by noting that the density at $x = |\sigma|$ is within a constant factor of the density at 0.
[41] The lower bound follows by noting that the density at $x = 0$ is $p(0) = \sum_{i=1}^n \frac{1}{\sqrt{2\pi}\,c\,i\,n} = \Theta\left(\frac{\log n}{cn}\right)$. The upper bound follows by noting that the density at $x = 1$ is $p(1) = \sum_{i=1}^n \frac{e^{-\frac{1}{i^2 c^2}}}{\sqrt{2\pi}\,c\,i\,n} \ge p(0) - O\left(\frac{1}{n}\right)$.
[42] For $\alpha \ge 1$, the upper bound follows from the fact that at least $c \log n$ distributions have small variance 1. Thus the interval $[-1, 1]$ contains more than 0.6 probability of at least $c \log n$ distributions. The lower bound follows by noting that the density at 0 is $\frac{c \log n}{n} \frac{1}{\sqrt{2\pi}} + \frac{n - c \log n}{n} \frac{1}{n^{\alpha}} = \Theta\left(\frac{\log n}{n}\right)$. For $\alpha < 1$, the density at 0 is $\frac{c \log n}{n} \frac{1}{\sqrt{2\pi}}$ ...
[43] …5VnR∗r log n + nR∗r ≤ nR∗r (144√…
[44] …5V log n / (nR∗r) + 1) ≤ nR∗r (144√…
[45] …5V log n / (1300V log n) + 1) < 6nR∗r. (Footnote 2: Note that the definition of $Z$ in Theorem 15 has a factor of $1/\sqrt{n}$ as opposed to the factor of $1/n$ here.) Thus, $\frac{n t R^*_r}{2v} > \frac{t}{12}$, so $\log\left(1 + 2\log\left(1 + \frac{n t R^*_r}{2v}\right)\right) \ge \log\left(1 + 2\log\left(1 + \frac{t}{12}\right)\right) \ge \frac{t}{50}$, (28) using the fact that $t \le 1$. Now suppose $n R^*_r \ge C_t \frac{V}{2} \log n$ for the constant $C_t = \left(\frac{144}{t}\right)^2$. Note that for $t \le$ ...
[46] …5tnR∗r = $\frac{144\sqrt{0.5 V \log n}}{t\sqrt{n R^*_r}} \le \frac{144\sqrt{0.5 V \log n}}{t\sqrt{0.5\, C_t V \log n}} = \frac{144}{t\sqrt{C_t}} < 1$. Now we have all the ingredients required for the application of Theorem 12.9: $P\{Z \ge t R^*_r\} \le P\{Z \ge \mathbb{E}Z + 0.5\, t R^*_r\} \le \exp\left(-\frac{n t R^*_r}{4} \log\left(1 + 2\log\left(1 + \frac{n t R^*_r}{2v}\right)\right)\right) \le \exp\left(-\frac{1}{200} n t^2 R^*_r\right)$, where the last inequality follows by inequality (28). An identical argument c...
[47] Analogously to Proposition 1, we have $r_{C \log n} = \Theta\left(\frac{C \sigma \log n}{n}\right)$. Inequality (30) then gives the result.
[48] The bound of $\tilde{O}(n)$ follows by inequality (30) and noting that $r_{C \log n} = O(1)$ for a fixed $C$ and sufficiently small $c > 0$. We now focus on how to obtain the tighter bound of $O(n^{\epsilon})$ for an $\epsilon > 0$, using inequality (5). Let $\tilde{R}_i(f)$ be the expectation of $f$ under $P_i$, i.e., $\tilde{R}_i(f) = \mathbb{E} f(X_i)$. Fix an $\epsilon > 0$. Let $r' = n^{\epsilon}$ and $r = 1$. Then it suffices to show that $R^*_r - R(f_{r',r}$...
[49] For $\alpha < 1$, let $r' = \Theta(n^{\alpha})$. Then it is easy to see that $R(f_{r',r}) \le \frac{R^*_r}{2}$. This follows by observing that the density of a Gaussian distribution decreases by more than half at a distance of $\sigma$ from the mean. For $\alpha \ge 1$, let $r' = 10$. Then $R^*_r \ge 0.5\, \frac{C \log n}{n}$, as a Gaussian distribution contains about 0.68 mass within 1 standard deviation of the mean. Moreover, ...
[50] Moreover, we have the following straightforward relations:
[51] For every constant $C'$, there exists another constant $C > 0$ such that $R^*_J + C\sqrt{\frac{R^*_J}{n}} \ge R^*_1 + C'\sqrt{\frac{R^*_1}{n}}$. Define the following random variables: $Z_1 = \sup_{f \in K} R_n(f)$, $Z_2 = \sup_{f \in J} R_n(f)$. These relations suffice for showing that $Z_1 < Z_2$ with constant probability. To this end, we would show that with constant probability both (1) $Z_1 = R^*_1 + O(\sqrt{R^*_1}$...
[52] $P\left(|\hat{\mu}_{M,r}| \ge \frac{b-a}{2}\right) \ge c > 0$
[53] $P\left(|\hat{\mu}_{M,r}| \ge \frac{b-a}{2} \,\middle|\, Z \ge k\right) \ge c > 0$. Lemmas 14, 15, and 16 give us the required lower bound on the probability of error. Let $\hat{\mu}_{M,1,J} := \arg\max_{f \in J} R_n(f)$. Clearly, we can write $P\left\{|\hat{\mu}_{M,r}| \ge \frac{n^{\alpha}}{2}\right\} = P\left\{Z_1 < Z_2,\ |\hat{\mu}_{M,1,J}| \ge \frac{n^{\alpha}}{2}\right\} = \sum_{S \subset [n]} P(E_S)\, P\left(Z_1 \le Z_2,\ |\hat{\mu}_{M,1,J}| \ge \frac{n^{\alpha}}{2} \,\middle|\, E_S\right) \ge \sum_{S \subset [n] : |S| \le nP(A)} P(E_S)\, P\big(Z_1 \le nR^* + C\sqrt{nR^*},\ Z_2 \ge nR^* + C\sqrt{nR^*},$...
[54] As in the proof of Proposition 1, we have $r_k = \Theta\left(\frac{\sigma k}{n}\right)$ for small $k$.
[55] By Lemma 1(i), we have $r_{2\sqrt{n}\log n} \le 2\sigma_{(4\sqrt{n}\log n)} = O(\sqrt{n}\log n)$.
[56] Note that for any fixed $k$, the value of $r_k$ for Example 3 is smaller than the value of $r_k$ for Example 1 with $\sigma = n^{\alpha}$. Thus, we have $r_{2\sqrt{n}\log n} = O\left(\frac{n^{\alpha}\sqrt{n}\log n}{n}\right) = O\left(n^{\alpha - 0.5}\log n\right)$. E.3 Proof of Lemma 8. We first show (i). Note that by Lemma 19, we know that for a fixed $i$, we have $0 \in [\min(S_{k,i}), \max(S_{k,i})]$ with probability at least $1 - 2\exp(-k^2/n)$. Taki...
[57] $\hat{\mu}_{S,k_2} \in [\min(S_{k_1}), \max(S_{k_1})]$. Then the output is the shortest gap estimator itself, so $|\hat{\mu}_{k_1,k_2}| = |\hat{\mu}_{S,k_2}|$.
[58] $\hat{\mu}_{S,k_2} \notin [\min(S_{k_1}), \max(S_{k_1})]$. We will use the following lemma, a slight generalization of Lemma 4.1 from Chierichetti et al. [8]. The lemma states that for sufficiently large values of $k$, the $k$-median contains the true mean, $\mu^* = 0$, with high probability. Lemma 19. Let $S_k$ be the output of the $k$-median algorithm. Then with probability at least $1 - 2\exp$...
[59] For each $f \in F$, we have $\frac{R^*_J}{2} \le R(f) \le R^*_J$. 2. $|F| \le \frac{2}{R^*_J}$. 3. $F$ covers $J$ in the sense that $\forall f \in J,\ \exists f_1, f_2 \in F : f(x) \le f_1(x) + f_2(x)$. It follows that if any interval in $J$ contains at least $k$ points, then at least one interval in $F$ contains at least $\frac{k}{2}$ points. We construct $F$ of cardinality $|F| = \left\lceil \frac{1}{R^*_J} \right\rceil \le \frac{2}{R^*_J}$, as follows: To create the first interval ($i = 1$...
[60] and the last inequality follows by the assumption $p = \Omega\left(\frac{\log n}{n}\right)$. Combining the inequalities, we conclude that $\min_{\hat{\mu}} \max_{\{P_i\} \subseteq \mathcal{P}(s_1, s_2, p)} \mathbb{E}[\|\hat{\mu} - \mu\|_2] \ge s\left(\min_{\hat{\mu}} \max_{\mu} P^{\mathrm{Bin}}_{\mu}(\|\hat{\mu} - \mu\|_2 \ge s) - 2\exp(-c' \log n)\right)$. Thus, it suffices to find $s$ such that the expression $\min_{\hat{\mu}} \max_{\mu} P^{\mathrm{Bin}}_{\mu}(\|\hat{\mu} - \mu\|_2 \ge s)$ can be lower-bounded by a constant. For part (i), using stan...
[61] and the formula (39), we have $A'_z = \frac{(p')^2}{1 - p'} \left(\frac{\sigma_1}{\sqrt{2\pi}\,\sigma_2^2}\right)^d \int \exp\left(-\frac{\|x - \mu_2\|_2^2}{\sigma_2^2} + \frac{\|x - \mu_2\|_2^2}{\sigma_1^2} - \frac{\|x - \mu_1\|_2^2}{2\sigma_1^2}\right) dx = \frac{(p')^2}{1 - p'} \left(\frac{\sigma_1}{\sqrt{2}\,\sigma_2^2 \sqrt{\frac{1}{\sigma_2^2} - \frac{1}{2\sigma_1^2}}}\right)^d \exp\left(\frac{1}{4\left(\frac{1}{\sigma_2^2} - \frac{1}{2\sigma_1^2}\right)} \left\|\frac{2\mu_2}{\sigma_2^2} - \frac{2\mu_2}{\sigma_1^2} + \frac{\mu_1}{\sigma_1^2}\right\|_2^2 - \frac{\mu_2^T \mu_2}{\sigma_2^2} + \frac{\mu_2^T \mu_2}{\sigma_1^2} - \frac{\mu_1^T \mu_1}{2\sigma_1^2}\right) = \frac{(p')^2}{1 - p'}$ ...
[62] is therefore maximized when $\beta = \beta^*$. On the other hand, we can also argue that the maximizer is unique. Indeed, suppose $\beta \in \mathbb{R}^d$ were such that $\beta \ne \beta^*$. The set $S := \left\{\{x_i\}_{i=1}^n \subseteq (\mathbb{R}^d)^n : x_i^T(\beta - \hat{\beta}) = 0\ \forall i\right\}$ has Lebesgue measure 0. We can write $\mathbb{E}\left[\sum_{i=1}^n \mathbb{E}\left[1\{|y_i - x_i^T \beta| \le r\} \mid \{x_i\}_{i=1}^n\right]\right] = \int_{\{x_i\} \in S} \mathbb{E}\left[1\{|y_i - x_i^T \beta| \le r\} \mid \{x_i\}_{i=1}^n\right] dP(\{x_i\}) +$ ...
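
The two cases quoted in entries [57] and [58] suggest a combination rule that can be sketched in code. In the sketch below, the k1-median set is taken to be the k1 sample points closest to the sample median, and the shortest gap estimator is taken to be the midpoint of the shortest interval containing k2 points; both readings are assumptions, since the excerpts do not define them. The excerpts also do not say what is returned when the shortest gap estimate falls outside the k1-median interval, so clipping to the nearest endpoint is a placeholder chosen here, not the paper's stated rule.

```python
import numpy as np


def k_median_interval(x, k):
    """Range spanned by the k points closest to the sample median (assumed definition)."""
    xs = np.asarray(x, dtype=float)
    med = np.median(xs)
    order = np.argsort(np.abs(xs - med), kind="stable")
    closest = xs[order[:k]]
    return float(closest.min()), float(closest.max())


def shortest_gap_estimate(x, k):
    """Midpoint of the shortest interval containing k points (assumed definition)."""
    xs = np.sort(np.asarray(x, dtype=float))
    widths = xs[k - 1:] - xs[: xs.size - k + 1]
    i = int(np.argmin(widths))
    return float((xs[i] + xs[i + k - 1]) / 2.0)


def hybrid_estimate(x, k1, k2):
    """Return the shortest gap estimate if it lies in the k1-median interval.

    The fallback (clipping to the interval) is a hypothetical placeholder.
    """
    lo, hi = k_median_interval(x, k1)
    mu = shortest_gap_estimate(x, k2)
    if lo <= mu <= hi:
        return mu
    return float(np.clip(mu, lo, hi))


# Case of entry [57]: the densest window sits inside the k1-median interval.
inside = [-50.0, -20.0, -0.1, 0.0, 0.1, 30.0, 80.0]
print(hybrid_estimate(inside, k1=5, k2=3))

# Case of entry [58]: the densest window is a far-away clump, so the fallback applies.
outside = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0, 100.1, 100.2]
print(hybrid_estimate(outside, k1=4, k2=3))
```

The design intuition, as far as the excerpts indicate, is that the median-based interval acts as a coarse safeguard while the shortest gap estimate supplies the fine localization when the two agree.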

[1] [1]

Abraham, G

C. Abraham, G. Biau, and B. Cadre. On the asymptotic propert ies of a simple estimate of the mode. ESAIM: Probability and Statistics , 8:1–11, 2004

work page 2004

[2] [2]

Achlioptas and F

D. Achlioptas and F. McSherry. On spectral learning of mi xtures of distributions. In Interna- tional Conference on Computational Learning Theory , pages 458–469. Springer, 2005

work page 2005

[3] [3]

P. K. Agarwal and M. Sharir. Eﬃcient algorithms for geome tric optimization. ACM Computing Surveys, 30(4):412–458, 1998

work page 1998

[4] [4]

D. F. Andrews, P. J. Bickel, F. R. Hampel, P. J. Huber, W. H. R ogers, and J. W. Tukey. Robust Estimates of Location: Survey and Advances . Princeton University Press, 1972

work page 1972

[5] [5]

Arora and R

S. Arora and R. Kannan. Learning mixtures of arbitrary Ga ussians. In Proceedings of the 33rd annual ACM Symposium on Theory of Computing , pages 247–257, 2001

work page 2001

[6] [6]

Boucheron, G

S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 1 edition, 4 2016

work page 2016

[7] [7]

H. Chernoﬀ. Estimation of the mode. Annals of the Institute of Statistical Mathematics , 16(1):31–41, dec 1964

work page 1964

[8] [8]

Chierichetti, A

F. Chierichetti, A. Dasgupta, R. Kumar, and S. Lattanzi. Learning entangled single-sample Gaussians. In Proceedings of the 25th Annual Symposium on Discrete Algori thms, SODA, pages 511–522, 2014

work page 2014

[9] [9]

Dasgupta

S. Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, pages 634–644. IEEE, 1999

work page 1999

[10] [10]

Dasgupta and S

S. Dasgupta and S. Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems , pages 2555–2563, 2014

work page 2014

[11] [11]

Dundar, B

M. Dundar, B. Krishnapuram, J. Bi, and R. B. Rao. Learning cl assiﬁers when the training data is not iid. In IJCAI, pages 756–761, 2007

work page 2007

[12] [12]

Eppstein and J

D. Eppstein and J. Erickson. Iterated nearest neighbor s and ﬁnding minimal polytopes. Discrete & Computational Geometry , 11(3):321–350, 1994

work page 1994

[13] [13]

S. R. Flaxman, D. B. Neill, and A. J. Smola. Gaussian proce sses for independence tests with non-iid data in causal inference. ACM Transactions on TIST , 7(2):22, 2016

work page 2016

[14] [14]

P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics , 35(1):73–101, 1964

work page 1964

[15] [15]

H. Jiang. Uniform convergence rates for kernel density estimation. In Proceedings of the 34th International Conference on Machine Learning , pages 1694–1703, 2017

work page 2017

[16] [16]

Kannan, H

R. Kannan, H. Salmasian, and S. Vempala. The spectral me thod for general mixture models. In International Conference on Computational Learning Theor y, pages 444–457. Springer, 2005

work page 2005

[17] [17]

Kim and D

J. Kim and D. Pollard. Cube root asymptotics. Ann. Statist. , 18(1):191–219, 03 1990. 32

work page 1990

[18] [18]

K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation o f mean and covariance. In 57th Annual Symposium on Foundations of Computer Science (FOCS) , pages 665–674, Oct 2016

work page 2016

[19] [19]

Lee and F

D.-T. Lee and F. P. Preparata. Computational geometry— a survey. IEEE Transactions on Computers, (12):1072–1101, 1984

work page 1984

[20] [20]

O. V. Lepskii. On a problem of adaptive estimation in Gau ssian white noise. Theory of Probability & Its Applications , 35(3):454–466, 1991

work page 1991

[21] [21]

F urther investigations of r \’enyi en- tropy power inequalities and an entropic characterization of s-concave densities

Jiange Li, Arnaud Marsiglietti, and James Melbourne. F urther investigations of r \’enyi en- tropy power inequalities and an entropic characterization of s-concave densities. arXiv preprint arXiv:1901.10616, 2019

work page arXiv 1901

[22] [22]

B. G. Lindsay. Mixture models: Theory, geometry and appl ications. In NSF-CBMS Regional Conference Series in Probability and Statistics , pages i–163. JSTOR, 1995

work page 1995

[23] [23]

R. Y. Liu. Bootstrap procedures under some non-iid model s. The Annals of Statistics , 16(4):1696–1708, 1988

work page 1988

[24] [24]

Pensia, V

A. Pensia, V. Jog, and P. Loh. Mean estimation for entang led single-sample distributions. In 2019 IEEE International Symposium on Information Theory (I SIT). IEEE, 2019

work page 2019

[25] [25]

A Unified Approach to Robust Mean Estimation

A. Prasad, S. Balakrishnan, and P. Ravikumar. A uniﬁed ap proach to robust mean estimation. arXiv preprint arXiv:1907.00927 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[26] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.

[27] I. Steinwart and A. Christmann. Fast learning from non-iid observations. In Advances in NIPS, pages 1768–1776, 2009.

[28] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008.

[29] S. A. Van de Geer. Empirical Processes in M-Estimation, volume 6. Cambridge University Press, 2000.

[30] A. Van Der Vaart and J. A. Wellner. A note on bounds for VC dimensions. Institute of Mathematical Statistics Collections, 5:103, 2009.

[31] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

[32] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[33] R. S. Wenocur and R. M. Dudley. Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33(3):313–318, 1981.

[34] T. Zhu, P. Xiong, G. Li, and W. Zhou. Correlated differential privacy: Hiding information in non-iid data set. IEEE Transactions on Information Forensics and Security, 10(2):229–242, 2015.

A Properties of symmetric distributions

In this Appendix, we derive the lemmas concerning properties of symmetric distributions (when $d = 1$) and radially symmetric...

Note that $R(f_{x,r})$ can be written as the convolution of $P$ with the indicator function of $B_r$, both of which are unimodal and radially symmetric. The desired result then follows by Proposition 8 in Li et al. [21], which implies that $R(f_{x,r})$ is also unimodal and radially symmetric.
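As a small numerical illustration of this convolution property (a sketch only, restricted to $d = 1$ and assuming $P$ is a centered Gaussian; the function names are hypothetical and not from the paper), the interval mass $R(f_{x,r}) = P([x - r, x + r])$ can be tabulated over a grid of centers and checked for symmetry and monotone decay in $|x|$:

```python
import math

def gaussian_cdf(x, sigma=1.0):
    """CDF of N(0, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def interval_mass(x, r, sigma=1.0):
    """1-D analogue of R(f_{x,r}): mass that N(0, sigma^2) puts on [x - r, x + r]."""
    return gaussian_cdf(x + r, sigma) - gaussian_cdf(x - r, sigma)

r = 0.5
grid = [i * 0.1 for i in range(0, 41)]  # centers 0.0, 0.1, ..., 4.0
masses = [interval_mass(x, r) for x in grid]

# Symmetry in x: the mass at center x equals the mass at center -x.
is_symmetric = all(
    abs(interval_mass(x, r) - interval_mass(-x, r)) < 1e-12 for x in grid
)
# Unimodality: the mass does not increase as the center moves away from 0.
is_nonincreasing = all(a >= b for a, b in zip(masses, masses[1:]))
```

This checks only a single Gaussian in one dimension; the general radially symmetric case is what the cited proposition covers.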

This follows from the nonnegativity of the density.

As $P$ is radially symmetric, let the density of $P$ at $x$ be given by $p(\|x\|)$. $R^*_r$ can be written as $R^*_r = C\int_0^r p(s)\,s^{d-1}\,ds$, where $C$ is a constant for a fixed dimension. Define $g(r) := \frac{R^*_r}{Cr^d} = \frac{\int_0^r p(s)\,s^{d-1}\,ds}{r^d}$ for $r > 0$. Property (iii) is equivalent to showing that $\frac{d}{dr}g(r) < 0$. By unimodality of $p(\cdot)$, it follows that $g(r) > \frac{p(r)}{d}$. Differentiating $g(\cdot)$, ...

Note that any $r_1$-packing of $B(0, r_2 - r_1)$ has the property that all balls in the packing must be entirely contained within the larger ball $B_{r_2}$. Furthermore, by Lemma 6(i) above, we know that $R(f_{x,r_1}) \ge R(f_{r_2,r_1})$ when $\|x\|_2 \le r_2$. Hence, by summing up the densities of all balls in the packing, we obtain $R(f_{0,r_2}) \ge P(B_{r_2 - r_1}, r_1)\,R(f_{r_2,r_1})$, from which the first...

The proof of the first inequality is the same as the proof of the corresponding statement in Lemma 1. The second inequality follows by noting that $\mathbb{E}\|X_i - \mu\|_2^2 = \mathrm{Tr}(\Sigma_i) = d\sigma_i^2$. By Chebyshev's inequality, we have $\tilde{R}_i(f_{0,\,2\sigma_i\sqrt{d}}) = P(\|X_i - \mu\|_2 \le 2\sqrt{d}\,\sigma_i) \ge \frac{3}{4}$, for each $i$. Thus, $B_{2\sigma_{(2k)}\sqrt{d}}$ covers at least $\frac{3}{4}$ of the mass of at least $2k$ distributions, imp...

The lower bound follows by noting that the density at $0$ is $\frac{1}{\sqrt{2\pi}\,\sigma}$. The upper bound follows by noting that the density at $x = |\sigma|$ is within a constant factor of the density at $0$.
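One way to read these two density bounds (an interpretation for illustration, not the paper's argument verbatim) is that, for any radius $r \le \sigma$, the Gaussian mass of $[-r, r]$ is sandwiched between $2r$ times the density at $\sigma$ and $2r$ times the density at $0$, and the two densities differ by the constant factor $e^{-1/2}$. A minimal check, with hypothetical helper names and an arbitrary choice of $\sigma$:

```python
import math

def gaussian_pdf(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def gaussian_mass(r, sigma):
    """Mass that N(0, sigma^2) assigns to the interval [-r, r]."""
    return math.erf(r / (sigma * math.sqrt(2.0)))

sigma = 3.0  # arbitrary scale chosen for the illustration
density_ratio = gaussian_pdf(sigma, sigma) / gaussian_pdf(0.0, sigma)  # exp(-1/2)

sandwich_holds = True
for r in (0.1, 1.0, 3.0):  # all radii satisfy r <= sigma
    lower = 2.0 * r * gaussian_pdf(sigma, sigma)
    upper = 2.0 * r * gaussian_pdf(0.0, sigma)
    sandwich_holds = sandwich_holds and (lower <= gaussian_mass(r, sigma) <= upper)
```

Because the density is within a constant factor of its peak on $[-\sigma, \sigma]$, the radius needed to capture a small mass scales linearly in that mass times $\sigma$.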

The lower bound follows by noting that the density at $x = 0$ is $p(0) = \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,c\,i\,n} = \Theta\left(\frac{\log n}{cn}\right)$. The upper bound follows by noting that the density at $x = 1$ is $p(1) = \sum_{i=1}^{n} \frac{e^{-\frac{1}{i^2c^2}}}{\sqrt{2\pi}\,c\,i\,n} \ge p(0) - O\left(\frac{1}{n}\right)$.
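The two orders of magnitude can be checked numerically. The sketch below assumes an equal-weight mixture of $n$ centered Gaussians with standard deviations $\sigma_i = c\,i$ and uses the standard Gaussian density (exponent $-x^2/(2\sigma_i^2)$); the function name and the values of $n$ and $c$ are illustrative choices, not taken from the paper:

```python
import math

def mixture_density(x, n, c):
    """Density at x of the equal-weight mixture of N(0, (c*i)^2), i = 1..n."""
    total = 0.0
    for i in range(1, n + 1):
        s = c * i
        total += math.exp(-x * x / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)
    return total / n

n, c = 10_000, 1.0
p0 = mixture_density(0.0, n, c)
p1 = mixture_density(1.0, n, c)

# p(0) is a harmonic sum, so p(0) / (log n / (c n)) should stay bounded.
scaled_p0 = p0 / (math.log(n) / (c * n))
# p(0) - p(1) is a convergent sum divided by n, so n * (p(0) - p(1)) stays bounded.
scaled_gap = n * (p0 - p1)
```

The gap is $O(1/n)$ because $1 - e^{-u} \le u$ turns each term of $p(0) - p(1)$ into a multiple of $1/i^3$, which is summable.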

For $\alpha \ge 1$, the upper bound follows from the fact that at least $c \log n$ distributions have small variance 1. Thus the interval $[-1, 1]$ contains more than 0.6 probability of at least $c \log n$ distributions. The lower bound follows by noting that the density at $0$ is $\frac{c\log n}{n}\cdot\frac{1}{\sqrt{2\pi}} + \frac{n - c\log n}{n}\cdot\frac{1}{n^{\alpha}} = \Theta\left(\frac{\log n}{n}\right)$. For $\alpha < 1$, the density at $0$ is $\frac{c\log n}{n}\cdot\frac{1}{\sqrt{2\pi}}$ ...

$$\cdots\; 144\sqrt{0.5\,V n R^*_r \log n} + nR^*_r \le nR^*_r\left(144\sqrt{\frac{0.5\,V\log n}{nR^*_r}} + 1\right) \le nR^*_r\left(144\sqrt{\frac{0.5\,V\log n}{1300\,V\log n}} + 1\right) < 6nR^*_r.$$

Thus, $\frac{ntR^*_r}{2v} > \frac{t}{12}$, so

$$\log\left(1 + 2\log\left(1 + \frac{ntR^*_r}{2v}\right)\right) \ge \log\left(1 + 2\log\left(1 + \frac{t}{12}\right)\right) \ge \frac{t}{50}, \tag{28}$$

using the fact that $t \le 1$. Now suppose $nR^*_r \ge C_t V^2 \log n$ for the constant $C_t = \left(\frac{144}{t}\right)^2$. Note that for t≤ ...

Footnote 2: Note that the definition of $Z$ in Theorem 15 has a factor of $1/\sqrt{n}$ as opposed to the factor of $1/n$ here.

$$\cdots\; 0.5\,t\,nR^*_r = \frac{144\sqrt{0.5\,V\log n}}{t\sqrt{nR^*_r}} \le \frac{144\sqrt{0.5\,V\log n}}{t\sqrt{0.5\,C_t V\log n}} = \frac{144}{t\sqrt{C_t}} < 1.$$

Now we have all the ingredients required for the application of Theorem 12.9:

$$P\{Z \ge tR^*_r\} \le P\{Z \ge \mathbb{E}Z + 0.5\,tR^*_r\} \le \exp\left(-\frac{ntR^*_r}{4}\log\left(1 + 2\log\left(1 + \frac{ntR^*_r}{2v}\right)\right)\right) \le \exp\left(-\frac{1}{200}\,nt^2R^*_r\right),$$

where the last inequality follows by inequality (28). An identical argument c...

Analogously to Proposition 1, we have $r_{C\log n} = \Theta\left(\frac{C\sigma\log n}{n}\right)$. Inequality (30) then gives the result.

The bound of $\tilde{O}(n)$ follows by inequality (30) and noting that $r_{C\log n} = O(1)$ for a fixed $C$ and sufficiently small $c > 0$. We now focus on how to obtain the tighter bound of $O(n^{\epsilon})$ for an $\epsilon > 0$, using inequality (5). Let $\tilde{R}_i(f)$ be the expectation of $f$ under $P_i$, i.e., $\tilde{R}_i(f) = \mathbb{E}f(X_i)$. Fix an $\epsilon > 0$. Let $r' = n^{\epsilon}$ and $r = 1$. Then it suffices to show that $R^*_r - R(f_{r',r}$...

For $\alpha < 1$, let $r' = \Theta(n^{\alpha})$. Then it is easy to see that $R(f_{r',r}) \le \frac{R^*_r}{2}$. This follows by observing that the density of a Gaussian distribution decreases by more than half at a distance of $\sigma$ from the mean. For $\alpha \ge 1$, let $r' = 10$. Then $R^*_r \ge 0.5\,\frac{C\log n}{n}$, as a Gaussian distribution contains about 0.68 mass within 1 standard deviation of the mean. Moreover, ...

Moreover, we have the following straightforward relations:

For every constant $C'$, there exists another constant $C > 0$ such that $R^*_J + C\sqrt{\frac{R^*_J}{n}} \ge R^*_1 + C'\sqrt{\frac{R^*_1}{n}}$.

Define the following random variables: $Z_1 = \sup_{f \in K} R_n(f)$ and $Z_2 = \sup_{f \in J} R_n(f)$. These relations suffice for showing that $Z_1 < Z_2$ with constant probability. To this end, we would show that with constant probability both (1) Z1 = R∗1 + O(√R∗1...

$P\left(|\hat{\mu}_{M,r}| \ge \frac{b - a}{2}\right) \ge c > 0$

$P\left(|\hat{\mu}_{M,r}| \ge \frac{b - a}{2} \,\middle|\, Z \ge k\right) \ge c > 0$.

Lemmas 14, 15, and 16 give us the required lower bound on the probability of error. Let $\hat{\mu}_{M,1,J} := \arg\max_{f \in J} R_n(f)$. Clearly, we can write

$$P\left\{|\hat{\mu}_{M,r}| \ge \frac{n^{\alpha}}{2}\right\} = P\left\{Z_1 < Z_2,\ |\hat{\mu}_{M,1,J}| \ge \frac{n^{\alpha}}{2}\right\} = \sum_{S \subset [n]} P(E_S)\,P\left(Z_1 \le Z_2,\ |\hat{\mu}_{M,1,J}| \ge \frac{n^{\alpha}}{2} \,\middle|\, E_S\right)$$

$\ge \sum_{S \subset [n]:\,|S| \le nP(A)} P(E_S)\,P\big(Z_1 \le nR^* + C\sqrt{nR^*},\ Z_2 \ge nR^* + C\sqrt{nR^*},$...

As in the proof of Proposition 1, we have $r_k = \Theta\left(\frac{\sigma k}{n}\right)$ for small $k$.

By Lemma 1(i), we have $r_{2\sqrt{n\log n}} \le 2\sigma_{(4\sqrt{n\log n})} = O\left(\sqrt{n\log n}\right)$.

Note that for any fixed $k$, the value of $r_k$ for Example 3 is smaller than the value of $r_k$ for Example 1 with $\sigma = n^{\alpha}$. Thus, we have $r_{2\sqrt{n\log n}} = O\left(\frac{n^{\alpha}\sqrt{n\log n}}{n}\right) = O\left(n^{\alpha - 0.5}\log n\right)$.

E.3 Proof of Lemma 8

We first show (i). Note that by Lemma 19, we know that for a fixed $i$, we have $0 \in [\min(S_{k,i}), \max(S_{k,i})]$ with probability at least $1 - 2\exp(-k^2/n)$. Taki...

$\hat{\mu}_{S,k_2} \in [\min(S_{k_1}), \max(S_{k_1})]$. Then the output is the shortest gap estimator itself, so $|\hat{\mu}_{k_1,k_2}| = |\hat{\mu}_{S,k_2}|$.
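The case analysis can be pictured with a small sketch of a hybrid rule. Everything below is an assumption made for illustration rather than the paper's definition: `k_median_set` takes $S_k$ to be the $k$ sample points closest to the sample median, `shorth_estimate` takes the shortest gap estimator to be the midpoint of the shortest window containing $k$ points, and the clipping used when the shortest gap estimate falls outside $[\min(S_{k_1}), \max(S_{k_1})]$ is a placeholder for whatever the second case prescribes.

```python
import statistics

def k_median_set(xs, k):
    """Assumed form of S_k: the k sample points closest to the sample median."""
    med = statistics.median(xs)
    return sorted(sorted(xs, key=lambda x: abs(x - med))[:k])

def shorth_estimate(xs, k):
    """Assumed shortest gap estimator: midpoint of the shortest window of k points."""
    s = sorted(xs)
    best = min(range(len(s) - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return 0.5 * (s[best] + s[best + k - 1])

def hybrid_estimate(xs, k1, k2):
    """Return the shortest gap estimate if it lies in the k1-median range."""
    s_k1 = k_median_set(xs, k1)
    lo, hi = s_k1[0], s_k1[-1]
    mu_s = shorth_estimate(xs, k2)
    if lo <= mu_s <= hi:
        return mu_s  # first case: the output is the shortest gap estimator itself
    return min(max(mu_s, lo), hi)  # second case: clipping is a hypothetical choice

# First case: a tight cluster near 0 sits inside the k1-median range.
inside = hybrid_estimate(
    [-50, -30, -10, -0.02, -0.01, 0.0, 0.01, 0.02, 10, 30, 50], k1=7, k2=3
)
# Second case: the tightest cluster lies far outside the k1-median range.
outside = hybrid_estimate([0, 1, 2, 3, 4, 100.0, 100.001, 100.002], k1=4, k2=3)
```

In the first call the shortest gap estimate is returned unchanged; in the second it is pulled back to the upper end of the median-based range.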

$\hat{\mu}_{S,k_2} \notin [\min(S_{k_1}), \max(S_{k_1})]$. We will use the following lemma, a slight generalization of Lemma 4.1 from Chierichetti et al. [8]. The lemma states that for sufficiently large values of $k$, the $k$-median contains the true mean, $\mu^* = 0$, with high probability.

Lemma 19. Let $S_k$ be the output of the $k$-median algorithm. Then with probability at least 1− 2 exp...

1. For each $f \in F$, we have $\frac{R^*_J}{2} \le R(f) \le R^*_J$.
2. $|F| \le \frac{2}{R^*_J}$.
3. $F$ covers $J$ in the sense that $\forall f \in J$, $\exists f_1, f_2 \in F$: $f(x) \le f_1(x) + f_2(x)$.

It follows that if any interval in $J$ contains at least $k$ points, then at least one interval in $F$ contains at least $\frac{k}{2}$ points. We construct $F$ of cardinality $|F| = \left\lceil \frac{1}{R^*_J} \right\rceil \le \frac{2}{R^*_J}$, as follows: To create the first interval (i = 1...

and the last inequality follows by the assumption $p = \Omega\left(\frac{\log n}{n}\right)$. Combining the inequalities, we conclude that

$$\min_{\hat{\mu}} \max_{\{P_i\} \subseteq P(s_1, s_2, p)} \mathbb{E}\left[\|\hat{\mu} - \mu\|_2\right] \ge s\left(\min_{\hat{\mu}} \max_{\mu} P_{\mu}^{\mathrm{Bin}}\left(\|\hat{\mu} - \mu\|_2 \ge s\right) - 2\exp(-c'\log n)\right).$$

Thus, it suffices to find $s$ such that the expression $\min_{\hat{\mu}} \max_{\mu} P_{\mu}^{\mathrm{Bin}}(\|\hat{\mu} - \mu\|_2 \ge s)$ can be lower-bounded by a constant. For part (i), using stan...

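and the formula (39), we have

$$A'_z = \frac{(p')^2}{1 - p'}\left(\frac{\sigma_1}{\sqrt{2\pi}\,\sigma_2^2}\right)^{d} \int \exp\left(-\frac{\|x - \mu_2\|_2^2}{\sigma_2^2} + \frac{\|x - \mu_2\|_2^2}{\sigma_1^2} - \frac{\|x - \mu_1\|_2^2}{2\sigma_1^2}\right) dx$$

$$= \frac{(p')^2}{1 - p'}\left(\frac{\sigma_1}{\sqrt{2}\,\sigma_2^2\sqrt{\frac{1}{\sigma_2^2} - \frac{1}{2\sigma_1^2}}}\right)^{d} \exp\left(\frac{1}{4\left(\frac{1}{\sigma_2^2} - \frac{1}{2\sigma_1^2}\right)}\left\|\frac{2\mu_2}{\sigma_2^2} - \frac{2\mu_2}{\sigma_1^2} + \frac{\mu_1}{\sigma_1^2}\right\|_2^2 - \frac{\mu_2^T\mu_2}{\sigma_2^2} + \frac{\mu_2^T\mu_2}{\sigma_1^2} - \frac{\mu_1^T\mu_1}{2\sigma_1^2}\right)$$

$= \frac{(p')^2}{1 - p'}$ ...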

is therefore maximized when $\beta = \beta^*$. On the other hand, we can also argue that the maximizer is unique. Indeed, suppose $\beta \in \mathbb{R}^d$ were such that $\beta \ne \beta^*$. The set $S := \left\{\{x_i\}_{i=1}^n \subseteq (\mathbb{R}^d)^n : x_i^T(\beta - \hat{\beta}) = 0 \ \forall i\right\}$ has Lebesgue measure 0. We can write

$\mathbb{E}\left[\sum_{i=1}^n \mathbb{E}\left[1\left\{|y_i - x_i^T\beta| \le r\right\} \,\middle|\, \{x_i\}_{i=1}^n\right]\right] = \int_{\{x_i\} \in S} \mathbb{E}\left[1\left\{|y_i - x_i^T\beta| \le r\right\} \,\middle|\, \{x_i\}_{i=1}^n\right] dP(\{x_i\}) +$ ...