Estimating location parameters in entangled single-sample distributions
Pith reviewed 2026-05-25 01:56 UTC · model grok-4.3
The pith
A hybrid estimator adapts to heterogeneity to estimate the common mean from non-identical symmetric unimodal distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an estimator that adapts to the level of heterogeneity in the data, achieving near-optimality in both the i.i.d. setting and some heterogeneous settings, where the fraction of low-noise points is as small as log n / n. Our estimator is a hybrid of the modal interval, shorth, and median estimators from classical statistics; however, the key technical contributions rely on novel empirical process theory results that we derive for independent but non-i.i.d. data. In the multivariate setting, we generalize our theory to mean estimation for mixtures of radially symmetric distributions, and derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales
What carries the argument
hybrid of the modal interval, shorth, and median estimators from classical statistics, supported by novel empirical process theory results for independent but non-i.i.d. data
If this is right
- The estimator is near-optimal in the i.i.d. setting.
- It remains near-optimal when the fraction of low-noise points is as small as log n / n.
- In the multivariate case it applies to mean estimation for mixtures of radially symmetric distributions.
- Minimax lower bounds hold for any estimator agnostic to individual data-point scales.
- The method extends to linear regression with computationally feasible polynomial-time versions.
Where Pith is reading between the lines
- The non-i.i.d. empirical process results may apply to other estimation tasks with independent but heterogeneous samples.
- Adaptation to heterogeneity could be explored for estimating other parameters such as variance in similar single-sample settings.
- The lower bounds suggest that access to per-point scale information would be needed to improve rates beyond the agnostic case.
- Polynomial-time versions make the approach suitable for testing on large heterogeneous datasets.
Load-bearing premise
Samples are drawn independently from symmetric, unimodal distributions that share a common mean.
What would settle it
An experiment showing the estimator fails to achieve the claimed near-optimal rate when the fraction of low-noise points drops below log n / n while distributions remain symmetric and unimodal.
Figures
read the original abstract
We consider the problem of estimating the common mean of independently sampled data, where samples are drawn in a possibly non-identical manner from symmetric, unimodal distributions with a common mean. This generalizes the setting of Gaussian mixture modeling, since the number of distinct mixture components may diverge with the number of observations. We propose an estimator that adapts to the level of heterogeneity in the data, achieving near-optimality in both the i.i.d. setting and some heterogeneous settings, where the fraction of ``low-noise'' points is as small as $\frac{\log n}{n}$. Our estimator is a hybrid of the modal interval, shorth, and median estimators from classical statistics; however, the key technical contributions rely on novel empirical process theory results that we derive for independent but non-i.i.d. data. In the multivariate setting, we generalize our theory to mean estimation for mixtures of radially symmetric distributions, and derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we describe an extension of our estimators applicable to linear regression. In the multivariate mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript considers estimating the common mean of n independent samples drawn from symmetric unimodal distributions that may be non-identical (generalizing Gaussian mixtures to diverging components). It proposes a hybrid estimator combining the modal interval, shorth, and median, which adapts to the level of heterogeneity and is claimed to achieve near-optimal rates both in the i.i.d. case and in heterogeneous regimes where the fraction of low-noise observations can be as small as log n / n. The central technical contributions are new empirical process bounds derived for the independent non-i.i.d. setting; the work also provides multivariate extensions to radially symmetric mixtures, minimax lower bounds for scale-agnostic estimators, an extension to linear regression, and polynomial-time computational versions of the estimators.
Significance. If the stated rates and bounds hold, the paper makes a meaningful contribution to robust location estimation by handling extreme heterogeneity with a single adaptive procedure. The novel non-i.i.d. empirical process results and the minimax lower bounds for scale-agnostic estimators are technically useful; the computational feasibility in the multivariate and regression settings strengthens the practical relevance.
minor comments (3)
- The abstract and introduction state the near-optimality claims at a high level; adding a brief outline of the key empirical-process argument (even without full proofs) would improve readability for readers outside the immediate subfield.
- Notation for the hybrid estimator and the heterogeneity parameter should be introduced consistently before the main theorems to avoid forward references.
- In the multivariate section, clarify whether the radial symmetry assumption is used only for the lower bounds or also for the upper bounds on the estimator.
Simulated Author's Rebuttal
We thank the referee for the positive summary and recommendation of minor revision. The assessment correctly identifies the main contributions, including the adaptive hybrid estimator, non-i.i.d. empirical process bounds, multivariate and regression extensions, and minimax lower bounds. No major comments were listed in the report.
Circularity Check
No significant circularity; derivation relies on novel bounds
full rationale
The manuscript derives new empirical process bounds for independent non-i.i.d. symmetric unimodal distributions and combines them with classical modal/shorth/median estimators to obtain an adaptive estimator. The abstract explicitly states that the key technical contributions are these novel results rather than reductions of prior fitted quantities or self-citations. No equations or steps are described that equate a prediction to its own input by construction, import uniqueness from the authors' prior work, or smuggle an ansatz via citation. The adaptation to heterogeneity down to log n / n fraction is presented as following from the new bounds and separately derived minimax lower bounds. This matches the default expectation for a self-contained theoretical derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Data points are independently sampled from symmetric, unimodal distributions sharing a common mean.
Reference graph
Works this paper leans on
-
[1]
C. Abraham, G. Biau, and B. Cadre. On the asymptotic propert ies of a simple estimate of the mode. ESAIM: Probability and Statistics , 8:1–11, 2004
work page 2004
-
[2]
D. Achlioptas and F. McSherry. On spectral learning of mi xtures of distributions. In Interna- tional Conference on Computational Learning Theory , pages 458–469. Springer, 2005
work page 2005
-
[3]
P. K. Agarwal and M. Sharir. Efficient algorithms for geome tric optimization. ACM Computing Surveys, 30(4):412–458, 1998
work page 1998
-
[4]
D. F. Andrews, P. J. Bickel, F. R. Hampel, P. J. Huber, W. H. R ogers, and J. W. Tukey. Robust Estimates of Location: Survey and Advances . Princeton University Press, 1972
work page 1972
-
[5]
S. Arora and R. Kannan. Learning mixtures of arbitrary Ga ussians. In Proceedings of the 33rd annual ACM Symposium on Theory of Computing , pages 247–257, 2001
work page 2001
-
[6]
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 1 edition, 4 2016
work page 2016
-
[7]
H. Chernoff. Estimation of the mode. Annals of the Institute of Statistical Mathematics , 16(1):31–41, dec 1964
work page 1964
-
[8]
F. Chierichetti, A. Dasgupta, R. Kumar, and S. Lattanzi. Learning entangled single-sample Gaussians. In Proceedings of the 25th Annual Symposium on Discrete Algori thms, SODA, pages 511–522, 2014
work page 2014
- [9]
-
[10]
S. Dasgupta and S. Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems , pages 2555–2563, 2014
work page 2014
- [11]
-
[12]
D. Eppstein and J. Erickson. Iterated nearest neighbor s and finding minimal polytopes. Discrete & Computational Geometry , 11(3):321–350, 1994
work page 1994
-
[13]
S. R. Flaxman, D. B. Neill, and A. J. Smola. Gaussian proce sses for independence tests with non-iid data in causal inference. ACM Transactions on TIST , 7(2):22, 2016
work page 2016
-
[14]
P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics , 35(1):73–101, 1964
work page 1964
-
[15]
H. Jiang. Uniform convergence rates for kernel density estimation. In Proceedings of the 34th International Conference on Machine Learning , pages 1694–1703, 2017
work page 2017
- [16]
- [17]
-
[18]
K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation o f mean and covariance. In 57th Annual Symposium on Foundations of Computer Science (FOCS) , pages 665–674, Oct 2016
work page 2016
- [19]
-
[20]
O. V. Lepskii. On a problem of adaptive estimation in Gau ssian white noise. Theory of Probability & Its Applications , 35(3):454–466, 1991
work page 1991
-
[21]
Jiange Li, Arnaud Marsiglietti, and James Melbourne. F urther investigations of r \’enyi en- tropy power inequalities and an entropic characterization of s-concave densities. arXiv preprint arXiv:1901.10616, 2019
-
[22]
B. G. Lindsay. Mixture models: Theory, geometry and appl ications. In NSF-CBMS Regional Conference Series in Probability and Statistics , pages i–163. JSTOR, 1995
work page 1995
-
[23]
R. Y. Liu. Bootstrap procedures under some non-iid model s. The Annals of Statistics , 16(4):1696–1708, 1988
work page 1988
- [24]
-
[25]
A Unified Approach to Robust Mean Estimation
A. Prasad, S. Balakrishnan, and P. Ravikumar. A unified ap proach to robust mean estimation. arXiv preprint arXiv:1907.00927 , 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[26]
G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eige nvalue properties for correlated Gaussian designs. Journal of Machine Learning Research , 11(Aug):2241–2259, 2010
work page 2010
-
[27]
I. Steinwart and A. Christmann. Fast learning from non- iid observations. In Advances in NIPS, pages 1768–1776, 2009
work page 2009
- [28]
-
[29]
S. A. Van de Geer. Empirical Processes in M -Estimation, volume 6. Cambridge University Press, 2000
work page 2000
-
[30]
A. Van Der Vaart and J. A. Wellner. A note on bounds for VC d imensions. Institute of Mathematical Statistics Collections , 5:103, 2009
work page 2009
- [31]
-
[32]
M.J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint . Cambridge Series in Statistical and Probabilistic Mathematics. Camb ridge University Press, 2019
work page 2019
-
[33]
R. S. Wenocur and R. M. Dudley. Some special Vapnik-Cher vonenkis classes. Discrete Math- ematics, 33(3):313–318, 1981
work page 1981
-
[34]
T. Zhu, P. Xiong, G. Li, and W. Zhou. Correlated different ial privacy: Hiding information in non-iid data set. IEEE Transactions on Information Forensics and Security , 10(2):229–242, 2015. 33 A Properties of symmetric distributions In this Appendix, we derive the lemmas concerning propertie s of symmetric distributions (when d = 1) and radially symmetric...
work page 2015
-
[35]
The desired resu lt then follows by Proposition 8 in Li et al
Note that R(fx,r ) can be written as convolution of P with indicator function of Br, both of which are unimodal and radially symmetric. The desired resu lt then follows by Proposition 8 in Li et al. [ 21], which implies that R(fx,r ) is also unimodal and radially symmetric
-
[36]
This follows from the nonnegativity of the density
-
[37]
R∗ r can be written as R∗ r = C ∫ r 0 p(s)sd− 1ds where C is a constant for a fixed dimension
As P is radially symmetric, let the density of P at x be given by p(∥x∥). R∗ r can be written as R∗ r = C ∫ r 0 p(s)sd− 1ds where C is a constant for a fixed dimension. Define g(r) := R∗ r Cr d =∫ r 0 p(s)sd− 1ds rd forr> 0. Property (iii) is equivalent to showing that d drg(r)< 0. By unimodality of p(·), it follows that g(r)> p(r) d . Differentiating g(·), ...
-
[38]
Furthermore, by Lemma 6(i) above, we know that R(fx,r 1)≥ R(fr2,r 1) when∥x∥2≤ r2
Note that any r1-packing of B(0,r 2− r1) has the property that all balls in the packing must be entirely contained within the larger ball Br2. Furthermore, by Lemma 6(i) above, we know that R(fx,r 1)≥ R(fr2,r 1) when∥x∥2≤ r2. Hence, by summing up the densities of all balls in the packing, we obtain R(f0,r 2)≥ P (Br2− r1,r 1)R(fr2,r 1), from which the first...
-
[39]
The second inequality follows by noting that E∥Xi− µ∥2 2 = Tr(Σi) = dσ 2 i
The proof of the first inequality is the same as the proof of t he corresponding statement in Lemma 1. The second inequality follows by noting that E∥Xi− µ∥2 2 = Tr(Σi) = dσ 2 i . By Chebyshev’s inequality, we have ˜Ri(f0, 2σ i √ d) = P(∥Xi− µ∥2≤ 2 √ dσi)≥ 3 4, for each i. Thus, B2σ (2k) √ d covers at least 3 4 of the mass of at least 2k distributions, imp...
-
[40]
The lower bound follows by noting that the density at 0 is 1√ 2πσ . The upper bound follows by noting that density at x =|σ| is within constant factor of the density at 0
-
[41]
The lower bound follows by noting that the density at x = 0 is p(0) = ( n∑ i=1 1√ 2πcin ) = Θ (logn cn ) . The upper bound follows by noting that the density at x = 1 is p(1) = ( n∑ i=1 e− 1 i2c2 √ 2πcin ) ≥ p(0)−O (1 n )
-
[42]
Thus the interval [− 1, 1] contains more than 0
For α≥ 1, the upper bound follows from the fact that at least c logn distributions have small variance 1. Thus the interval [− 1, 1] contains more than 0. 6 probability of at least c logn distributions. The lower bound follows by noting that the de nsity at 0 is c logn n 1√ 2π + n− c logn n 1 nα = Θ (logn n ) . Forα< 1, the density at 0 is c logn n 1√ 2π ...
-
[43]
5VnR ∗r logn +nR∗ r≤ nR∗ r ( 144 √
-
[44]
5V logn nR∗r + 1 ) ≤ nR∗ r ( 144 √
-
[45]
5V logn 1300V logn + 1 ) < 6nR∗ r. 2Note that the definition of Z in Theorem 15 has a factor of 1/√n as opposed to the factor of 1/n here. 37 Thus, ntR∗ r 2v > t 12 , so log ( 1 + 2 log ( 1 + ntR∗ r 2v )) ≥ log ( 1 + 2 log ( 1 + t 12 )) ≥ t 50, (28) using the fact that t≤ 1. Now suppose nR∗ r ≥ Ct V 2 logn for the constant Ct = (144 t )2. Note that for t≤ ...
-
[46]
5tnR∗r = 144√ 0. 5V logn t √ nR∗r ≤ 144√ 0. 5V logn t√ 0. 5CtV logn = 144 t√Ct < 1. Now we have all the ingredients required for the application of Theorem 12.9 : P{Z≥ tR∗ r}≤ P{Z≥ EZ + 0. 5tR∗ r} ≤ exp ( − ntR∗ r 4 log ( 1 + 2 log ( 1 + ntR∗ r 2v ))) ≤ exp ( − 1 200nt2R∗ r ) , where the last inequality follows by inequality ( 28). An identical argument c...
-
[47]
Inequality ( 30) then gives the result
Analogously to Proposition 1, we have rC log n = Θ ( Cσ log n n ) . Inequality ( 30) then gives the result. 40
-
[48]
We now focus on how to obtain the tighter bound of O(nǫ) for an ǫ> 0, using inequality ( 5)
The bound of ˜O(n) follows by inequality ( 30) and noting that rC log n =O(1) for a fixed C and sufficiently small c >0. We now focus on how to obtain the tighter bound of O(nǫ) for an ǫ> 0, using inequality ( 5). Let ˜Ri(f ) be the expectation of f under Pi, i.e., ˜Ri(f ) = Ef (Xi). Fix an ǫ> 0. Let r′ =nǫ andr = 1. Then it suffices to show that R∗ r− R(fr′,r...
-
[49]
Then it is easy that R(fr′,r )≤ R∗ r 2
For α< 1, let r′ = Θ (nα ). Then it is easy that R(fr′,r )≤ R∗ r 2 . This follows by observing that the density of a Gaussian distribution decreases by more tha n half at a distance of σ from the mean. Forα≥ 1, let r′ = 10 . Then R∗ r≥ 0. 5 C log n n , as a Gaussian distribution contains about 0. 68 mass within 1 standard deviation of the mean. Moreover, ...
-
[50]
Moreover, we have the following straightforward relations:
-
[51]
For every constant C ′, there exists another constant C >0 such that R∗ J +C (√ R∗ J n ) ≥ R∗ 1 +C ′ (√ R∗ 1 n ) . Define the following random variables: Z1 = sup f ∈K Rn(f ), Z 2 = sup f ∈J Rn(f ) These relations suffice for showing that Z1 < Z2 with constant probability. To this end, we would show that with constant probability both (1) Z1 =R∗ 1 +O (√ R∗ 1...
-
[52]
P(|ˆµ M,r|≥ b− a 2 )≥ c> 0
-
[53]
Lemmas 14, 15, and 16 give us the required lower bound on the probability of error
P(|ˆµ M,r|≥ b− a 2 |Z≥ k)≥ c> 0. Lemmas 14, 15, and 16 give us the required lower bound on the probability of error. Let ˆµ M, 1, J := arg maxf ∈J Rn(f ). Clearly, we can write P { |ˆµ M,r|≥ nα 2 } = P { Z1 <Z 2,|ˆµ M, 1, J|≥ nα 2 } = ∑ S⊂ [n] P(ES)P ( Z1≤ Z2,|ˆµ M, 1, J|≥ nα 2 ⏐ ⏐ ⏐ ⏐ES ) ≥ ∑ S⊂ [n]:|S|≤ nP(A) P(ES)P ( Z1≤ nR∗ +C √ nR∗,Z 2≥ nR∗ +C √ nR∗,...
-
[54]
As in the proof of Proposition 1, we have rk = Θ (σk n ) for small k
-
[55]
By Lemma 1(i), we have r2√ n log n≤ 2σ(4√ n log n) =O (√ n logn)
-
[56]
Thus, we have r2√ n log n =O ( nα√ n log n n ) =O ( nα − 0
Note that for any fixed k, the value of rk for Example 3 is smaller than the value of rk for Example 1 with σ =nα . Thus, we have r2√ n log n =O ( nα√ n log n n ) =O ( nα − 0. 5 logn ) . E.3 Proof of Lemma 8 We first show (i). Note that by Lemma 19, we know that for a fixedi, we have 0∈ [min(Sk,i, max(Sk,i ))] with probability at least 1− 2 exp(−k2/n ). Taki...
-
[57]
Then the output is the shortest gap estimator itself, so |ˆµ k1,k 2|=|ˆµ S,k 2|
ˆµ S,k 2∈ [min(Sk1), max(Sk1)]. Then the output is the shortest gap estimator itself, so |ˆµ k1,k 2|=|ˆµ S,k 2|
-
[58]
We will use the following lemma, a slight generalization of L emma 4
ˆµ S,k 2̸∈ [min(Sk1), max(Sk1)]. We will use the following lemma, a slight generalization of L emma 4. 1 from Chierichetti et al. [8]. The lemma states that for sufficiently large values of k, the k-median contains the true mean, µ ∗ = 0, with high probability. Lemma 19. Let Sk be the output of the k-median algorithm. Then with probability at least 1− 2 exp...
-
[59]
For each f∈F , we have R∗ J 2 ≤ R(f )≤ R∗ J . 2.|F|≤ 2 R∗ J . 3.F coversJ in the sense that ∀f∈J ,∃f1,f 2∈F :f (x)≤ f1(x) +f2(x). It follows that if any interval in J contains at least k points, then at least one interval in F contains at least k 2 points. We construct F of cardinality|F|=⌈ 1 R∗ J ⌉≤ 2 R∗ J , as follows: To create the first interval (i = 1...
-
[60]
and the last inequality follows by the assumption p = Ω ( log n n ) . Combining the inequalities, we conclude that min ˆµ max {Pi}⊆P (s1,s 2,p ) E[∥ˆµ− µ∥2]≥ s ( min ˆµ max µ Pµ Bin(∥ˆµ− µ∥2≥ s)− 2 exp(−c′logn) ) . Thus, it suffices to find s such that the expression minˆµ maxµ Pµ Bin(∥ˆµ− µ∥2≥ s) can be lower- bounded by a constant. For part (i), using stan...
-
[61]
and the formula ( 39), we have A′ z = (p′)2 1− p′ ( σ1√ 2πσ 2 2 )d ∫ exp (−∥x− µ 2∥2 2 σ 2 2 +∥x− µ 2∥2 2 σ 2 1 − ∥x− µ 1∥2 2 2σ 2 1 ) dx = (p′)2 1− p′ σ1 √ 2σ 2 2 √ 1 σ 2 2 − 1 2σ 2 1 d exp 1 4 ( 1 σ 2 2 − 1 2σ 2 1 ) 2µ 2 σ 2 2 − 2µ 2 σ 2 1 + µ 1 σ 2 1 2 2 − µ T 2µ 2 σ 2 2 + µ T 2µ 2 σ 2 1 − µ T 1µ 1 2σ 2 1 = (p′)2 1− p′ ...
-
[62]
On the other hand, we can also argue that the maximizer is unique
is therefore maximized when β = β ∗. On the other hand, we can also argue that the maximizer is unique . Indeed, suppose β∈ Rd were such that β̸= β ∗. The set S := { {xi}n i=1⊆ (Rd)n :xT i (β− ˆβ ) = 0 ∀i } has Lebesgue measure 0. We can write E [ n∑ i=1 E [ 1 { |yi− xT i β|≤ r } |{xi}n i=1 ] ] = ∫ {xi}∈S E [ 1 { |yi− xT i β|≤ r } |{xi}n i=1 ] dP({xi}) + ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.