Estimating the Wasserstein barycenter of one-dimensional distributions under sparse sampling
Pith reviewed 2026-06-27 15:15 UTC · model grok-4.3
The pith
The marginal-constructed barycenter estimator recovers the population mean of unit-level quantiles from sparse samples by estimating their distribution via marginal CDF mixtures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The distribution of latent unit-level quantiles at a fixed level equals a functional of the marginal distributions of the unit-level CDF values at that level; these marginals are estimable by binomial mixture methods, so their functional yields a consistent, asymptotically normal estimator for the Wasserstein barycenter quantile that is less biased than the empirical version when per-unit samples are small.
What carries the argument
The marginal-constructed barycenter (MCB) estimator, formed by taking the mean of the distribution of latent unit-level quantiles recovered from binomial mixture estimates of the marginal unit-level CDF distributions.
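To see why binomial mixture methods are a natural fit here, note that for a fixed point x, the number of a unit's m observations at or below x is Binomial(m, F_i(x)) given the latent CDF value F_i(x). The sketch below is not the paper's estimator. It only illustrates, under an assumed normal location model with hypothetical names (`count_at_or_below`, `first_moment`, `second_moment`), that the observed counts carry information about the marginal distribution of F_i(x): factorial moments of the counts recover the first m moments of that distribution.

```python
import random
from statistics import NormalDist, fmean

def count_at_or_below(sample, x):
    """Number of observations in `sample` that are <= x."""
    return sum(1 for value in sample if value <= x)

rng = random.Random(1)
n_units, m, x = 20000, 3, 0.0   # many units, only 3 draws per unit (assumed setup)
std_normal = NormalDist()

counts, latent_cdf = [], []
for _ in range(n_units):
    mu = rng.gauss(0.0, 1.0)                         # latent unit-level location
    sample = [rng.gauss(mu, 1.0) for _ in range(m)]  # sparse i.i.d. sample from unit i
    counts.append(count_at_or_below(sample, x))      # K_i | F_i(x) ~ Binomial(m, F_i(x))
    latent_cdf.append(std_normal.cdf(x - mu))        # F_i(x), unobserved in practice

# Factorial-moment identities for binomial mixtures:
#   E[K / m] = E[F_i(x)],   E[K (K - 1)] / (m (m - 1)) = E[F_i(x)^2]
first_moment = fmean(k / m for k in counts)
second_moment = fmean(k * (k - 1) / (m * (m - 1)) for k in counts)

# Oracle values computed from the latent CDF values, for comparison only.
oracle_first = fmean(latent_cdf)
oracle_second = fmean(v * v for v in latent_cdf)

print(f"E[F_i(x)]   from counts: {first_moment:.3f}, oracle: {oracle_first:.3f}")
print(f"E[F_i(x)^2] from counts: {second_moment:.3f}, oracle: {oracle_second:.3f}")
```

The counts never reveal any single F_i(x) accurately when m is small, yet their distribution across units still pins down features of the marginal law of F_i(x), which is the object the MCB construction needs.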
If this is right
- The MCB estimator is pointwise consistent for the population mean of the unit-level quantiles.
- It is asymptotically normal under the conditions established in the paper.
- It substantially reduces bias relative to the empirical Wasserstein barycenter when sampling per unit is sparse.
- The estimator can summarize within-participant distributions of viral sequence features from the HVTN 502/503 trials when only a few sequences are available per participant.
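The bias claim can be illustrated with a small simulation. This is a minimal sketch under an assumed model (normal units with random locations), not a reproduction of the paper's simulations; `empirical_quantile` and `simulate` are hypothetical helper names. With m = 3 observations per unit and level p = 0.9, the empirical quantile of each unit is simply its sample maximum, and averaging these across many units does not approach the mean of the latent quantiles.

```python
import math
import random
from statistics import NormalDist, fmean

def empirical_quantile(sample, p):
    """Generalized inverse of the empirical CDF: the ceil(m * p)-th order statistic."""
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * p))
    return ordered[k - 1]

def simulate(n_units, m, p, seed=0):
    """Return (mean of latent quantiles, empirical barycenter quantile) at level p."""
    rng = random.Random(seed)
    z_p = NormalDist().inv_cdf(p)
    true_q, emp_q = [], []
    for _ in range(n_units):
        mu = rng.gauss(0.0, 1.0)                         # latent unit-level location
        sample = [rng.gauss(mu, 1.0) for _ in range(m)]  # sparse sample from N(mu, 1)
        true_q.append(mu + z_p)                          # latent quantile Q_i(p)
        emp_q.append(empirical_quantile(sample, p))      # plug-in quantile for unit i
    return fmean(true_q), fmean(emp_q)

target, naive = simulate(n_units=20000, m=3, p=0.9)
print(f"mean of latent quantiles:      {target:.3f}")
print(f"empirical barycenter quantile: {naive:.3f}")
```

Increasing `n_units` while holding `m` fixed shrinks the Monte Carlo noise but leaves the gap between the two numbers in place, which is the sparse-sampling bias the MCB estimator is designed to avoid.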
Where Pith is reading between the lines
- If the marginal representation extends, the same binomial-mixture step could be used to estimate other features of the distribution of quantiles, such as its spread.
- The approach might apply to estimating other population functionals that involve averages of inverse CDFs under sparse per-unit observation.
- Analogous marginal constructions could be sought for Wasserstein barycenters on non-real-line spaces.
Load-bearing premise
The distribution of latent unit-level quantiles at a given quantile level can be written in terms of the marginal distributions of the unit-level CDF values.
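The premise can be written out in a few lines. The display below is a sketch using standard facts about generalized inverses, with notation chosen here for illustration ($G_p$ is not necessarily the paper's symbol); the mean formula assumes $\mathbb{E}|Q_i(p)| < \infty$.

```latex
% Unit-level quantile as the generalized inverse of the unit-level CDF
Q_i(p) = \inf\{x \in \mathbb{R} : F_i(x) \ge p\}, \qquad p \in (0, 1).

% Right-continuity of F_i gives the pointwise equivalence
Q_i(p) \le x \iff F_i(x) \ge p,

% so the distribution of the latent quantile depends only on the
% marginal law of the CDF value F_i(x) at each fixed x:
G_p(x) := \mathbb{P}\bigl(Q_i(p) \le x\bigr) = \mathbb{P}\bigl(F_i(x) \ge p\bigr).

% The target, the mean of the latent quantiles, is then a functional of G_p:
\mathbb{E}\bigl[Q_i(p)\bigr]
  = \int_0^{\infty} \bigl(1 - G_p(x)\bigr)\,dx - \int_{-\infty}^{0} G_p(x)\,dx.
```

Replacing $G_p$ with an estimate built from the fitted marginal distributions of $F_i(x)$ and taking the mean is the shape of the MCB construction as described above.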
What would settle it
A dataset or simulation in which the per-unit sample size remains fixed while the number of units grows and the MCB estimator fails to converge to the true barycenter quantile despite accurate recovery of the marginal CDF distributions.
Original abstract
We study distributional data under sparse sampling where each unit is represented by a probability distribution on the real line observed only through a small i.i.d. sample. A natural notion of central tendency for one-dimensional distributional data is the Wasserstein barycenter, whose quantile function is the pointwise average of the unit-level quantile functions. We focus on pointwise estimation of the Wasserstein barycenter quantile function: at a given quantile level, the target is the population mean of the corresponding unit-level quantiles. A naive plug-in estimator is the empirical Wasserstein barycenter, which treats observed unit-level empirical distributions as the true latent unit-level distributions. Under sparse sampling, however, this estimator can be severely biased. We propose an approach that avoids directly estimating either the unit-level distributions or the full population law of distributions. We start with the more ambitious goal of characterizing the distribution of latent unit-level quantiles at a given quantile level. We show that this distribution can be written in terms of the marginal distributions of the unit-level CDF values, which can be estimated using binomial mixture methods. This motivates our estimator, the marginal-constructed barycenter (MCB) estimator, obtained by taking the mean of the estimated distribution of latent unit-level quantiles. We establish conditions under which the MCB estimator is pointwise consistent and asymptotically normal, and show through simulations that it can substantially outperform the empirical Wasserstein barycenter under sparse sampling. We illustrate the method in an analysis of HIV-1 sequence data from the HVTN 502/503 vaccine efficacy trials, using the barycenter to summarize and compare within-participant distributions of viral sequence features when only a small number of sequences are available per participant.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the marginal-constructed barycenter (MCB) estimator for the quantile function of the Wasserstein barycenter of one-dimensional distributions observed under sparse sampling. It exploits the identity P(Q_i(p) ≤ x) = P(F_i(x) ≥ p) to express the distribution of latent unit-level quantiles in terms of marginal distributions of unit-level CDF values, which are estimated via binomial mixture methods; the MCB is then the mean of this estimated distribution. The authors state that they establish conditions for pointwise consistency and asymptotic normality of the MCB, demonstrate via simulations that it substantially outperforms the empirical Wasserstein barycenter under sparse sampling, and illustrate the method on HIV-1 sequence data from the HVTN 502/503 trials.
Significance. If the stated consistency and normality results hold under the supplied conditions, the MCB provides a bias-reduced estimator for distributional central tendency that avoids direct estimation of full unit-level distributions or the population law of distributions. This is a practical contribution for settings with limited per-unit samples. The approach receives credit for correctly leveraging a definitional identity without hidden dependence assumptions and for including both simulation comparisons and a real-data application.
major comments (2)
- [asymptotic results section] The section establishing asymptotic normality: the explicit regularity conditions on the binomial mixture estimators (including rates for estimating the marginal CDF distributions and any requirements on the number of samples per unit or support assumptions) must be stated with sufficient precision to verify the claimed asymptotic normality; without these, the support for the central theoretical claim remains incomplete.
- [consistency theorem] The consistency theorem: the paper asserts pointwise consistency under certain conditions, but the load-bearing step of propagating the mixture estimation error to the mean of the induced quantile distribution requires an explicit error bound or convergence argument that is not verifiable from the high-level description.
minor comments (2)
- [simulations] The simulation section should report the exact distributions, sample sizes per unit, and number of Monte Carlo replications to allow direct reproduction of the reported outperformance.
- [methodology] Notation for the binomial mixture parameters should be standardized and cross-referenced to the estimation procedure to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and agree that the theoretical sections require greater precision in the stated conditions and arguments.
- Referee: [asymptotic results section] The section establishing asymptotic normality: the explicit regularity conditions on the binomial mixture estimators (including rates for estimating the marginal CDF distributions and any requirements on the number of samples per unit or support assumptions) must be stated with sufficient precision to verify the claimed asymptotic normality; without these, the support for the central theoretical claim remains incomplete.
  Authors: We agree that the regularity conditions on the binomial mixture estimators must be stated with greater precision, including explicit rates for marginal CDF estimation, requirements on per-unit sample sizes, and support assumptions, to rigorously support the asymptotic normality claim. In the revised manuscript we will expand the asymptotic results section with a dedicated subsection listing these conditions and the associated convergence rates that close the argument.
  Revision: yes
- Referee: [consistency theorem] The consistency theorem: the paper asserts pointwise consistency under certain conditions, but the load-bearing step of propagating the mixture estimation error to the mean of the induced quantile distribution requires an explicit error bound or convergence argument that is not verifiable from the high-level description.
  Authors: We acknowledge that the consistency proof requires an explicit error bound showing how the mixture estimation error propagates to the mean of the induced quantile distribution. In the revision we will strengthen the consistency theorem by inserting a detailed convergence argument that derives the required bound from the uniform consistency of the binomial mixture estimators.
  Revision: yes
Circularity Check
No significant circularity
The paper's central construction uses the identity P(Q_i(p) ≤ x) = P(F_i(x) ≥ p), which is a direct consequence of the definitions of the quantile and CDF functions and holds without reference to the target estimator or any fitted quantities. Marginal distributions of the F_i(x) are recovered via binomial mixture methods applied to independent observations, after which the mean of the induced distribution of Q_i(p) is taken to form the MCB estimator. Consistency and asymptotic normality are stated as separate results under explicit conditions. No step reduces a claimed prediction to a fitted input by construction, no self-citation is load-bearing for the core identity or estimator, and the derivation remains self-contained against external statistical benchmarks.
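The claim that the identity follows from the definitions alone can be checked numerically on any single distribution. The sketch below does this for an empirical distribution, taking the quantile to be the generalized inverse (the smallest point at which the CDF reaches level p); `ecdf`, `quantile`, and `identity_holds` are hypothetical names introduced for this illustration.

```python
import random

def ecdf(sample, x):
    """Empirical CDF: fraction of observations that are <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)

def quantile(sample, p):
    """Generalized inverse: smallest sample point whose ECDF value is >= p."""
    ordered = sorted(sample)
    m = len(ordered)
    for k, value in enumerate(ordered, start=1):
        if k / m >= p:
            return value
    raise ValueError("p must lie in (0, 1]")

rng = random.Random(2)
sample = [rng.gauss(0.0, 1.0) for _ in range(7)]

levels = [i / 20 for i in range(1, 21)]             # quantile levels in (0, 1]
grid = [i / 10 for i in range(-30, 31)] + sample    # include the jump points themselves

# For every level p and point x, the events {Q(p) <= x} and {F(x) >= p} coincide.
identity_holds = all(
    (quantile(sample, p) <= x) == (ecdf(sample, x) >= p)
    for p in levels
    for x in grid
)
print("Q(p) <= x  <=>  F(x) >= p on the whole grid:", identity_holds)
```

Because the equivalence holds for each distribution separately, taking probabilities over the random unit-level distribution gives the population-level identity without any additional dependence assumption.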
Axiom & Free-Parameter Ledger
free parameters (1)
- mixture parameters in binomial model
axioms (1)
- domain assumption: The distribution of latent unit-level quantiles can be written in terms of the marginal distributions of the unit-level CDF values.