Delaunay Weighted Two-sample Test for High-dimensional Data by Incorporating Geometric Information
Pith reviewed 2026-05-24 02:13 UTC · model grok-4.3
The pith
A nonparametric test statistic from the Delaunay weight matrix is asymptotically normal under the null of equal distributions and consistent under alternatives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper constructs a nonparametric test statistic from the Delaunay weight matrix on the learned manifold. The statistic is asymptotically normal under the null hypothesis that the two high-dimensional samples come from the same distribution, and the test is consistent under the alternative that the distributions differ.
What carries the argument
The argument rests on the Delaunay weight matrix, obtained from a triangulation on the estimated low-dimensional manifold, which encodes geometric proximity through both distance and direction.
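As a rough illustration of how a triangulation turns points into a neighbor structure, the sketch below builds a binary Delaunay adjacency matrix for points in the plane using a brute-force empty-circumcircle check. It is not the paper's Delaunay weight, whose definition is not reproduced on this page, and it works in flat 2D rather than on a learned manifold. The function names `circumcircle` and `delaunay_adjacency` are hypothetical, and the O(n^4) loop is only suitable for small examples.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if a, b, c are collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_adjacency(points, tol=1e-9):
    """Binary adjacency matrix: 1 if two points share a Delaunay triangle.

    A triangle is kept when no other point lies strictly inside its
    circumcircle. Cocircular points (ties) are all kept, which is an
    illustrative simplification.
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i, j, k in combinations(range(n), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        empty = all(
            (points[m][0] - ux) ** 2 + (points[m][1] - uy) ** 2 >= r2 - tol
            for m in range(n)
            if m not in (i, j, k)
        )
        if empty:
            for p, q in combinations((i, j, k), 2):
                adj[p][q] = adj[q][p] = 1
    return adj


# Unit square plus its center: the center is linked to every corner,
# and the two diagonals between opposite corners are absent.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
adjacency = delaunay_adjacency(square)
```

For realistic sample sizes or dimensions a library routine such as `scipy.spatial.Delaunay` would be the natural replacement for the brute-force loop.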
If this is right
- The test gains power relative to distance-only methods when distribution differences have a directional component on the manifold.
- Moderate inaccuracies in manifold recovery do not destroy the asymptotic guarantees or practical performance.
- The procedure yields a usable test for real high-dimensional problems such as protein-expression comparisons.
- Large-sample p-values can be obtained directly from the normal limit without resampling.
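The last point above can be made concrete with a small sketch. Assuming the standardized statistic T_n/σ_n is approximately N(0, 1) under the null, a p-value follows from the normal tail alone. The function name `normal_limit_p_value` is hypothetical, and whether the paper's rejection region is one-sided or two-sided is not stated on this page, so both are offered as assumptions.

```python
import math


def normal_limit_p_value(t_n, sigma_n, two_sided=True):
    """p-value for the standardized statistic t_n / sigma_n under a N(0, 1) limit."""
    if sigma_n <= 0:
        raise ValueError("sigma_n must be positive")
    z = t_n / sigma_n
    if two_sided:
        # P(|Z| >= |z|) for a standard normal Z
        return math.erfc(abs(z) / math.sqrt(2.0))
    # P(Z >= z): upper-tail rejection region (an assumption about the test's direction)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_two_sided = normal_limit_p_value(2.5, 1.0)
p_upper_tail = normal_limit_p_value(2.5, 1.0, two_sided=False)
```

No permutation or bootstrap loop is involved, which is the practical appeal of a normal limit when it holds.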
Where Pith is reading between the lines
- The directional sensitivity may make the approach useful for problems involving oriented data or shape differences.
- The same weight construction could be inserted into other nonparametric procedures that currently rely on pairwise similarities.
- Performance comparisons with alternative manifold approximations or triangulation methods would clarify the method's sensitivity to implementation choices.
Load-bearing premise
The observed high-dimensional points lie on a low-dimensional manifold whose structure can be recovered accurately enough from the sample to produce reliable Delaunay weights.
What would settle it
A dataset generated from a distribution without low-dimensional manifold support in which the Delaunay-weighted statistic fails to converge in distribution to normality under the null of equal samples.
Original abstract
Two-sample hypothesis testing is a fundamental problem with various applications, which faces new challenges in the high-dimensional context. To mitigate the issue of the curse of dimensionality, high-dimensional data are typically assumed to lie on a low-dimensional manifold. To incorporate geometric information in the data, we propose to apply the Delaunay triangulation and develop the Delaunay weight to measure the geometric proximity among data points. In contrast to existing similarity measures that only utilize pairwise distances, the Delaunay weight can take both the distance and direction information into account. A detailed computation procedure is developed to learn the unknown manifold and approximate the Delaunay weight. We further propose a novel nonparametric test statistic using the Delaunay weight matrix. Asymptotic normality under the null and consistency under the alternative of the test statistic are developed. Applied to simulated data, the new test shows robustness to the learning of the unknown manifold and exhibits substantial power gain if the distributions differ in direction. The proposed test also shows great power on a real dataset of mice protein expression levels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a nonparametric two-sample test for high-dimensional data assumed to lie on a low-dimensional manifold. It defines Delaunay weights via triangulation to incorporate both distance and directional geometric information (beyond pairwise distances), develops a manifold-learning procedure to estimate these weights from data, constructs a test statistic T_n from the estimated weight matrix, and claims to establish asymptotic normality of T_n/σ_n under the null and consistency under the alternative. Simulations indicate robustness to manifold estimation and power gains when alternatives differ in direction; an application to mice protein expression data is included.
Significance. If the asymptotic claims hold with estimated weights, the method would offer a geometrically aware nonparametric test that can detect directional differences in high-dimensional settings where standard distance-based approaches lose power, addressing a practical gap in manifold-aware inference.
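The summary describes a statistic T_n built from an estimated weight matrix, but the exact form is not given on this page. As a stand-in, the sketch below uses a hypothetical cross-sample statistic: the total weight on ordered pairs of points with different sample labels. Its mean under random relabeling is available in closed form, while the variance here is estimated by Monte Carlo permutations, which is an assumption of this sketch and not the paper's analytic σ_n. All function names are hypothetical.

```python
import random


def cross_weight(W, labels):
    """Sum of W[i][j] over ordered pairs i != j whose 0/1 labels differ."""
    n = len(labels)
    return sum(
        W[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and labels[i] != labels[j]
    )


def permutation_null_mean(W, labels):
    """Exact mean of cross_weight when the labels are randomly permuted."""
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    total = sum(W[i][j] for i in range(n) for j in range(n) if i != j)
    # An ordered pair (i, j) has different labels with probability 2*n0*n1 / (n*(n-1)).
    return total * 2.0 * n0 * n1 / (n * (n - 1))


def standardized_statistic(W, labels, n_perm=2000, seed=0):
    """(observed - null mean) / null sd, with the sd estimated by permutations."""
    rng = random.Random(seed)
    observed = cross_weight(W, labels)
    mean = permutation_null_mean(W, labels)
    perm = list(labels)
    draws = []
    for _ in range(n_perm):
        rng.shuffle(perm)
        draws.append(cross_weight(W, perm))
    var = sum((d - mean) ** 2 for d in draws) / n_perm
    if var == 0:
        raise ValueError("statistic is constant under relabeling")
    return (observed - mean) / var ** 0.5


# Path graph 0-1-2-3 with the two samples at opposite ends.
path_W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
z_value = standardized_statistic(path_W, [0, 0, 1, 1])
```

A strongly negative value indicates less cross-sample weight than random labeling would produce, which is what one expects when the two samples occupy different regions.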
major comments (2)
- [Asymptotics / Theorem on normality] Asymptotic normality and consistency section (theorems establishing T_n/σ_n → N(0,1) under H0 and divergence under H1): the central argument requires that the estimated Delaunay weight matrix Ŵ satisfies ||Ŵ − W|| = o_p(1/√n) (or an analogous rate) so that the perturbation to the quadratic form or U-statistic remains negligible. No quantitative error bounds, convergence rates for the manifold-learning step, or perturbation lemma are supplied to verify this condition holds under the stated high-dimensional regime.
- [Manifold learning procedure] Manifold learning and weight approximation procedure (the section detailing the computation of Ŵ): the procedure is described but supplies neither explicit rates on the geometric approximation error nor verification that the resulting error is small enough relative to 1/√n; in high dimensions this rate is typically slower, rendering the CLT claim load-bearing and unverified.
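To see why the rate in the first major comment matters, consider a hypothetical quadratic-form statistic $T_n(W) = z^\top W z$ with a label vector $z \in \mathbb{R}^n$ satisfying $\|z\|_2^2 \le n$. This form is an assumption made for illustration; the paper's statistic is not reproduced on this page, and $\sigma_n$ is treated as a fixed normalizer for simplicity.

```latex
\[
  \bigl| T_n(\hat W) - T_n(W) \bigr|
  = \bigl| z^\top (\hat W - W)\, z \bigr|
  \le \|\hat W - W\|_{\mathrm{op}} \, \|z\|_2^2
  \le n \, \|\hat W - W\|_{\mathrm{op}},
\]
\[
  \frac{T_n(\hat W)}{\sigma_n}
  = \frac{T_n(W)}{\sigma_n}
  + O_p\!\left( \frac{n \, \|\hat W - W\|_{\mathrm{op}}}{\sigma_n} \right).
\]
```

Under these assumptions, Slutsky's theorem carries the normal limit over from $W$ to $\hat W$ whenever $n\,\|\hat W - W\|_{\mathrm{op}} / \sigma_n = o_p(1)$. The required rate therefore depends on the choice of norm and on how $\sigma_n$ scales with $n$, which is why an explicit perturbation lemma is needed.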
minor comments (2)
- [Abstract] The abstract states that 'asymptotic normality ... are developed' but does not indicate whether the proofs are fully rigorous or rely on additional unstated assumptions; a brief pointer to the key technical conditions would improve clarity.
- [Methods / Weight definition] Notation for the Delaunay weight matrix W and its estimator Ŵ should be introduced with an explicit definition (e.g., via an equation) before the test statistic is defined, to avoid ambiguity when reading the asymptotic statements.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on the asymptotic claims and manifold learning procedure. We respond point by point below.
- Referee: [Asymptotics / Theorem on normality] Asymptotic normality and consistency section (theorems establishing T_n/σ_n → N(0,1) under H0 and divergence under H1): the central argument requires that the estimated Delaunay weight matrix Ŵ satisfies ||Ŵ − W|| = o_p(1/√n) (or an analogous rate) so that the perturbation to the quadratic form or U-statistic remains negligible. No quantitative error bounds, convergence rates for the manifold-learning step, or perturbation lemma are supplied to verify this condition holds under the stated high-dimensional regime.
Authors: We agree that the asymptotic normality result with estimated weights requires ||Ŵ − W|| = o_p(1/√n) to ensure the perturbation term is negligible. The theorems in the manuscript are stated under the assumption that the manifold-learning step achieves a rate sufficient for this condition. We did not supply explicit quantitative bounds or a dedicated perturbation lemma. In the revision we will add a lemma that bounds the effect of the weight estimation error on the test statistic and discuss the required rates under the low-dimensional manifold assumption.
Revision: yes
- Referee: [Manifold learning procedure] Manifold learning and weight approximation procedure (the section detailing the computation of Ŵ): the procedure is described but supplies neither explicit rates on the geometric approximation error nor verification that the resulting error is small enough relative to 1/√n; in high dimensions this rate is typically slower, rendering the CLT claim load-bearing and unverified.
Authors: The referee is correct that the manifold-learning section describes the algorithmic steps without explicit error rates or verification against the 1/√n threshold. We will revise the section to cite standard convergence results for manifold estimation (under fixed or slowly growing intrinsic dimension) and show that these rates meet the o_p(1/√n) requirement when the manifold dimension is controlled. Additional assumptions needed for the high-dimensional regime will be stated explicitly.
Revision: yes
Circularity Check
No significant circularity detected; derivation remains self-contained
The paper defines a Delaunay-weighted test statistic from a manifold-learning procedure to approximate weights, then states asymptotic normality under the null and consistency under the alternative. No step reduces a claimed prediction or first-principles result to its own inputs by construction, nor renames a fitted quantity as a prediction. No load-bearing self-citation chain or uniqueness theorem imported from the authors' prior work appears in the provided text. The manifold approximation is treated as an independent computational step whose error is assumed controlled, without the statistic itself being defined in terms of its own asymptotic properties. This is the typical non-circular case where the central claims rest on separate statistical arguments rather than tautological re-expression of fitted inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-dimensional data lie on a low-dimensional manifold
invented entities (1)
- Delaunay weight: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We propose to apply the Delaunay triangulation and develop the Delaunay weight to measure the geometric proximity among data points... asymptotic normality under the null and consistency under the alternative
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: high-dimensional data are typically assumed to lie on a low-dimensional manifold... Delaunay triangulation on a Riemannian manifold
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.