
arXiv: 2404.03198 · v2 · submitted 2024-04-04 · stat.ME

Delaunay Weighted Two-sample Test for High-dimensional Data by Incorporating Geometric Information

Pith reviewed 2026-05-24 02:13 UTC · model grok-4.3

classification: stat.ME
keywords: two-sample test, Delaunay triangulation, high-dimensional data, manifold learning, nonparametric statistics, geometric proximity, asymptotic normality, consistency

The pith

A nonparametric test statistic built from the Delaunay weight matrix is asymptotically normal under the null hypothesis of equal distributions, and the resulting test is consistent under alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-dimensional data are assumed to lie on a low-dimensional manifold, and the paper uses that structure to build a two-sample test. Instead of relying only on pairwise distances, it defines weights via Delaunay triangulation that capture both distance and directional information among points. A computational procedure estimates the manifold from the samples and approximates these weights. The resulting test statistic is proved to be asymptotically normal when the two groups come from the same distribution, and the test is proved to be consistent when the distributions differ. Simulations indicate robustness to manifold estimation error and extra power when the distributions differ in direction rather than in magnitude alone.

Core claim

The paper establishes a novel nonparametric test statistic constructed from the Delaunay weight matrix on the learned manifold; this statistic has asymptotic normality under the null hypothesis that the two high-dimensional samples arise from the same distribution and is consistent under the alternative that the distributions differ.

What carries the argument

Delaunay weight matrix obtained from triangulation on the estimated low-dimensional manifold, encoding geometric proximity that includes both distance and direction.
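
To make the idea of a weight that encodes direction as well as distance concrete, here is a minimal two-dimensional sketch. It is an illustration under stated assumptions, not the paper's construction: it assumes that the weights of a held-out point are its barycentric coordinates in the Delaunay triangle (built from the remaining points) that contains it, and it finds triangles by brute force using the empty ball property shown in Figure 1. The function names and the toy points are hypothetical.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared) of the circle through a, b, c.

    Returns None when the three points are (numerically) collinear.
    """
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points):
    """Brute-force Delaunay triangles: keep a triple if its circumcircle is empty."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        others = (p for m, p in enumerate(points) if m not in (i, j, k))
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-12 for px, py in others):
            triangles.append((i, j, k))
    return triangles


def barycentric(p, a, b, c):
    """Barycentric coordinates of p with respect to the triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def weights_for(query, points):
    """Assumed weight rule: barycentric coordinates in the containing triangle.

    Points outside the convex hull get an all-zero row in this sketch.
    """
    weights = [0.0] * len(points)
    for i, j, k in delaunay_triangles(points):
        coords = barycentric(query, points[i], points[j], points[k])
        if min(coords) >= -1e-12:  # query lies inside (or on the edge of) this triangle
            for idx, w in zip((i, j, k), coords):
                weights[idx] = w
            break
    return weights


# Toy data (hypothetical): four sample points and one held-out query point.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (5.0, 5.0)]
query = (2.0, 1.0)
weights = weights_for(query, points)
print(weights)  # [0.25, 0.5, 0.25, 0.0]
```

In this toy example the first two points are equally far from the query (both at distance √5), yet they receive weights 0.25 and 0.5. A similarity measure that depends on pairwise distance alone would have to treat them identically, which is the distinction the Delaunay weight is meant to capture. The brute-force search is only practical for a handful of points; it is used here to keep the sketch free of external dependencies.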

If this is right

  • The test gains power relative to distance-only methods when distribution differences have a directional component on the manifold.
  • Moderate inaccuracies in manifold recovery do not destroy the asymptotic guarantees or practical performance.
  • The procedure yields a usable test for real high-dimensional problems such as protein-expression comparisons.
  • Large-sample p-values can be obtained directly from the normal limit without resampling.
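
The last point can be sketched in a few lines. The snippet below assumes that a statistic T_n and its null standard deviation σ_n have already been computed by some other routine; the function name and the choice between a one-sided and a two-sided rejection region are assumptions made for illustration, since this summary does not say which region the paper uses.

```python
from statistics import NormalDist


def normal_p_value(t_n, sigma_n, two_sided=True):
    """Large-sample p-value for the standardized statistic t_n / sigma_n.

    Relies only on the N(0, 1) limit under the null; no resampling is involved.
    """
    if sigma_n <= 0:
        raise ValueError("sigma_n must be positive")
    z = t_n / sigma_n
    if two_sided:
        return min(1.0, 2.0 * (1.0 - NormalDist().cdf(abs(z))))
    return 1.0 - NormalDist().cdf(z)  # upper-tail (one-sided) alternative


# Hypothetical values of the statistic and its null standard deviation.
print(normal_p_value(1.96, 1.0))
print(normal_p_value(1.96, 1.0, two_sided=False))
```

A permutation test would replace the `NormalDist` call with a resampling loop; the normal limit avoids that cost, provided the approximation is adequate at the sample size in hand.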

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The directional sensitivity may make the approach useful for problems involving oriented data or shape differences.
  • The same weight construction could be inserted into other nonparametric procedures that currently rely on pairwise similarities.
  • Performance comparisons with alternative manifold approximations or triangulation methods would clarify the method's sensitivity to implementation choices.

Load-bearing premise

The observed high-dimensional points lie on a low-dimensional manifold whose structure can be recovered accurately enough from the sample to produce reliable Delaunay weights.

What would settle it

A dataset generated from a distribution without low-dimensional manifold support in which the Delaunay-weighted statistic fails to converge in distribution to normality under the null of equal samples.
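
One way to organize such a check is a Monte Carlo calibration experiment: draw both samples from the same distribution many times, apply the normal-limit test, and compare the empirical rejection rate with the nominal level. The harness below is a sketch under stated assumptions. In particular, `stand_in_statistic` is a placeholder (a Welch-type standardized mean difference on one-dimensional data), not the Delaunay-weighted statistic, and the sampler is a hypothetical standard normal; an actual experiment would plug in the paper's statistic and a distribution without low-dimensional manifold support.

```python
import math
import random
from statistics import NormalDist, mean, variance


def stand_in_statistic(x, y):
    """Placeholder standardized statistic; NOT the Delaunay-weighted statistic."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def null_rejection_rate(statistic, sampler, n, reps=1000, alpha=0.05, seed=0):
    """Fraction of null replications rejected by a two-sided normal-limit test."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(reps):
        # Both samples come from the same sampler, so the null hypothesis holds.
        x = [sampler(rng) for _ in range(n)]
        y = [sampler(rng) for _ in range(n)]
        if abs(statistic(x, y)) > critical:
            rejections += 1
    return rejections / reps


rate = null_rejection_rate(stand_in_statistic, lambda rng: rng.gauss(0.0, 1.0), n=50)
print(f"empirical size at nominal 0.05: {rate:.3f}")
```

A rejection rate that stays far from the nominal level as n grows would be evidence against the normal limit for the statistic under test.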

Figures

Figures reproduced from arXiv: 2404.03198 by Guosheng Yin, Jiaqi Gu, Ruoxu Tan.

Figure 1: (a) Graphical illustration of the empty ball property of the Delaunay triangulation
Figure 2: Graphical illustration of the Delaunay simplices (triangles)
Figure 3: The Delaunay weight matrix ΓZ computed from the Z in
Figure 4: Graphical illustration of the global advantage of the Delaunay weight matrix in
Figure 5: Graphical illustration of the stereographic projected
Figure 6: Rejection proportions of different implementations of the Delaunay weighted test
Figure 7: Low-dimensional Euclidean representations of protein expression level of 1077
Figure 8: The rejection proportion of our Delaunay weighted test and other approaches over
Original abstract

Two-sample hypothesis testing is a fundamental problem with various applications, which faces new challenges in the high-dimensional context. To mitigate the issue of the curse of dimensionality, high-dimensional data are typically assumed to lie on a low-dimensional manifold. To incorporate geometric information in the data, we propose to apply the Delaunay triangulation and develop the Delaunay weight to measure the geometric proximity among data points. In contrast to existing similarity measures that only utilize pairwise distances, the Delaunay weight can take both the distance and direction information into account. A detailed computation procedure is developed to learn the unknown manifold and approximate the Delaunay weight. We further propose a novel nonparametric test statistic using the Delaunay weight matrix. Asymptotic normality under the null and consistency under the alternative of the test statistic are developed. Applied on simulated data, the new test shows robustness to the learning of the unknown manifold and exhibits substantial power gain if the distributions differ in directions. The proposed test also shows great power on a real dataset of mice protein expression levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a nonparametric two-sample test for high-dimensional data assumed to lie on a low-dimensional manifold. It defines Delaunay weights via triangulation to incorporate both distance and directional geometric information (beyond pairwise distances), develops a manifold-learning procedure to estimate these weights from data, constructs a test statistic T_n from the estimated weight matrix, and claims to establish asymptotic normality of T_n/σ_n under the null and consistency under the alternative. Simulations indicate robustness to manifold estimation and power gains when alternatives differ in direction; an application to mice protein expression data is included.

Significance. If the asymptotic claims hold with estimated weights, the method would offer a geometrically aware nonparametric test that can detect directional differences in high-dimensional settings where standard distance-based approaches lose power, addressing a practical gap in manifold-aware inference.

major comments (2)
  1. [Asymptotics / Theorem on normality] Asymptotic normality and consistency section (theorems establishing T_n/σ_n → N(0,1) under H0 and divergence under H1): the central argument requires that the estimated Delaunay weight matrix Ŵ satisfies ||Ŵ − W|| = o_p(1/√n) (or an analogous rate) so that the perturbation to the quadratic form or U-statistic remains negligible. No quantitative error bounds, convergence rates for the manifold-learning step, or perturbation lemma are supplied to verify this condition holds under the stated high-dimensional regime.
  2. [Manifold learning procedure] Manifold learning and weight approximation procedure (the section detailing the computation of Ŵ): the procedure is described but supplies neither explicit rates on the geometric approximation error nor verification that the resulting error is small enough relative to 1/√n; in high dimensions this rate is typically slower, rendering the CLT claim load-bearing and unverified.
minor comments (2)
  1. [Abstract] The abstract states that 'asymptotic normality ... are developed' but does not indicate whether the proofs are fully rigorous or rely on additional unstated assumptions; a brief pointer to the key technical conditions would improve clarity.
  2. [Methods / Weight definition] Notation for the Delaunay weight matrix W and its estimator Ŵ should be introduced with an explicit definition (e.g., via an equation) before the test statistic is defined, to avoid ambiguity when reading the asymptotic statements.
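
The rate condition in major comment 1 can be read through a Slutsky-type decomposition. The display below is a sketch, not a result taken from the paper. The symbols T̂_n (statistic computed with Ŵ), Δ_n, and the Lipschitz-type factor L_n are notation introduced here for illustration, and the bound on |Δ_n| is a hypothetical assumption about how the statistic depends on the weights.

```latex
% Illustrative notation: T_n uses the exact weights W,
% \hat{T}_n uses the estimated weights \hat{W}.
\begin{align*}
\frac{\hat{T}_n}{\sigma_n}
  &= \frac{T_n}{\sigma_n} + \frac{\Delta_n}{\sigma_n},
  \qquad \Delta_n := \hat{T}_n - T_n, \\
\frac{T_n}{\sigma_n} \xrightarrow{d} N(0,1)
  \ \text{and}\ \frac{\Delta_n}{\sigma_n} = o_p(1)
  &\;\Longrightarrow\; \frac{\hat{T}_n}{\sigma_n} \xrightarrow{d} N(0,1)
  \quad \text{(Slutsky)}, \\
|\Delta_n| \le L_n \,\lVert \hat{W} - W \rVert
  &\;\Longrightarrow\; \frac{\Delta_n}{\sigma_n} = o_p(1)
  \ \text{whenever}\ \frac{L_n}{\sigma_n}\,\lVert \hat{W} - W \rVert = o_p(1).
\end{align*}
```

If, for instance, L_n/σ_n grows like √n, the last condition reduces to ||Ŵ − W|| = o_p(1/√n), which is the rate the referee asks the authors to verify. A different scaling of L_n/σ_n would change the required rate accordingly.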

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the asymptotic claims and manifold learning procedure. We respond point by point below.

  1. Referee: [Asymptotics / Theorem on normality] Asymptotic normality and consistency section (theorems establishing T_n/σ_n → N(0,1) under H0 and divergence under H1): the central argument requires that the estimated Delaunay weight matrix Ŵ satisfies ||Ŵ − W|| = o_p(1/√n) (or an analogous rate) so that the perturbation to the quadratic form or U-statistic remains negligible. No quantitative error bounds, convergence rates for the manifold-learning step, or perturbation lemma are supplied to verify this condition holds under the stated high-dimensional regime.

    Authors: We agree that the asymptotic normality result with estimated weights requires ||Ŵ − W|| = o_p(1/√n) to ensure the perturbation term is negligible. The theorems in the manuscript are stated under the assumption that the manifold-learning step achieves a rate sufficient for this condition. We did not supply explicit quantitative bounds or a dedicated perturbation lemma. In the revision we will add a lemma that bounds the effect of the weight estimation error on the test statistic and discuss the required rates under the low-dimensional manifold assumption. revision: yes

  2. Referee: [Manifold learning procedure] Manifold learning and weight approximation procedure (the section detailing the computation of Ŵ): the procedure is described but supplies neither explicit rates on the geometric approximation error nor verification that the resulting error is small enough relative to 1/√n; in high dimensions this rate is typically slower, rendering the CLT claim load-bearing and unverified.

    Authors: The referee is correct that the manifold-learning section describes the algorithmic steps without explicit error rates or verification against the 1/√n threshold. We will revise the section to cite standard convergence results for manifold estimation (under fixed or slowly growing intrinsic dimension) and show that these rates meet the o_p(1/√n) requirement when the manifold dimension is controlled. Additional assumptions needed for the high-dimensional regime will be stated explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained

The paper defines a Delaunay-weighted test statistic from a manifold-learning procedure to approximate weights, then states asymptotic normality under the null and consistency under the alternative. No step reduces a claimed prediction or first-principles result to its own inputs by construction, nor renames a fitted quantity as a prediction. No load-bearing self-citation chain or uniqueness theorem imported from the authors' prior work appears in the provided text. The manifold approximation is treated as an independent computational step whose error is assumed controlled, without the statistic itself being defined in terms of its own asymptotic properties. This is the typical non-circular case where the central claims rest on separate statistical arguments rather than tautological re-expression of fitted inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that data reside on a learnable low-dimensional manifold and on the ad-hoc construction of the Delaunay weight; no free parameters or invented entities beyond the weight itself are mentioned in the abstract.

axioms (1)
  • domain assumption: High-dimensional data lie on a low-dimensional manifold.
    Explicitly stated in the abstract as the premise used to mitigate the curse of dimensionality.
invented entities (1)
  • Delaunay weight (no independent evidence)
    purpose: Measure geometric proximity among data points using both distance and direction via triangulation.
    Newly defined quantity introduced to replace pairwise-distance similarity measures.



