pith. sign in

arxiv: 2305.00578 · v3 · pith:KVAWN2HWnew · submitted 2023-04-30 · 📊 stat.ME

High-Dimensional Clustering via Nearest-Neighbor Asymmetry

Pith reviewed 2026-05-24 08:37 UTC · model grok-4.3

classification 📊 stat.ME
keywords high-dimensional clusteringnearest-neighbor graphasymmetric separationdispersion differencespermutation standardizationdirected graph clusteringscale separationgene expression clustering
0
0 comments X

The pith

Clustering high-dimensional data succeeds by scoring partitions on asymmetric neighbor counts in a directed k-nearest-neighbor graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that groups in high dimensions can separate through differences in dispersion rather than location, producing one-way neighbor patterns that a directed graph records. NAC builds this graph once, then scores candidate partitions with two permutation-standardized quantities: a weighted count of within-cluster edges and a contrast that measures directed imbalance between clusters. Their combination adapts automatically to mean separation, scale separation, or both, without fitting a mixture model or projecting to low dimensions. Simulations confirm competitive performance on location differences and clear gains when asymmetry dominates; gene-expression examples illustrate the same behavior on real small-sample data.

Core claim

The NAC objective, formed by adding a standardized within-edge enrichment statistic and a standardized contrast statistic on the directed kNN graph, recovers the correct partition by targeting complementary nearest-neighbor patterns that arise under both location and scale separation; population analysis shows each statistic isolates its target signal after permutation centering.

What carries the argument

The directed k-nearest-neighbor graph together with its two permutation-standardized statistics (weighted within-edge count and directed contrast) that are summed to form the clustering objective.

If this is right

  • NAC remains competitive with standard methods when clusters differ only in location.
  • NAC gains clear advantage precisely when nearest-neighbor asymmetry induced by scale differences is the main signal.
  • The method requires neither a parametric mixture model nor an explicit low-dimensional embedding.
  • The same two standardized statistics apply directly to small-sample high-dimensional problems such as gene-expression matrices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The permutation-based standardization may allow the same machinery to flag other forms of local heterogeneity beyond scale, such as differing tail behavior.
  • Because the graph is built once, NAC could serve as a fast post-processing step after any initial embedding that preserves local distances.
  • In regimes where multiple separation mechanisms coexist, the relative weight of the two statistics could be tuned by cross-validation on held-out edge counts.

Load-bearing premise

The directed kNN graph must carry the dominant separation signal in its asymmetric neighbor counts, and permutation standardization must remove finite-sample noise without needing Euclidean well-separation.

What would settle it

Generate data from two components that differ only in dispersion; if the NAC objective selects the true labels no more often than a random partition across repeated draws, the central claim fails.

Figures

Figures reproduced from arXiv: 2305.00578 by Hao Chen, Xiancheng Lin.

Figure 1
Figure 1. Figure 1: Examples of different shapes of two-dimensional data, with different colors representing [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Clustering results based on the standard [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Plots of 𝑝11 (black) and 𝑝22 (blue) considering 𝑘 nearest neighbors under different settings. The dashed lines represent the baseline when the two clusters are from the same distribution 𝑝 (0) 11 = (𝑚 − 1)/(𝑁 − 1) (black dashed line) and 𝑝 (0) 22 = (𝑛 − 1)/(𝑁 − 1) (blue dashed line). The two dashed lines overlap when 𝑚 = 𝑛. When the two distributions differ in scale, 𝑝11 is larger than 𝑝 (0) 11 , but 𝑝22 i… view at source ↗
Figure 4
Figure 4. Figure 4: Values of 𝑍𝑤, 𝑍𝑑, and 𝑀𝜅 (top panel) and mis-clustering rate (bottom panel) over 𝑘. 𝑚 = 𝑛 = 50. varying 𝜆 from 0 to 1 in different settings. In two-sample testing, Zhu & Chen (2021) suggested using 𝑘 = [𝑁 0.5 ] with the 𝑘-MST, while Zhou & Chen (2022) suggested 𝜆 = 0.65 for the 𝑘-NN graph. Here, we explore the relation between 𝑘 and mis-clustering rate. In particular, we aim to find out whether the 𝑘 that … view at source ↗
Figure 5
Figure 5. Figure 5: Values of 𝑍𝑤, 𝑍𝑑, and 𝑀𝜅 (top panel) and mis-clustering rate (bottom panel) over 𝑘. 𝑚 = 30, 𝑛 = 70. 2.6. Choice of 𝜅 From Section 2.4, max𝑥 𝑍𝑤 (𝑥) can be used to cluster scenario (1), and max𝑥 𝑍𝑑 (𝑥) is suitable for scenarios (2) and (3). Since we don’t know which scenario we are dealing with in reality, 𝑀𝜅 (𝑥) = max(𝑍𝑤 (𝑥), 𝜅𝑍𝑑 (𝑥)) was proposed for these three scenarios. Ideally, when the scenario is (1)… view at source ↗
Figure 6
Figure 6. Figure 6: Plots of ratio (𝑟 = 𝑍𝑤/𝑍𝑑) versus signal strength. Left panel: location difference with 𝑏 = 1, and 𝑎 increasing from 0.05 to 0.3. The blue dashed line at 𝑟 = 1.8 represents the minimum value of the boxplot when 𝑎 = 0.2. The “mis rate” indicates the mean mis-clustering rate when using 𝑍𝑤 as the criterion, with green representing higher errors and orange indicating lower errors. Right panel: scale difference… view at source ↗
Figure 7
Figure 7. Figure 7: Mis-clustering rates for Gaussian mixtures [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Mis-clustering rates for multivariate 𝑡-distribution mixtures 𝑡20 (0, Σ) and 𝑡20 (𝑎1, 𝑏Σ) [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
read the original abstract

High-dimensional clustering often relies on geometric or local-similarity structure, but the dominant separation between groups may not always be location-based. Differences in dispersion can create asymmetric local-neighborhood patterns: points from a more dispersed component may be closer to points in a more concentrated component than to points from their own component. We turn this high-dimensional phenomenon into a clustering principle. The proposed method, NAC (Nearest-neighbor Asymmetry Clustering), constructs a directed $k$-nearest-neighbor graph and evaluates candidate partitions using two permutation-standardized statistics: a weighted within-edge statistic that captures overall within-cluster enrichment and a contrast statistic that captures asymmetric separation. The resulting objective combines these two standardized signals, allowing the method to adapt to different separation regimes without specifying a mixture model or a low-dimensional representation. We provide a population-level analysis showing how the two statistics target complementary nearest-neighbor patterns. Simulation studies across mean, scale, and combined location-scale differences show that NAC is competitive under location separation and especially effective when nearest-neighbor asymmetry is present; gene-expression applications further illustrate its usefulness in small-sample, high-dimensional clustering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the NAC method for high-dimensional clustering. It constructs a directed k-nearest-neighbor graph from the data and evaluates candidate partitions via an objective that combines two permutation-standardized statistics: a weighted within-edge statistic capturing within-cluster enrichment and a contrast statistic capturing asymmetric separation. A population-level analysis is provided to show that these statistics target complementary nearest-neighbor patterns. Simulations across mean, scale, and combined location-scale separations, together with gene-expression applications, are used to demonstrate competitiveness under location separation and particular effectiveness when nearest-neighbor asymmetry is present.

Significance. If the central claims hold, the work is significant because it supplies a model-free clustering principle that exploits dispersion-induced asymmetry in high dimensions without requiring a mixture model or low-dimensional embedding. The explicit population-level analysis demonstrating complementary targeting by the two statistics is a clear strength, as are the simulation studies that systematically vary separation regimes and the real-data illustrations. These elements together position NAC as a practical addition for small-sample, high-dimensional settings such as genomics. The permutation-standardization concern raised in the stress-test note does not appear to land as a load-bearing inconsistency on the basis of the provided description, because the standardization is presented as a device to isolate the asymmetry signal rather than as a data-dependent tuning step.

minor comments (2)
  1. The abstract packs the description of the two statistics and the objective into a single long sentence; splitting this material would improve readability without altering content.
  2. Notation for the two standardized statistics (e.g., symbols for the weighted within-edge and contrast quantities) should be introduced once in the main text and used consistently thereafter.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive evaluation of our manuscript. The summary and significance assessment accurately reflect the paper's contributions. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines NAC via a directed kNN graph followed by two permutation-standardized statistics whose population-level behavior is analyzed directly; the standardization step normalizes against a permutation null on the observed graph rather than fitting a parameter that is then re-used as a prediction. No self-definitional reduction, fitted-input-called-prediction, or load-bearing self-citation chain appears in the derivation. The objective is constructed from the two standardized signals without renaming a known result or smuggling an ansatz through citation. The method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, invented entities, or non-standard axioms beyond the usual assumption that the observed nearest-neighbor counts reflect the underlying separation mechanism.

pith-pipeline@v0.9.0 · 5714 in / 1194 out tokens · 20259 ms · 2026-05-24T08:37:05.941029+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid.sentence := #2 '...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    , Barkai, N

    Alon, U. , Barkai, N. , Notterman, D. A. , Gish, K. , Ybarra, S. , Mack, D. & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 96, 6745--6750

  4. [4]

    Bajwa, M. S. , Agarwal, A. P. & Manchanda, S. (2015). Ternary search algorithm: Improvement of binary search. In 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom). IEEE

  5. [5]

    , Richards, W

    Bhattacharjee, A. , Richards, W. G. , Staunton, J. , Li, C. , Monti, S. , Vasa, P. , Ladd, C. , Beheshti, J. , Bueno, R. , Gillette, M. et al. (2001). Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences 98, 13790--13795

  6. [6]

    Chan, P. K. , Schlag, M. D. & Zien, J. Y. (1993). Spectral k-way ratio-cut partitioning and clustering. In Proceedings of the 30th international Design Automation Conference

  7. [7]

    & Friedman, J

    Chen, H. & Friedman, J. H. (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American statistical association 112, 397--409

  8. [8]

    Dettling, M. (2004). Bagboosting for tumor classification with gene expression data. Bioinformatics 20, 3583--3593

  9. [9]

    & Jin, J

    Donoho, D. & Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences 105, 14790--14795

  10. [10]

    Golub, T. R. , Slonim, D. K. , Tamayo, P. , Huard, C. , Gaasenbeek, M. , Mesirov, J. P. , Coller, H. , Loh, M. L. , Downing, J. R. , Caligiuri, M. A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science 286, 531--537

  11. [11]

    Gordon, G. J. , Jensen, R. V. , Hsiao, L.-L. , Gullans, S. R. , Blumenstock, J. E. , Ramaswamy, S. , Richards, W. G. , Sugarbaker, D. J. & Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer research 62, 4963--4967

  12. [12]

    , Tibshirani, R

    Hastie, T. , Tibshirani, R. , Friedman, J. H. & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer

  13. [13]

    Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. The Annals of Statistics 16, 772--783

  14. [14]

    & Wang, W

    Jin, J. & Wang, W. (2016). Influential features pca for high dimensional clustering. The Annals of Statistics 44, 2323--2359

  15. [15]

    Leighton, F. T. & Rao, S. (1988). An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In FOCS, vol. 88

  16. [16]

    & Chen, H

    Liu, Y.-W. & Chen, H. (2022). A fast and efficient change-point detection framework based on approximate k -nearest neighbor graphs. IEEE Transactions on Signal Processing 70, 1976--1986

  17. [17]

    Maaten, L. v. d. & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research 9, 2579--2605

  18. [18]

    , Soares, J

    Mwangi, B. , Soares, J. C. & Hasan, K. M. (2014). Visualization and unsupervised predictive clustering of high-dimensional multimodal neuroimaging data. Journal of neuroscience methods 236, 19--25

  19. [19]

    , Chatterjee, S

    Rohe, K. , Chatterjee, S. & Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel

  20. [20]

    Schilling, M. F. (1986). Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association 81, 799--806

  21. [21]

    & Malik, J

    Shi, J. & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22, 888--905

  22. [22]

    , Febbo, P

    Singh, D. , Febbo, P. G. , Ross, K. , Jackson, D. G. , Manola, J. , Ladd, C. , Tamayo, P. , Renshaw, A. A. , D'Amico, A. V. , Richie, J. P. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell 1, 203--209

  23. [23]

    & Verma, N

    Singh, V. & Verma, N. K. (2022). Gene expression data analysis using feature weighted robust fuzzy-means clustering. IEEE Transactions on NanoBioscience 22, 99--105

  24. [24]

    Su, A. I. , Welsh, J. B. , Sapinoso, L. M. , Kern, S. G. , Dimitrov, P. , Lapp, H. , Schultz, P. G. , Powell, S. M. , Moskaluk, C. A. , Frierson Jr, H. F. et al. (2001). Molecular classification of human carcinomas by use of gene expression signatures. Cancer research 61, 7388--7393

  25. [25]

    Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing 17, 395--416

  26. [26]

    , Klijn, J

    Wang, Y. , Klijn, J. G. , Zhang, Y. , Sieuwerts, A. M. , Look, M. P. , Yang, F. , Talantov, D. , Timmermans, M. , Meijer-van Gelder, M. E. , Yu, J. et al. (2005). Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. The Lancet 365, 671--679

  27. [27]

    & Leahy, R

    Wu, Z. & Leahy, R. (1993). An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE transactions on pattern analysis and machine intelligence 15, 1101--1113

  28. [28]

    Yousefi, M. R. , Hua, J. , Sima, C. & Dougherty, E. R. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26, 68--76

  29. [29]

    & Chen, H

    Zhang, Y. & Chen, H. (2021). Graph-based multiple change-point detection. arXiv preprint arXiv:2110.01170

  30. [30]

    & Chen, H

    Zhou, D. & Chen, H. (2022). Ring-cpd: Asymptotic distribution-free change-point detection for multivariate and non-euclidean data. arXiv preprint arXiv:2206.03038

  31. [31]

    & Chen, H

    Zhu, Y. & Chen, H. (2021). Limiting distributions of graph-based test statistics. arXiv preprint arXiv:2108.07446