pith. machine review for the scientific record.

arxiv: 2605.00598 · v1 · submitted 2026-05-01 · 📊 stat.ME


Sparse K-spatial-median clustering for high-dimensional data


Pith reviewed 2026-05-09 19:15 UTC · model grok-4.3

classification 📊 stat.ME
keywords clustering · high-dimensional data · spatial median · feature exclusion · robust statistics · K-means · heavy tails · permutation Gap

The pith

A clustering method using spatial medians and hard feature exclusion improves stability for high-dimensional heavy-tailed data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a modified K-means procedure that swaps arithmetic means for spatial medians when updating cluster centers, giving resistance to heavy tails and outliers. For high-dimensional cases it adds a hard exclusion step that drops dimensions showing little separation across centers, with the cutoff chosen automatically by a permutation Gap statistic. Assignment can stay Euclidean or switch to a metric built from the spatial sign covariance to handle scale differences and dependence. A sympathetic reader would care because many data sets in genomics, finance, or imaging combine exactly these features—outliers, correlations, and mostly uninformative variables—and standard K-means or sparse K-means often become unstable under them.
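To make the mechanics concrete, here is a minimal sketch of the core loop: Lloyd-style alternation with a Euclidean assignment step and spatial-median center updates computed by Weiszfeld iteration. The initialization and convergence details below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def spatial_median(X, tol=1e-8, max_iter=200):
    """Spatial (geometric) median of the rows of X via Weiszfeld iteration."""
    m = X.mean(axis=0)  # warm start from the arithmetic mean
    for _ in range(max_iter):
        d = np.maximum(np.linalg.norm(X - m, axis=1), tol)  # avoid divide-by-zero
        w = 1.0 / d
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m

def k_spatial_median(X, k, n_iter=50, seed=0):
    """Lloyd-style loop: Euclidean assignment, spatial-median center updates."""
    rng = np.random.default_rng(seed)
    # farthest-point initialization (an illustrative choice, not the paper's)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each point to the nearest current center
        labels = np.linalg.norm(X[:, None, :] - centers[None], axis=2).argmin(axis=1)
        new_centers = np.array([
            spatial_median(X[labels == j]) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The only change from textbook Lloyd iteration is the update step, which is why the robustness gain composes cleanly with the exclusion mechanism described below.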

Core claim

The central claim is that three changes to Lloyd's algorithm, taken together, yield clustering accuracy competitive with K-means and sparse K-means while delivering visibly higher stability in simulations drawn from correlated Gaussian and multivariate t distributions: replacing the mean updates with spatial medians, permitting either a Euclidean or a spatial-sign-covariance-based assignment rule, and applying a dispersion-based hard feature-exclusion rule whose threshold is set by a permutation Gap criterion.

What carries the argument

The hard feature-exclusion rule, which removes dimensions whose across-center dispersion falls below a permutation-selected Gap threshold, supplies the sparsity mechanism while the spatial-median updates supply the robustness.
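One way that mechanism could look, sketched under explicit assumptions: the dispersion statistic here is the squared deviation of per-cluster medians from the overall median, and the null is built by permuting labels. The paper's exact statistic and permutation scheme are not reproduced.

```python
import numpy as np

def across_center_dispersion(X, labels):
    """Per-feature squared dispersion of cluster medians around the overall
    median. A stand-in for the paper's statistic, which isn't reproduced here."""
    overall = np.median(X, axis=0)
    centers = np.array([np.median(X[labels == j], axis=0)
                        for j in np.unique(labels)])
    return ((centers - overall) ** 2).sum(axis=0)

def permutation_threshold(X, labels, n_perm=100, q=0.95, seed=0):
    """Per-feature null quantile from randomly permuted labels: one possible
    realization of a permutation Gap null, not the paper's exact scheme."""
    rng = np.random.default_rng(seed)
    null = np.array([across_center_dispersion(X, rng.permutation(labels))
                     for _ in range(n_perm)])
    return np.quantile(null, q, axis=0)

def hard_exclude(X, labels, **kwargs):
    """Boolean mask of features whose dispersion clears the null threshold."""
    return across_center_dispersion(X, labels) > permutation_threshold(X, labels, **kwargs)
```

In this realization a feature is kept only when its across-center dispersion exceeds what random labelings produce, which is the sense in which the threshold is automatic rather than user-tuned.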

If this is right

  • Clustering accuracy remains competitive with standard K-means and sparse K-means under both Gaussian and heavy-tailed models.
  • Stability across repeated runs or slight data perturbations improves relative to the baselines.
  • The automatic exclusion step removes irrelevant dimensions without requiring separate variable-selection tuning.
  • The assignment step can incorporate a robust Mahalanobis-type distance when feature scales or dependence matter.
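The robust Mahalanobis-type option in the last bullet rests on the spatial sign covariance matrix, which averages outer products of unit-normalized centered observations. A minimal sketch follows; how the paper rescales or regularizes this matrix in the p ≫ n regime is left open here and is an assumption of the sketch.

```python
import numpy as np

def spatial_sign_covariance(X, center):
    """Spatial sign covariance matrix: average outer product of the
    unit-normalized, centered observations (spatial signs)."""
    D = X - center
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    U = D / np.where(norms == 0, 1.0, norms)  # points at the center map to 0
    return U.T @ U / len(X)

def robust_mahalanobis(x, center, S_inv):
    """Mahalanobis-type distance built from the inverse spatial sign
    covariance; the paper's exact metric construction may differ."""
    d = x - center
    return float(np.sqrt(d @ S_inv @ d))
```

Because every spatial sign has unit norm, the matrix has trace one regardless of the data's tails, which is the source of its robustness to heavy-tailed scales.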

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spatial-median center updates could be inserted into other partitioning algorithms to gain similar robustness without changing the overall framework.
  • Because exclusion is driven by dispersion across centers, the procedure implicitly ranks variables by their contribution to separation and could therefore feed directly into post-clustering interpretation.
  • If the Gap criterion works for exclusion, the same permutation idea might be used to choose the number of clusters K in the same run.

Load-bearing premise

The across-center dispersion, paired with the permutation Gap threshold, reliably keeps signal-bearing dimensions and discards irrelevant ones even when features are correlated and tails are heavy.

What would settle it

A controlled simulation with planted structure would settle it: if the method excludes a known separating feature, or retains most of the known noise features, the exclusion rule has failed.
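A harness for that kind of settling test also needs an agreement measure for the stability claims; a generic choice is the adjusted Rand index between clusterings of the original data and a slightly perturbed copy. The perturbation scale and replication count below are illustrative choices, and the paper's own stability measure is not specified here.

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors (chance-corrected)."""
    a, b = np.asarray(a), np.asarray(b)
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    C = np.zeros((ia.max() + 1, ib.max() + 1), dtype=int)
    np.add.at(C, (ia, ib), 1)  # contingency table of the two partitions
    comb2 = lambda x: x * (x - 1) / 2.0
    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:  # degenerate partitions
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))

def perturbation_stability(cluster_fn, X, n_rep=5, scale=0.05, seed=0):
    """Mean ARI between the clustering of X and clusterings of noisy copies.
    A generic stability proxy, not the paper's exact measure."""
    rng = np.random.default_rng(seed)
    base = cluster_fn(X)
    return float(np.mean([
        adjusted_rand_index(base, cluster_fn(X + rng.normal(0.0, scale, X.shape)))
        for _ in range(n_rep)
    ]))
```

A stability score near 1 under small perturbations, combined with correct retention of planted signal features, is what the paper's central claim predicts.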

Figures

Figures reproduced from arXiv: 2605.00598 by Dan Zhuang, Long Feng, Ping Zhao.

Figure 1. Framework of the proposed methods. Module 2 is the shared spatial-median.
Figure 2. ARI values of different methods across varying dimensions.
Figure 3. ARI values of competing methods under different contamination mechanisms.
Figure 4. Comparison of clustering performance on the mice protein data for the Control.
Original abstract

We propose a robust clustering framework for high-dimensional data with heavy tails and a large fraction of irrelevant variables. The method replaces the mean updates of Lloyd's $K$-means with \emph{spatial medians} to enhance robustness. For the assignment step, it admits either a Euclidean rule for computational simplicity or a robust Mahalanobis-type metric constructed from the spatial sign covariance matrix to account for heterogeneous scales and feature dependence. To handle the $p \gg n$ regime, we further introduce a simple \emph{hard feature-exclusion} mechanism that removes weakly separating dimensions based on across-center dispersion, with the exclusion threshold selected automatically via a permutation-based Gap criterion. Simulation studies under correlated Gaussian and multivariate $t$ models demonstrate that the proposed approach provides competitive clustering accuracy and improved stability relative to $K$-means and sparse $K$-means baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a robust clustering framework called sparse K-spatial-median clustering for high-dimensional data. It replaces mean updates in K-means with spatial medians for robustness to heavy tails, uses either Euclidean or robust Mahalanobis-type assignment based on the spatial sign covariance matrix, and introduces hard feature exclusion based on across-center dispersion with the threshold selected via a permutation Gap criterion. Simulations under correlated Gaussian and multivariate t models report competitive clustering accuracy and improved stability relative to K-means and sparse K-means baselines.

Significance. If the performance claims hold after addressing the concerns below, the work would provide a practical robust alternative for clustering high-dimensional data with outliers, heavy tails, and many irrelevant features. The automatic feature-exclusion step and emphasis on stability could be useful in applications such as genomics or imaging, where reproducibility matters. The simulations offer initial supporting evidence, though they require more detail to fully assess generalizability.

major comments (2)
  1. [Feature exclusion and Gap criterion] The hard feature-exclusion rule based on across-center dispersion, with threshold chosen by the permutation Gap criterion, may fail to preserve weak signal features under the paper's own correlated models. Standard permutations break the dependence structure in the correlated Gaussian and t distributions used for simulation, producing an incorrect null for the Gap statistic. This risks excluding informative dimensions or retaining noise, which directly affects the central claim of improved stability (see simulation setup and method description).
  2. [Simulation studies] The simulation studies report favorable accuracy and stability but provide no information on the number of replications, standard errors or variability measures for the reported metrics, or exact parameter values (e.g., correlation coefficients, degrees of freedom for the t distribution). This makes it impossible to determine whether the observed gains are reliable or could be artifacts of the specific design.
minor comments (3)
  1. [Abstract] Ensure consistent use of terminology between the title ('Sparse K-spatial-median clustering') and the abstract description.
  2. Add pseudocode or a clear algorithmic outline for the full procedure, including how the spatial sign covariance is estimated in the p ≫ n regime.
  3. Clarify the exact formula for the across-center dispersion statistic used to rank features for exclusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Feature exclusion and Gap criterion] The hard feature-exclusion rule based on across-center dispersion, with threshold chosen by the permutation Gap criterion, may fail to preserve weak signal features under the paper's own correlated models. Standard permutations break the dependence structure in the correlated Gaussian and t distributions used for simulation, producing an incorrect null for the Gap statistic. This risks excluding informative dimensions or retaining noise, which directly affects the central claim of improved stability (see simulation setup and method description).

    Authors: We agree that standard permutations destroy the correlation structure present in the simulated data, which is a recognized limitation of permutation-based procedures under dependence. The Gap criterion is applied to the across-center dispersion measure to select an automatic threshold for hard feature exclusion, with the goal of removing dimensions that do not contribute meaningfully to separation. While the referee's concern is valid in principle, the simulations under the paper's correlated Gaussian and multivariate-t settings show that the procedure retains competitive accuracy and improves stability relative to baselines. To address the point directly, the revised manuscript will include an explicit discussion of the exchangeability assumption underlying the permutation null and its potential impact on weak-signal retention, together with a brief sensitivity analysis using an alternative resampling scheme that approximately preserves pairwise correlations. revision: partial

  2. Referee: [Simulation studies] The simulation studies report favorable accuracy and stability but provide no information on the number of replications, standard errors or variability measures for the reported metrics, or exact parameter values (e.g., correlation coefficients, degrees of freedom for the t distribution). This makes it impossible to determine whether the observed gains are reliable or could be artifacts of the specific design.

    Authors: We acknowledge that the simulation section is missing these essential details. The revised manuscript will report the number of Monte Carlo replications, standard errors (or interquartile ranges) for all accuracy and stability metrics, and the precise parameter values used, including the correlation coefficient and degrees of freedom for the multivariate-t model. These additions will allow readers to assess the reliability of the reported improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic proposal relies on external Gap criterion and empirical simulations

full rationale

The paper describes a clustering algorithm replacing K-means means with spatial medians, optionally using a spatial-sign covariance for assignments, and applying hard feature exclusion whose threshold is set by the established permutation Gap statistic (external to the paper). No derivation, theorem, or prediction is claimed that reduces by the paper's own equations to a fitted parameter or self-citation. Simulations under specified models supply the performance claims without any load-bearing step that is definitionally equivalent to its inputs. This is the common case of a self-contained algorithmic contribution validated externally.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the known robustness properties of spatial medians and the validity of the Gap statistic for selecting an exclusion threshold; no new entities are postulated and no parameters are fitted by hand beyond the automatic selection rule.

free parameters (1)
  • feature-exclusion threshold
    Chosen automatically by the permutation Gap criterion rather than fixed by the user, but still constitutes a data-dependent tuning step in the procedure.
axioms (2)
  • domain assumption Spatial medians are robust to heavy tails and outliers in the assignment and update steps
    Invoked to justify replacement of means; standard property of spatial medians assumed without new proof.
  • domain assumption The across-center dispersion measure combined with the Gap criterion correctly identifies irrelevant dimensions
    Central to the sparse mechanism; treated as reliable based on prior literature on the Gap statistic.

pith-pipeline@v0.9.0 · 5439 in / 1317 out tokens · 50000 ms · 2026-05-09T19:15:19.284746+00:00 · methodology

discussion (0)

