Statistical Properties of $k$-means Clustering for Data Missing Completely at Random

Xin Guan

arxiv: 2607.01945 · v1 · pith:33EEZZLDnew · submitted 2026-07-02 · 📊 stat.ML · cs.LG

Statistical Properties of k-means Clustering for Data Missing Completely at Random

Xin Guan This is my paper

Pith reviewed 2026-07-03 06:10 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords k-meansmissing dataMCARcluster centersconvergence rateasymptotic normalityexcess riskstatistical properties

0 comments

The pith

k-means cluster centers estimated from MCAR missing data achieve √n convergence and asymptotic normality, recovering true centers under a separation condition when centers differ in every dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves consistency and √n-excess risk bounds for k-means centers under general missing mechanisms. Under missing completely at random, it further establishes the √n rate, asymptotic normality, and conditions for convergence to the centers of the fully observed data. A reader would care because these guarantees justify using k-means on incomplete data sets that arise in practice, while exposing a dimension-wise distinctness requirement that restricts high-dimensional applications. The work supports the theoretical use of incomplete-data k-means by linking missing probability, cluster separation, and recovery of the original centers.

Core claim

Under the MCAR mechanism, the estimated cluster centers achieve the √n-convergence rate and asymptotic normality; moreover, they converge to the true cluster centers of the fully observed data when a sufficient condition on the missing probability and the separation among true clusters holds. The analysis shows both the rate and the convergence require the k true centers to be distinct in every dimension.

What carries the argument

The excess-risk analysis and asymptotic expansion of the k-means objective on incomplete observations under MCAR, which isolates the effect of dimension-wise center separation.

If this is right

The estimated centers remain consistent under general missing mechanisms.
A √n-excess risk bound holds for the incomplete-data k-means objective.
Asymptotic normality of the centers follows under MCAR.
Recovery of the original centers occurs once missing probability and cluster separation satisfy the stated sufficient condition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In high dimensions the distinctness requirement will often fail, so the √n guarantees may not apply without dimension selection or projection.
The results suggest comparing imputation-based k-means against the direct incomplete-data version by checking whether the former relaxes the per-dimension separation demand.
Practitioners could test the condition by verifying pairwise center differences across all coordinates before relying on the asymptotic rates.

Load-bearing premise

The k true centers must be distinct in every dimension.

What would settle it

A data set or simulation in which two centers coincide in at least one coordinate yet the MCAR estimator still attains the √n rate and converges to the full-data centers would falsify the necessity of that condition.

Figures

Figures reproduced from arXiv: 2607.01945 by Xin Guan.

**Figure 1.** Figure 1: The convergence of mˆ n to m∗ (log-log scale) under different missing mechanisms and missing proportions. The red dashed line is the line with slope being -0.5, implying the theoretical trend. 4.1.1 Excess risk bound To verify the Op(n −1/2 ) excess risk bound, we consider n ∈ {300, 900, 1500, 3000, 6000, 9000, 12000, 15000} and the number of repetitions T = 100. We calculate the average of {mˆ n,1, . . . … view at source ↗

**Figure 2.** Figure 2: The consistency of estimated cluster centers under different missing mechanisms and missing proportions. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: The log-log scale convergence trend of estimated cluster centers under different missing mechanisms and [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: The MSE values versus the parameter ρ of true cluster centers. In panels of Case (a), the red point represents the empirical transition point (the minimum ρ required to achieve MSE ≈ 0), and the vertical dotted line represents the approximated theoretical threshold. proportion is larger, the vertical dotted line is always in the right of the red point, that is, the theoretical threshold is always larger th… view at source ↗

read the original abstract

The classical $k$-means clustering cannot be directly used to incomplete data, and existing $k$-means-based clustering for missing data primarily focus on improving the practical accuracy of clustering, whereas most of them lack theoretical guarantees in the asymptotic sense. In this paper, we investigate the statistical properties of $k$-means clustering in the presence of missing data. We first establish the $\sqrt{n}$-excess risk bound and prove the consistency of the estimated cluster centers under general missing mechanisms. For the Missing Completely at Random (MCAR) mechanism, we further derive the $\sqrt{n}$-convergence rate and asymptotic normality of the estimated cluster centers. Moreover, we study in what cases the cluster centers estimated by incomplete data converge to the true cluster centers of original fully observed data, and give a sufficient condition about the missing probability and the separation among true clusters. These results provide a theoretical guarantee for missing-data-$k$-means. Notably, our analysis reveal that under MCAR mechanism, both achieving the $\sqrt{n}$-rate and converging to the true cluster centers require $k$ true centers to be distinct in every dimension, highlighting the significant challenges of application in high-dimensional regimes. Finally, we conduct numerical simulations on synthetic incomplete datasets to support our theoretical analysis results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Extends k-means asymptotics to MCAR missingness with √n rates and normality, but the per-dimension separation condition limits its reach.

read the letter

The core contribution is adapting existing k-means excess-risk and consistency arguments to incomplete data. Under general missingness they recover √n excess risk and consistency of the centers. Under MCAR they strengthen this to √n convergence of the centers themselves plus asymptotic normality, and they supply a sufficient condition on missing probability plus cluster separation that makes the incomplete-data centers converge to the full-data ones.

The paper does the useful job of spelling out where the theory carries over and where it does not. The explicit call-out that both the rate and the convergence require the true centers to be distinct in every coordinate is honest and worth knowing; it directly flags the high-dimensional difficulty.

The main limitation is that same per-dimension separation requirement. It is stated as necessary, but it narrows the practical scope quite a bit. Simulations are mentioned without enough detail to judge how well they match the theory. The derivations are asserted rather than shown here, so any referee would need to check the concentration steps and the handling of the missingness indicator.

This is aimed at researchers who already work on theoretical clustering with missing data. The formal bounds are the main value, so the paper is worth sending to peer review even if the separation condition ends up needing more discussion or relaxation.

Referee Report

0 major / 2 minor

Summary. The paper claims to establish √n-excess risk bounds and consistency of estimated k-means cluster centers under general missing mechanisms. Under the MCAR mechanism it further derives √n-convergence rates and asymptotic normality of the centers, identifies a sufficient condition on the missing probability and cluster separation for convergence to the true fully-observed centers, and shows that both the rate and the convergence require the k true centers to be distinct in every dimension. The claims are supported by numerical simulations on synthetic incomplete data.

Significance. If the derivations hold, the results supply the first asymptotic guarantees for k-means under missing data and isolate a concrete high-dimensional obstruction (distinctness in every coordinate) that is likely to be practically relevant. The asymptotic normality result would also enable downstream inference procedures on incomplete data.

minor comments (2)

[Abstract / MCAR section] The abstract states that proofs exist for the √n bounds and normality but the manuscript should include a short proof sketch or key steps in the main text (rather than relegating all details to an appendix) so that the distinct-in-every-dimension necessity can be verified without external material.
[Numerical experiments] The simulation section should report the precise dimensions, missing probabilities, and number of Monte Carlo replications used, together with quantitative checks (e.g., empirical rates or QQ plots) that directly corroborate the claimed √n rate and normality.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were raised, so we have no specific points to address point-by-point. We will perform a careful proofreading and minor polishing pass for the revised version.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives √n-excess risk bounds, consistency, √n-convergence rates, asymptotic normality, and a sufficient condition for convergence to true centers under MCAR via standard concentration and asymptotic arguments. The distinct-in-every-dimension requirement is presented as an output of the analysis (necessary for both the rate and true-center convergence), not an input definition or fitted parameter. No self-citations, ansatzes smuggled via prior work, or predictions that reduce to the fitted inputs by construction are indicated in the provided material; the central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard probabilistic limit theorems and the MCAR assumption; no free parameters or new entities are introduced in the abstract.

axioms (2)

standard math Standard concentration inequalities and central limit theorem apply to the incomplete-data objective
Invoked to obtain √n rates and normality under MCAR
domain assumption Missing completely at random mechanism
Used to derive the tighter √n-convergence and normality results

pith-pipeline@v0.9.1-grok · 5747 in / 1259 out tokens · 26370 ms · 2026-07-03T06:10:51.058300+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Journal of statistical software , volume=

mice: Multivariate imputation by chained equations in R , author=. Journal of statistical software , volume=
[2]

Journal of statistical software , volume=

Amelia II: A program for missing data , author=. Journal of statistical software , volume=
[3]

The American Statistician , volume=

k-pod: A method for k-means clustering of missing data , author=. The American Statistician , volume=. 2016 , publisher=

2016
[4]

IEEE Access , volume=

K-means clustering with incomplete data , author=. IEEE Access , volume=. 2019 , publisher=

2019
[5]

The annals of statistics , volume =

Strong consistency of k-means clustering , author=. The annals of statistics , volume =
[6]

Pollard, David , journal =
[7]

The Annals of Statistics , volume=

Trimmed k -means: an attempt to robustify quantizers , author=. The Annals of Statistics , volume=
[8]

Scandinavian Journal of Statistics , volume=

Strong consistency of reduced k-means clustering , author=. Scandinavian Journal of Statistics , volume=
[9]

Annals of the Institute of Statistical Mathematics , volume=

Strong consistency of factorial k-means clustering , author=. Annals of the Institute of Statistical Mathematics , volume=
[10]

Electronic Journal of Statistics , pages =

Regularized k-means clustering of high-dimensional data and its asymptotic consistency , author=. Electronic Journal of Statistics , pages =. 2012 , volume =

2012
[11]

Journal of Machine Learning Research , volume=

Regularized k-means through hard-thresholding , author=. Journal of Machine Learning Research , volume=
[12]

Advances in Neural Information Processing Systems , volume=

Robust k-means: a theoretical revisit , author=. Advances in Neural Information Processing Systems , volume=
[13]

Stat , volume=

On the strong consistency of feature-weighted k-means clustering in a nearmetric space , author=. Stat , volume=. 2019 , publisher=

2019
[14]

The Annals of Statistics , pages=

Consistency of spectral clustering , author=. The Annals of Statistics , pages=
[15]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Implicit annealing in kernel spaces: A strongly consistent clustering approach , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2022 , publisher=

2022
[16]

International Conference on Machine Learning , pages=

Consistency of multiple kernel clustering , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[17]

The Annals of Statistics , volume=

Asymptotic theory of geometric and adaptive k-means clustering , author=. The Annals of Statistics , volume=
[18]

Annals of statistics , pages=

A central limit theorem for multivariate generalized trimmed k-means , author=. Annals of statistics , pages=. 1999 , publisher=

1999
[19]

International Journal Of General System , volume=

On asymptotic normality of a class of fuzzy c-means clustering procedures , author=. International Journal Of General System , volume=. 1994 , publisher=

1994
[20]

Khoshgoftaar

Jason Van Hulse and Taghi M. Khoshgoftaar. Incomplete--case nearest neighbor imputation in software measurement data. Information Sciences. 2014. doi:https://doi.org/10.1016/j.ins.2010.12.017

work page doi:10.1016/j.ins.2010.12.017 2014
[21]

Adaptive imputation of missing values for incomplete pattern classification

Zhunga Liu and Quan Pan and Jean Dezert and Arnaud Martin. Adaptive imputation of missing values for incomplete pattern classification. Pattern Recognition. 2016

2016
[22]

MissForest--non--parametric missing value imputation for mixed--type data

Stekhoven, Daniel J and B \"u hlmann, Peter. MissForest--non--parametric missing value imputation for mixed--type data. Bioinformatics. 2012

2012
[23]

Gain: Missing data imputation using generative adversarial nets

Yoon, Jinsung and Jordon, James and Schaar, Mihaela. Gain: Missing data imputation using generative adversarial nets. International conference on machine learning. 2018

2018
[24]

Suvra Jyoti Choudhury and Nikhil R. Pal. Imputation of missing data with neural networks for classification. Knowledge--Based Systems. 2019

2019
[25]

Missing data imputation with adversarially--trained graph convolutional networks

Indro Spinelli and Simone Scardapane and Aurelio Uncini. Missing data imputation with adversarially--trained graph convolutional networks. Neural Networks. 2020

2020
[26]

Mean shift clustering algorithm for data with missing values

AbdAllah, Loai and Shimshoni, Ilan. Mean shift clustering algorithm for data with missing values. International Conference on Data Warehousing and Knowledge Discovery. 2014

2014
[27]

Machine Learning , volume=

Clustering with missing features: a penalized dissimilarity measure based approach , author=. Machine Learning , volume=. 2018 , publisher=

2018
[28]

Mesquita and João P.P

Diego P.P. Mesquita and João P.P. Gomes and Amauri H. Souza Junior and Juvêncio S. Nobre. Euclidean distance estimation in incomplete datasets. Neurocomputing. 2017

2017
[29]

Statistical Analysis and Data Mining: The ASA Data Science Journal , volume=

An efficient k-means-type algorithm for clustering datasets with incomplete records , author=. Statistical Analysis and Data Mining: The ASA Data Science Journal , volume=
[30]

Journal of Classification , volume=

Imputation strategies for clustering mixed-type data with missing values , author=. Journal of Classification , volume=
[31]

Advanced Machine Learning and Data Science , year=

Joint dimensionality reduction and clustering with missing data , author=. Advanced Machine Learning and Data Science , year=
[32]

Van Der Vaart, Aad W and Wellner, Jon A , title =
[33]

2018 , publisher=

Foundations of machine learning , author=. 2018 , publisher=

2018
[34]

Journal of machine learning research , volume=

Rademacher and gaussian complexities: Risk bounds and structural results , author=. Journal of machine learning research , volume=
[35]

Transactions on Machine Learning Research , year=

A note on the k -means clustering for missing data , author=. Transactions on Machine Learning Research , year=
[36]

Statistical and Computational Guarantees of Lloyd's Algorithm and its Variants

Statistical and computational guarantees of lloyd's algorithm and its variants , author=. arXiv preprint arXiv:1612.02099 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Journal of statistical software , volume=

mice: Multivariate imputation by chained equations in R , author=. Journal of statistical software , volume=

[2] [2]

Journal of statistical software , volume=

Amelia II: A program for missing data , author=. Journal of statistical software , volume=

[3] [3]

The American Statistician , volume=

k-pod: A method for k-means clustering of missing data , author=. The American Statistician , volume=. 2016 , publisher=

2016

[4] [4]

IEEE Access , volume=

K-means clustering with incomplete data , author=. IEEE Access , volume=. 2019 , publisher=

2019

[5] [5]

The annals of statistics , volume =

Strong consistency of k-means clustering , author=. The annals of statistics , volume =

[6] [6]

Pollard, David , journal =

[7] [7]

The Annals of Statistics , volume=

Trimmed k -means: an attempt to robustify quantizers , author=. The Annals of Statistics , volume=

[8] [8]

Scandinavian Journal of Statistics , volume=

Strong consistency of reduced k-means clustering , author=. Scandinavian Journal of Statistics , volume=

[9] [9]

Annals of the Institute of Statistical Mathematics , volume=

Strong consistency of factorial k-means clustering , author=. Annals of the Institute of Statistical Mathematics , volume=

[10] [10]

Electronic Journal of Statistics , pages =

Regularized k-means clustering of high-dimensional data and its asymptotic consistency , author=. Electronic Journal of Statistics , pages =. 2012 , volume =

2012

[11] [11]

Journal of Machine Learning Research , volume=

Regularized k-means through hard-thresholding , author=. Journal of Machine Learning Research , volume=

[12] [12]

Advances in Neural Information Processing Systems , volume=

Robust k-means: a theoretical revisit , author=. Advances in Neural Information Processing Systems , volume=

[13] [13]

Stat , volume=

On the strong consistency of feature-weighted k-means clustering in a nearmetric space , author=. Stat , volume=. 2019 , publisher=

2019

[14] [14]

The Annals of Statistics , pages=

Consistency of spectral clustering , author=. The Annals of Statistics , pages=

[15] [15]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Implicit annealing in kernel spaces: A strongly consistent clustering approach , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2022 , publisher=

2022

[16] [16]

International Conference on Machine Learning , pages=

Consistency of multiple kernel clustering , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[17] [17]

The Annals of Statistics , volume=

Asymptotic theory of geometric and adaptive k-means clustering , author=. The Annals of Statistics , volume=

[18] [18]

Annals of statistics , pages=

A central limit theorem for multivariate generalized trimmed k-means , author=. Annals of statistics , pages=. 1999 , publisher=

1999

[19] [19]

International Journal Of General System , volume=

On asymptotic normality of a class of fuzzy c-means clustering procedures , author=. International Journal Of General System , volume=. 1994 , publisher=

1994

[20] [20]

Khoshgoftaar

Jason Van Hulse and Taghi M. Khoshgoftaar. Incomplete--case nearest neighbor imputation in software measurement data. Information Sciences. 2014. doi:https://doi.org/10.1016/j.ins.2010.12.017

work page doi:10.1016/j.ins.2010.12.017 2014

[21] [21]

Adaptive imputation of missing values for incomplete pattern classification

Zhunga Liu and Quan Pan and Jean Dezert and Arnaud Martin. Adaptive imputation of missing values for incomplete pattern classification. Pattern Recognition. 2016

2016

[22] [22]

MissForest--non--parametric missing value imputation for mixed--type data

Stekhoven, Daniel J and B \"u hlmann, Peter. MissForest--non--parametric missing value imputation for mixed--type data. Bioinformatics. 2012

2012

[23] [23]

Gain: Missing data imputation using generative adversarial nets

Yoon, Jinsung and Jordon, James and Schaar, Mihaela. Gain: Missing data imputation using generative adversarial nets. International conference on machine learning. 2018

2018

[24] [24]

Suvra Jyoti Choudhury and Nikhil R. Pal. Imputation of missing data with neural networks for classification. Knowledge--Based Systems. 2019

2019

[25] [25]

Missing data imputation with adversarially--trained graph convolutional networks

Indro Spinelli and Simone Scardapane and Aurelio Uncini. Missing data imputation with adversarially--trained graph convolutional networks. Neural Networks. 2020

2020

[26] [26]

Mean shift clustering algorithm for data with missing values

AbdAllah, Loai and Shimshoni, Ilan. Mean shift clustering algorithm for data with missing values. International Conference on Data Warehousing and Knowledge Discovery. 2014

2014

[27] [27]

Machine Learning , volume=

Clustering with missing features: a penalized dissimilarity measure based approach , author=. Machine Learning , volume=. 2018 , publisher=

2018

[28] [28]

Mesquita and João P.P

Diego P.P. Mesquita and João P.P. Gomes and Amauri H. Souza Junior and Juvêncio S. Nobre. Euclidean distance estimation in incomplete datasets. Neurocomputing. 2017

2017

[29] [29]

Statistical Analysis and Data Mining: The ASA Data Science Journal , volume=

An efficient k-means-type algorithm for clustering datasets with incomplete records , author=. Statistical Analysis and Data Mining: The ASA Data Science Journal , volume=

[30] [30]

Journal of Classification , volume=

Imputation strategies for clustering mixed-type data with missing values , author=. Journal of Classification , volume=

[31] [31]

Advanced Machine Learning and Data Science , year=

Joint dimensionality reduction and clustering with missing data , author=. Advanced Machine Learning and Data Science , year=

[32] [32]

Van Der Vaart, Aad W and Wellner, Jon A , title =

[33] [33]

2018 , publisher=

Foundations of machine learning , author=. 2018 , publisher=

2018

[34] [34]

Journal of machine learning research , volume=

Rademacher and gaussian complexities: Risk bounds and structural results , author=. Journal of machine learning research , volume=

[35] [35]

Transactions on Machine Learning Research , year=

A note on the k -means clustering for missing data , author=. Transactions on Machine Learning Research , year=

[36] [36]

Statistical and Computational Guarantees of Lloyd's Algorithm and its Variants

Statistical and computational guarantees of lloyd's algorithm and its variants , author=. arXiv preprint arXiv:1612.02099 , year=

work page internal anchor Pith review Pith/arXiv arXiv