MNAR-$k$-means: A $k$-means Clustering for Data Missing Not at Random with Magnitude-Decaying Probability

Xin Guan

arxiv: 2606.31253 · v1 · pith:SVO66LOWnew · submitted 2026-06-30 · 📊 stat.ML · cs.LG

MNAR-k-means: A k-means Clustering for Data Missing Not at Random with Magnitude-Decaying Probability

Xin Guan This is my paper

Pith reviewed 2026-07-01 04:26 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords k-means clusteringmissing not at randomMNARstatistical consistencyimputation constraintmagnitude-decaying missingness

0 comments

The pith

A constrained k-means variant recovers true cluster centers from magnitude-decaying MNAR data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard k-means distorts cluster centers on data missing not at random because mean imputation ignores the dependence between missingness and value size. The paper introduces MNAR-k-means, which adds an explicit bound on the magnitude of imputed entries to the clustering objective. Under the assumption that missing probability falls as absolute value rises, the method proves that its estimated centers converge in probability to the centers that would be obtained from complete data. An alternating minimization procedure solves the resulting optimization. Experiments on simulated and real incomplete datasets show lower center bias and improved cluster assignments compared with naive observed-distance or mean-imputation baselines.

Core claim

Under a magnitude-decaying MNAR mechanism, adding a size constraint on the imputed values to the k-means objective produces cluster-center estimates that are statistically consistent for the centers of the fully observed data.

What carries the argument

A magnitude constraint on imputed values inside the k-means loss, solved by alternating minimization between cluster assignments and center updates.

If this is right

Estimated centers remain consistent for the complete-data centers as sample size grows.
Clustering assignments improve relative to mean-imputation k-means on the same MNAR data.
The alternating algorithm converges to a stationary point of the constrained objective.
Bias in center estimates decreases as the decay rate of missingness increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same size-constraint idea could be inserted into other distance-based procedures such as hierarchical clustering.
Real-world sensor or survey data often exhibit exactly this magnitude-decaying missing pattern and would therefore be a natural test bed.
Extending the consistency proof to the case of unknown decay rate would require an additional estimation step for that rate.

Load-bearing premise

Missingness probability decreases as the absolute value of each entry increases.

What would settle it

Generate fully observed data with known centers, impose missingness whose probability does not decay with magnitude, run the method, and check whether the recovered centers still converge to the true centers.

Figures

Figures reproduced from arXiv: 2606.31253 by Xin Guan.

**Figure 1.** Figure 1: The clustering results of data in R 2 with 30% missingness by different methods. From left to right: k-means on original fully observed data, k-POD and proposed method on missing data under an MNAR mechanism. Each point represents an original fully observed data point. Shape represents the true cluster label, and color represents the estimated cluster label. Three solid black points are estimated cluster c… view at source ↗

**Figure 2.** Figure 2: The results of CER and MSE of the proposed method under the MNAR mechanism satisfying Assumption [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: The results of CER and MSE of different methods under different missing mechanisms and missing propor [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 5.** Figure 5: The CER values of the proposed method with different values of λ, and corresponding instability indexes. The vertical gray dotted line is λ = 1 selected by the instability criterion. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Left: The convergence of estimated cluster centers of proposed method to true cluster centers as sample size n increases. The red solid line is the proposed method with theoretically optimal λ. The blue dashed line is the result of k-POD. The black dotted line is the result of k-means on the original fully observed dataset. Right: The MSE values of proposed method with different values of λ, where λopt is … view at source ↗

**Figure 7.** Figure 7: The UMAP visualization of clustering results of different methods for the [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

read the original abstract

The classical $k$-means clustering, based on distances computed from all data features, cannot be directly applied to incomplete data with missing values. A natural extension of $k$-means to missing data is to involve only the observed positions in clustering, which is equivalent to imputing missing values by corresponding cluster means. However, for data missing not at random (MNAR), since missingness is related to data values, such a mean-imputation-based method may lead to the distortion of estimated cluster centers, resulting in a poor clustering result. Since MNAR mechanisms are very common in reality, it is necessary to improve the performance of $k$-means-based clustering methods for such data. In this paper, we focus on a magnitude-decaying MNAR scenario where data is more likely to be missing at positions with smaller absolute values, and we propose a novel $k$-means clustering method based on the constraint of the size of imputation values, which enjoys a good mathematical interpretation. Moreover, we establish the statistical consistency of the estimated cluster centers of the proposed method to the true cluster centers of fully observed data, and solve the optimization of the proposed loss function via an alternative minimization algorithm. Simulation experiments verify the effect of the proposed method in improving clustering results and reducing the bias of estimated cluster centers. Applications to real-world missing data further show the utility of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds an imputation-size constraint to k-means under magnitude-decaying MNAR and claims consistency to the true centers, but the alternating-minimization solver lacks any convergence argument that would make the claim apply to the actual procedure.

read the letter

The main thing here is a k-means extension that constrains the size of imputed values specifically for the case where missingness probability falls as absolute value grows. The authors tie that constraint directly to the MNAR mechanism, give it a mathematical reading, and state a consistency result for the estimated centers relative to the fully observed population centers.

The constraint itself is the clearest new piece; it turns the usual mean-imputation step into something that respects the assumed missingness pattern rather than ignoring it. The alternating-minimization algorithm is simple and the simulation section apparently shows reduced bias in the recovered centers, which is the practical payoff they emphasize.

The soft spot is the gap between the consistency theorem and the implemented estimator. The theorem is stated for the population minimizer of the constrained loss, yet the paper solves the finite-sample problem with alternating minimization and supplies no convergence guarantee, uniqueness result, or initialization condition. Without that link, it is possible for the algorithm to stop at a local solution whose centers remain biased even under the magnitude-decaying assumption. The experiments do not appear to test this possibility directly.

The work is aimed at applied statisticians and ML users who already model MNAR mechanisms and need a clustering tool that incorporates one specific pattern. It is narrow enough that most readers outside that niche will not need it, but the combination of the constraint and the consistency statement is coherent enough to merit a referee's time. I would send it for review so the derivations and the algorithm's behavior can be examined in detail.

Referee Report

2 major / 1 minor

Summary. The paper proposes MNAR-k-means, a constrained variant of k-means for magnitude-decaying MNAR data that imputes missing entries subject to a size constraint on the imputed values. It claims to prove statistical consistency of the resulting cluster-center estimates to the centers that would be obtained from the fully observed data, solves the finite-sample objective via alternating minimization, and reports improved performance in simulations and real-data examples relative to standard mean-imputation k-means.

Significance. If the consistency result holds for the estimator that is actually computed, the method would supply a theoretically supported clustering procedure for a practically relevant MNAR regime and could reduce bias relative to naive imputation approaches.

major comments (2)

[Abstract / theoretical analysis] Abstract and theoretical-results section: the consistency statement is phrased for 'the estimated cluster centers of the proposed method,' yet the only optimization procedure described is alternating minimization. No convergence guarantee to a global minimizer (or even to a stationary point whose limit is consistent) is indicated; therefore the headline consistency claim does not automatically transfer to the implemented estimator.
[Theoretical analysis] Consistency theorem (presumably the main result): if the theorem is stated only for the population or sample minimizer of the constrained loss, an additional argument is required to show that the alternating-minimization sequence inherits the same limit; without it the central claim is not fully supported by the stated theory.

minor comments (1)

[Model section] The precise functional form of the magnitude-decaying missingness probability (e.g., the decay rate or the threshold) should be stated explicitly in the model definition rather than left at the level of the abstract description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the distinction between the theoretical minimizer and the output of the alternating-minimization procedure. We address each major comment below and will revise the manuscript accordingly to make the scope of the consistency result precise.

read point-by-point responses

Referee: [Abstract / theoretical analysis] Abstract and theoretical-results section: the consistency statement is phrased for 'the estimated cluster centers of the proposed method,' yet the only optimization procedure described is alternating minimization. No convergence guarantee to a global minimizer (or even to a stationary point whose limit is consistent) is indicated; therefore the headline consistency claim does not automatically transfer to the implemented estimator.

Authors: We agree that the consistency theorem applies to the global minimizer of the constrained objective rather than directly to the output of any particular solver. The manuscript presents alternating minimization solely as a computational method for the finite-sample problem. In the revision we will update the abstract and the theoretical-results section to state explicitly that consistency holds for the minimizer of the proposed loss; we will also add a brief discussion noting that alternating minimization is used in practice and that its empirical behavior is examined in the simulation studies. revision: yes
Referee: [Theoretical analysis] Consistency theorem (presumably the main result): if the theorem is stated only for the population or sample minimizer of the constrained loss, an additional argument is required to show that the alternating-minimization sequence inherits the same limit; without it the central claim is not fully supported by the stated theory.

Authors: The observation is correct: an additional argument linking the iterates of alternating minimization to the consistent minimizer would be needed to transfer the limit result. Because establishing such a guarantee for the non-convex alternating procedure lies outside the present scope, we will revise the manuscript to restrict the consistency claim to the exact minimizer of the constrained loss and to clarify the role of the algorithm as a practical approximation whose performance is supported by the reported experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: consistency targets external true centers of fully observed data

full rationale

The paper defines a constrained k-means loss for magnitude-decaying MNAR data and claims statistical consistency of its estimated centers to the true centers of the fully observed data. This target is external and independent of the method's fitted values. The alternating minimization is presented only as a computational solver for the loss; no equation or self-citation reduces the consistency statement to a tautology, a fitted parameter renamed as prediction, or a self-referential definition. The derivation chain therefore remains self-contained against the stated external benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of magnitude-decaying MNAR and on an unspecified constraint parameter that controls imputation size; no free parameters or invented entities are explicitly named in the abstract.

free parameters (1)

imputation size constraint threshold
The method is built around a constraint on the size of imputation values whose exact functional form and any tuning parameter are not detailed in the abstract.

axioms (1)

domain assumption Missingness occurs with probability that decays with absolute magnitude (magnitude-decaying MNAR)
Invoked as the scenario the method targets and under which consistency is claimed.

pith-pipeline@v0.9.1-grok · 5779 in / 1251 out tokens · 53398 ms · 2026-07-01T04:26:37.020193+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 5 canonical work pages

[1]

doi: https://doi.org/10.1016/j.knosys.2019.07.009

ISSN 0950-7051. doi: https://doi.org/10.1016/j.knosys.2019.07.009. S. Datta, S. Bhattacharjee, and S. Das. Clustering with missing features: a penalized dissimilarity measure based approach.Machine Learning, 107:1987–2025,

work page doi:10.1016/j.knosys.2019.07.009 2019
[2]

doi: https://doi.org/10.1016/j.patcog.2015.10.001

ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2015.10.001. Z. Ghahramani and M. I. Jordan. Learning from incomplete data. Technical Report AIM-1509CBCL-108, Mas- sachusetts Institute of Technology,

work page doi:10.1016/j.patcog.2015.10.001 2015
[3]

doi: https://doi.org/10.1016/j.ins.2010.12.017

ISSN 0020-0255. doi: https://doi.org/10.1016/j.ins.2010.12.017. J. Jin and W. Wang. Influential features pca for high dimensional clustering.The Annals of Statistics, 44(6):2323 – 2359,

work page doi:10.1016/j.ins.2010.12.017 2010
[4]

J. Li, S. Song, Y . Zhang, and Z. Zhou. Robust k-median and k-means clustering algorithms for incomplete data. Mathematical Problems in Engineering, 2016(1):4321928,

2016
[5]

doi: https://doi.org/10.1016/j.neucom.2016.12.081

ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2016.12.081. Neural Networks : Learning Algorithms and Classification Systems. M. Mohri, A. Rostamizadeh, and A. Talwalkar.Foundations of machine learning. MIT press,

work page doi:10.1016/j.neucom.2016.12.081 2016
[6]

doi: https://doi.org/10.1016/j.neunet.2020.06

ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2020.06

work page doi:10.1016/j.neunet.2020.06 2020
[7]

Proof.Consider a function class onX × {0,1} p as follows: Gk = gk,M : (x,r)7→min l=1,...,k pX j=1 rj(xj −µ lj)2 +λ(1−r j)µ2 lj ∀M∈ M,∥µ l∥2 ≤B, l= 1, . . . , k . For a size-nrandom sample of missing data(X,R) ={(x 1,r 1), . . . ,(xn,r n)}, we can write Lλ(M) =E x1,r1 [gk,M (x1,r 1)]and bLλ n(M) = 1 n nX i=1 gk,M (xi,r i). First, we prove the expectationE ...

2002
[8]

24√n Z ∥G∥L2 (Pn ) 0 p log(N(τ,G,L 2(Pn)))dτ # =E X,R

Then for anyg µ ∈ G, we have|g µ(x,r)| ≤G(x,r)holds for any(x,r)∈ X × {0,1} p. Under Assumption 2 (i.e., x1 is sub-Gaussian), we haveE x1[∥x1∥4 2]<∞, it follows thatE x1,r1[(G(x1,r 1))2]<∞. We write σ2 G =E x1,r1[(G(x1,r 1))2]. Then, according to Equations (3.8)-(3.13) of Mohri et al. [2018], we have EX,R sup M∈M Lλ(M)− bLλ n(M) ≤2R n(Gk)≤2kR n(G), Thus, ...

2018
[9]

sup M∈M 1 n nX i=1 g− k,M (xi,r i) 2 # ≤E X,R

i − 1 n nX i=1 g+ k,M (xi,r i) > ˜ϵ 2 ! :=(I)+(II). 20 Thus, it suffices to prove the two terms converge to zero asn→ ∞. For (I), because ∥g− k,M ∥∞ = sup x,r |g− k,M (x,r)| ≤sup x,r G(x,r)·1(G(x,r)≤ √n) ≤ √n and EX,R " sup M∈M 1 n nX i=1 g− k,M (xi,r i) 2 # ≤E X,R " 1 n nX i=1 (G(xi,r i))2 ·1(G(x i,r i)≤ √n) # ≤E x1,r1 h (G(x1,r 1))2 i =σ 2 G, then accor...

2019
[10]

, g1p, g21,

Because anyg µ ∈ Gcan be rewritten as gµ(x,r) = pX j=1 rjx2 j + pX j=1 (−2µj)·(r jxj) + pX j=1 (1−λ)µ 2 j ·r j + pX j=1 λµ2 j = pX j=1 g1j(x,r) + pX j=1 (−2µj)·g 2j(x,r) + pX j=1 (1−λ)µ 2 j ·g 3j(x,r) + pX j=1 λµ2 j ·g 4(x,r), then it is a linear combination of{g 11, . . . , g1p, g21, . . . , g2p, g31 . . . , g3p, g4}, i.e., G ⊂span g11, . . . , g1p, g21,...

2019
[11]

is satisfied. Consequently, we can apply the Bernstein inequality (Proposition 2.10 of Wainwright [2019]) to∆ ll′ ij as follows: Pr ∆ll′ ij −E xi h ∆ll′ ij rij = 1,z i =l i ≤ϵ rij = 1,z i =l ≥1−2 exp − ϵ2 2Var(xij |r ij = 1,z i =l) + 2b ljϵ ,∀ϵ >0, 25 which implies Pr ∆ll′ ij ≥E xi h ∆ll′ ij rij = 1,z i =l i −ϵ rij = 1,z i =l ≥1−2 exp − ϵ2 2Var(xij |r ij ...

2019
[12]

, n,z ∗∗ i = zi)→ 1under the conditionσ 2 log(n)→0

Because ∥xi −µ ∗∗ l′ ∥2 2 − ∥xi −µ ∗∗ l ∥2 2 = pX j=1 2(µ∗∗ lj −µ ∗∗ l′j)xij + (µ∗∗ l′j)2 −(µ ∗∗ lj )2, and givenz i =l,x ij ∼ N(µ ∗∗ lj , σ2), then by using the same technique, we can obtain Pr(∀i= 1, . . . , n,z ∗∗ i = zi)→ 1under the conditionσ 2 log(n)→0. Combining the consistency results ofˆ z λ i andz ∗∗ i leads to Pr ∀i= 1, . . . , n,ˆ zλ i = z∗∗ i...

1981

[1] [1]

doi: https://doi.org/10.1016/j.knosys.2019.07.009

ISSN 0950-7051. doi: https://doi.org/10.1016/j.knosys.2019.07.009. S. Datta, S. Bhattacharjee, and S. Das. Clustering with missing features: a penalized dissimilarity measure based approach.Machine Learning, 107:1987–2025,

work page doi:10.1016/j.knosys.2019.07.009 2019

[2] [2]

doi: https://doi.org/10.1016/j.patcog.2015.10.001

ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2015.10.001. Z. Ghahramani and M. I. Jordan. Learning from incomplete data. Technical Report AIM-1509CBCL-108, Mas- sachusetts Institute of Technology,

work page doi:10.1016/j.patcog.2015.10.001 2015

[3] [3]

doi: https://doi.org/10.1016/j.ins.2010.12.017

ISSN 0020-0255. doi: https://doi.org/10.1016/j.ins.2010.12.017. J. Jin and W. Wang. Influential features pca for high dimensional clustering.The Annals of Statistics, 44(6):2323 – 2359,

work page doi:10.1016/j.ins.2010.12.017 2010

[4] [4]

J. Li, S. Song, Y . Zhang, and Z. Zhou. Robust k-median and k-means clustering algorithms for incomplete data. Mathematical Problems in Engineering, 2016(1):4321928,

2016

[5] [5]

doi: https://doi.org/10.1016/j.neucom.2016.12.081

ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2016.12.081. Neural Networks : Learning Algorithms and Classification Systems. M. Mohri, A. Rostamizadeh, and A. Talwalkar.Foundations of machine learning. MIT press,

work page doi:10.1016/j.neucom.2016.12.081 2016

[6] [6]

doi: https://doi.org/10.1016/j.neunet.2020.06

ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2020.06

work page doi:10.1016/j.neunet.2020.06 2020

[7] [7]

Proof.Consider a function class onX × {0,1} p as follows: Gk = gk,M : (x,r)7→min l=1,...,k pX j=1 rj(xj −µ lj)2 +λ(1−r j)µ2 lj ∀M∈ M,∥µ l∥2 ≤B, l= 1, . . . , k . For a size-nrandom sample of missing data(X,R) ={(x 1,r 1), . . . ,(xn,r n)}, we can write Lλ(M) =E x1,r1 [gk,M (x1,r 1)]and bLλ n(M) = 1 n nX i=1 gk,M (xi,r i). First, we prove the expectationE ...

2002

[8] [8]

24√n Z ∥G∥L2 (Pn ) 0 p log(N(τ,G,L 2(Pn)))dτ # =E X,R

Then for anyg µ ∈ G, we have|g µ(x,r)| ≤G(x,r)holds for any(x,r)∈ X × {0,1} p. Under Assumption 2 (i.e., x1 is sub-Gaussian), we haveE x1[∥x1∥4 2]<∞, it follows thatE x1,r1[(G(x1,r 1))2]<∞. We write σ2 G =E x1,r1[(G(x1,r 1))2]. Then, according to Equations (3.8)-(3.13) of Mohri et al. [2018], we have EX,R sup M∈M Lλ(M)− bLλ n(M) ≤2R n(Gk)≤2kR n(G), Thus, ...

2018

[9] [9]

sup M∈M 1 n nX i=1 g− k,M (xi,r i) 2 # ≤E X,R

i − 1 n nX i=1 g+ k,M (xi,r i) > ˜ϵ 2 ! :=(I)+(II). 20 Thus, it suffices to prove the two terms converge to zero asn→ ∞. For (I), because ∥g− k,M ∥∞ = sup x,r |g− k,M (x,r)| ≤sup x,r G(x,r)·1(G(x,r)≤ √n) ≤ √n and EX,R " sup M∈M 1 n nX i=1 g− k,M (xi,r i) 2 # ≤E X,R " 1 n nX i=1 (G(xi,r i))2 ·1(G(x i,r i)≤ √n) # ≤E x1,r1 h (G(x1,r 1))2 i =σ 2 G, then accor...

2019

[10] [10]

, g1p, g21,

Because anyg µ ∈ Gcan be rewritten as gµ(x,r) = pX j=1 rjx2 j + pX j=1 (−2µj)·(r jxj) + pX j=1 (1−λ)µ 2 j ·r j + pX j=1 λµ2 j = pX j=1 g1j(x,r) + pX j=1 (−2µj)·g 2j(x,r) + pX j=1 (1−λ)µ 2 j ·g 3j(x,r) + pX j=1 λµ2 j ·g 4(x,r), then it is a linear combination of{g 11, . . . , g1p, g21, . . . , g2p, g31 . . . , g3p, g4}, i.e., G ⊂span g11, . . . , g1p, g21,...

2019

[11] [11]

is satisfied. Consequently, we can apply the Bernstein inequality (Proposition 2.10 of Wainwright [2019]) to∆ ll′ ij as follows: Pr ∆ll′ ij −E xi h ∆ll′ ij rij = 1,z i =l i ≤ϵ rij = 1,z i =l ≥1−2 exp − ϵ2 2Var(xij |r ij = 1,z i =l) + 2b ljϵ ,∀ϵ >0, 25 which implies Pr ∆ll′ ij ≥E xi h ∆ll′ ij rij = 1,z i =l i −ϵ rij = 1,z i =l ≥1−2 exp − ϵ2 2Var(xij |r ij ...

2019

[12] [12]

, n,z ∗∗ i = zi)→ 1under the conditionσ 2 log(n)→0

Because ∥xi −µ ∗∗ l′ ∥2 2 − ∥xi −µ ∗∗ l ∥2 2 = pX j=1 2(µ∗∗ lj −µ ∗∗ l′j)xij + (µ∗∗ l′j)2 −(µ ∗∗ lj )2, and givenz i =l,x ij ∼ N(µ ∗∗ lj , σ2), then by using the same technique, we can obtain Pr(∀i= 1, . . . , n,z ∗∗ i = zi)→ 1under the conditionσ 2 log(n)→0. Combining the consistency results ofˆ z λ i andz ∗∗ i leads to Pr ∀i= 1, . . . , n,ˆ zλ i = z∗∗ i...

1981