Causal K-Means Clustering

Edward H. Kennedy; Jisu Kim; Kwangho Kim

arXiv: 2405.03083 · v5 · submitted 2024-05-05 · stat.ME · cs.LG · stat.ML


Kwangho Kim, Jisu Kim, Edward H. Kennedy

Pith reviewed 2026-05-24 00:59 UTC · model grok-4.3

classification: stat.ME · cs.LG · stat.ML

keywords: causal inference, k-means clustering, heterogeneous treatment effects, double machine learning, subgroup analysis, nonparametric estimation, bias correction, counterfactual functions


The pith

Causal k-Means Clustering identifies unknown subgroups with heterogeneous treatment effects by clustering on counterfactual functions, with a double machine learning estimator achieving root-n convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Causal k-Means Clustering to find hidden subgroups where treatment effects differ by applying k-means directly to estimated counterfactual functions rather than observed data. Population-level summaries often miss this heterogeneity, so recovering the subgroup structure allows more precise evaluation of how effects vary across units. A straightforward plug-in estimator is introduced and its convergence rate analyzed, followed by a bias-corrected version that uses nonparametric efficiency theory and double machine learning to reach root-n rates and asymptotic normality even in large nonparametric models. The approach handles multiple treatment levels in outcome-wide studies and extends to clustering other unknown pseudo-outcomes.

Core claim

Causal k-Means Clustering applies the k-means algorithm to unknown counterfactual functions to identify subgroup structure in treatment effects. A plug-in estimator is studied for its convergence rate, and a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning is shown to achieve root-n rates and asymptotic normality in large nonparametric models. The methods are useful for outcome-wide studies with multiple treatments and extensible to generic pseudo-outcomes.

What carries the argument

The bias-corrected estimator based on nonparametric efficiency theory and double machine learning applied to k-means clustering of counterfactual functions.

If this is right

Allows subgroup discovery in studies with multiple treatment levels and many outcomes.
Delivers valid asymptotic inference through root-n normality of the bias-corrected estimator.
Extends to clustering with partially observed outcomes or other unknown functions.
The plug-in version can be implemented with standard off-the-shelf clustering algorithms.
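
The plug-in idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the simulated data, the linear outcome regressions, and the helper names `fit_linear` and `lloyd_kmeans` are all assumptions made for the example. It estimates one outcome regression per treatment arm, stacks the fitted values into a vector of estimated counterfactual means per unit, and runs plain Lloyd iterations on those vectors.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations from random data points; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

# Hypothetical toy data: two subgroups whose effect of treatment 1 has opposite sign.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)
X = np.column_stack([rng.uniform(-1, 1, n), group])
A = rng.integers(0, 2, size=n)                           # randomized binary treatment
mu = np.column_stack([np.zeros(n), 4.0 * group - 2.0])   # true (mu_0, mu_1)
Y = mu[np.arange(n), A] + rng.normal(0, 0.5, n)

# Plug-in step: regress Y on X within each arm, predict for everyone, then cluster.
mu_hat = np.column_stack([fit_linear(X[A == a], Y[A == a])(X) for a in (0, 1)])
centers, labels = lloyd_kmeans(mu_hat, k=2)
agreement = max(np.mean(labels == group), np.mean(labels != group))
```

In practice the linear fits would be replaced by whatever regression method suits the data; the clustering step only sees the fitted values.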

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clustering machinery could be applied to other causal estimands such as conditional average treatment effects.
Discovered subgroups might be tested for improved prediction of individual responses in held-out data.
The method could support adaptive treatment rules that assign interventions differently across identified clusters.

Load-bearing premise

The counterfactual functions to be clustered can be estimated at rates sufficient for the k-means objective and the double machine learning correction to deliver root-n convergence.

What would settle it

A simulation study or real-data application in which the bias-corrected estimator fails to achieve root-n rates or asymptotic normality when the counterfactual functions converge at slower nonparametric rates.

Figures

Figures reproduced from arXiv: 2405.03083 by Edward H. Kennedy, Jisu Kim, Kwangho Kim.

**Figure 4.** (Left) Excess risk R(Ĉ) − R(C⋆) across sample sizes. (Right) Codebook error (1/6)∥Ĉ − C⋆∥₁ across sample sizes.

5.2 Case Study: PROPEL Chronic Low Back Pain Trial. We illustrate the causal clustering approach using data from the Problem-Solving Pain to Enhance Living Well (PROPEL) clinical trial, which was designed to evaluate the effectiveness of mobile-supported self-management interventions for pa…
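
The two error measures named in the Figure 4 caption can be computed roughly as follows. This is a sketch under assumptions: the risk is taken to be the mean squared distance to the nearest center, the L1 codebook error is divided by the number of codebook entries (which equals 6 for a 3 × 2 codebook, matching the 1/6 factor only by assumption), and the minimization over center orderings is added here to handle label switching. The function names and toy numbers are hypothetical.

```python
import numpy as np
from itertools import permutations

def kmeans_risk(mu, codebook):
    """Mean squared distance from each row of mu to its nearest center."""
    d2 = ((mu[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def codebook_l1_error(c_hat, c_star):
    """Entrywise L1 distance, minimized over orderings of the centers,
    divided by the number of entries in the codebook."""
    k = len(c_star)
    best = min(np.abs(c_hat[list(p)] - c_star).sum()
               for p in permutations(range(k)))
    return best / c_star.size

# Toy check: points sit exactly on the true centers, estimate is shifted by 0.1.
c_star = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
mu = np.repeat(c_star, 50, axis=0)
c_hat = c_star[[2, 0, 1]] + 0.1          # permuted and perturbed codebook

excess = kmeans_risk(mu, c_hat) - kmeans_risk(mu, c_star)
err = codebook_l1_error(c_hat, c_star)
```

The brute-force search over permutations is only sensible for small k; a linear assignment solver would be the usual substitute for larger codebooks.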

Original abstract

Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Causal k-means gives a workable framing for clustering on counterfactual functions, but the root-n DML rate claim rests on nuisance conditions that the abstract does not state.


The paper introduces causal k-means to group units by their unknown counterfactual outcome functions under different treatments. The authors give a plug-in estimator and a bias-corrected version built from nonparametric efficiency theory and double machine learning, claiming the latter reaches root-n rates and asymptotic normality in large nonparametric models. The setup is extended to multiple treatment levels and to generic pseudo-outcomes, with simulations and an application to a mobile health study on back pain.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Causal K-Means Clustering to identify unknown subgroup structures in heterogeneous treatment effects by applying the k-means algorithm to estimated counterfactual functions. It introduces a simple plug-in estimator whose convergence rate is studied, along with a bias-corrected estimator derived from nonparametric efficiency theory and double machine learning (DML) that is claimed to attain root-n rates and asymptotic normality in large nonparametric models. The approach is positioned as useful for outcome-wide studies with multiple treatment levels, extensible to generic pseudo-outcomes, and is supported by simulations and an application to a study of mobile-supported self-management for chronic low back pain.

Significance. If the root-n and asymptotic normality claims are rigorously established, the work would provide a practical and extensible tool for discovering latent subgroups in causal settings where population-level summaries are insufficient. The combination of k-means with DML bias correction for counterfactuals represents a novel extension of existing clustering and efficiency techniques, with potential applicability to modern multi-outcome studies.

major comments (1)

[Abstract] The claim that the bias-corrected estimator 'achieves fast root-n rates and asymptotic normality in large nonparametric models' does not state the required convergence rates (faster than n^{-1/4}) for the nuisance estimators of the counterfactual functions. Without these conditions, it is impossible to verify that the remainder in the von Mises expansion of the k-means objective is o_p(n^{-1/2}), especially given the non-smooth nature of the clustering map.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need for greater precision in the abstract. We address the major comment below and will revise the manuscript accordingly.


Referee: The claim that the bias-corrected estimator 'achieves fast root-n rates and asymptotic normality in large nonparametric models' does not state the required convergence rates (faster than n^{-1/4}) for the nuisance estimators of the counterfactual functions. Without these conditions, it is impossible to verify that the remainder in the von Mises expansion of the k-means objective is o_p(n^{-1/2}), especially given the non-smooth nature of the clustering map.

Authors: We agree that the abstract should explicitly state the required rates on the nuisance estimators. The full manuscript assumes standard DML conditions under which the nuisance estimators (for the counterfactual functions) converge faster than n^{-1/4}; these conditions ensure the remainder term in the von Mises expansion of the k-means objective is o_p(n^{-1/2}). We will revise the abstract to include this requirement. The non-smoothness of the clustering map is addressed in the theoretical development by bounding the effect of the clustering step on the expansion, which holds under the stated rates.

revision: yes
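
The n^{-1/4} threshold in this exchange has a short arithmetic explanation, assuming the leading remainder is a product of nuisance errors, as is typical in double machine learning. Here \hat\mu_a and \hat\pi_a denote the estimated outcome regression and propensity score for treatment level a, and the norm is L_2(P); other remainder terms tied to the non-smooth clustering map may need their own rate conditions, which this sketch does not cover.

```latex
\|\hat\mu_a - \mu_a\| = o_P\!\left(n^{-1/4}\right), \quad
\|\hat\pi_a - \pi_a\| = o_P\!\left(n^{-1/4}\right)
\;\Longrightarrow\;
\|\hat\mu_a - \mu_a\| \, \|\hat\pi_a - \pi_a\|
  = o_P\!\left(n^{-1/4} \cdot n^{-1/4}\right)
  = o_P\!\left(n^{-1/2}\right).
```

Only the product has to be o_P(n^{-1/2}), so a slower rate for one nuisance estimator can in principle be offset by a faster rate for the other.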

Circularity Check

0 steps flagged

No circularity; central claims invoke external DML theory


The paper develops a bias-corrected estimator by applying standard nonparametric efficiency theory and double machine learning to the causal k-means objective. These are established external results (not derived within the paper or via self-citation chains that reduce to the target claim). The abstract states the root-n convergence as a consequence of this application in large nonparametric models, without any equation or step that defines the rates in terms of the clustering outputs themselves or renames fitted quantities as predictions. No self-definitional, fitted-input, or uniqueness-imported patterns appear in the provided text, so the derivation chain remains independent of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard causal identification assumptions plus rate conditions from double machine learning theory; no new entities are introduced and the number of clusters is a user choice rather than a fitted parameter in the abstract.

free parameters (1)

number of clusters k
User-specified or selected via cross-validation; central to applying k-means to the estimated counterfactuals.
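
One common heuristic for choosing k is to compare the within-cluster sum of squares (WCSS) across candidate values and look at the relative gain from k−1 to k, normalized by the k = 1 WCSS. The sketch below is illustrative only: the toy points stand in for estimated counterfactual vectors, the farthest-first initialization is a convenience chosen here for reproducibility, and the helper names are hypothetical.

```python
import numpy as np

def farthest_first_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point farthest from the centers chosen so far."""
    centers = [points[0]]
    for _ in range(k - 1):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    return np.array(centers, dtype=float)

def wcss(points, k, n_iter=50):
    """Within-cluster sum of squares after Lloyd iterations with k centers."""
    centers = farthest_first_init(points, k)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Toy data with three well-separated groups (a stand-in for estimated mu vectors).
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in true_centers])

W = {k: wcss(points, k) for k in range(1, 6)}
gain = {k: (W[k - 1] - W[k]) / W[1] for k in range(2, 6)}  # relative gain in fit
```

On data like this the gain is large up to k = 3 and close to zero afterwards, which is the kind of pattern the elbow heuristic looks for.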

axioms (2)

domain assumption: Standard causal assumptions (ignorability, positivity, consistency) for identifying counterfactual outcomes
Required to define and estimate the counterfactual functions that are clustered.
domain assumption: Neyman orthogonality and nuisance estimator rate conditions from double machine learning theory
Invoked to obtain root-n rates and asymptotic normality for the bias-corrected estimator.
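
To make the two axioms concrete, here is a generic cross-fitted doubly robust (AIPW) pseudo-outcome for the counterfactual mean under each treatment. It is a textbook construction offered as an illustration of Neyman-orthogonal scores under ignorability and positivity; it is not claimed to be the paper's bias-corrected estimator, which targets the k-means objective. The simulated data, the linear outcome model, and the constant propensity estimate are assumptions of the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(-1, 1, size=(n, 2))
A = rng.integers(0, 2, size=n)      # randomized treatment, so positivity holds
mu_true = np.column_stack([X[:, 0], X[:, 0] + 1.0 + X[:, 1]])  # (mu_0, mu_1)
Y = mu_true[np.arange(n), A] + rng.normal(0, 1, n)

# Two-fold cross-fitting: nuisances are fit on one fold, evaluated on the other.
folds = rng.permutation(n) % 2
phi = np.zeros((n, 2))
for f in (0, 1):
    train, test = folds != f, folds == f
    for a in (0, 1):
        arm = train & (A == a)
        mu_hat = fit_linear(X[arm], Y[arm])(X[test])   # outcome regression
        pi_hat = np.mean(A[train] == a)                # constant propensity estimate
        # AIPW pseudo-outcome: regression prediction plus weighted residual.
        phi[test, a] = mu_hat + (A[test] == a) / pi_hat * (Y[test] - mu_hat)

ate_hat = (phi[:, 1] - phi[:, 0]).mean()   # true average effect is 1 in this design
```

With observational data the constant `pi_hat` would be replaced by a fitted propensity model, and both nuisance fits would typically use flexible learners.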


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: R(C) = E∥μ − Π_C(μ)∥²₂ with margin condition on Voronoi boundaries

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

[1] Arthur, D. and Vassilvitskii, S. (2007), k-means++: The advantages of careful seeding, in ‘Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms’, Society for Industrial and Applied Mathematics, pp. 1027–1035.
Athey, S. and Imbens, G. (2016), ‘Recursive partitioning for heterogeneous causal effects’, Proceedings of the National Aca...
[2] Foster, J. C., Taylor, J. M. and Ruberg, S. J. (2011), ‘Subgroup identification from randomized clinical trial data’, Statistics in Medicine 30(24), 2867–2880.
Giné, E. and Nickl, R. (2021), Mathematical foundations of infinite-dimensional statistical models, Cambridge University Press.
Graf, S. and Luschgy, H. (2007), Foundations of quantization for pro...
[3] Newey, W. K. and Robins, J. R. (2018), ‘Cross-fitting and fast remainder rates for semiparametric estimation’, arXiv preprint arXiv:1801.09138.
Nie, X. and Wager, S. (2021), ‘Quasi-oracle estimation of heterogeneous treatment effects’, Biometrika 108(2), 299–319.
Pollard, D. (1981), ‘Strong consistency of k-means clustering’, The Annals of Statistics pp. 1...
[4] Serafini, A., Murphy, T. B. and Scrucca, L. (2020), ‘Handling missing data in model-based clustering’, arXiv preprint arXiv:2006.02954.
Shahn, Z. and Madigan, D. (2017), ‘Latent class mixture models of treatment effect heterogeneity’, Bayesian Analysis 12(3), 831–854.
Shalit, U., Johansson, F. D. and Sontag, D. (2017), Estimating individual treatment effec...
[5] Suk, Y., Kim, J.-S. and Kang, H. (2021), ‘Hybridizing machine learning methods and finite mixture models for estimating heterogeneous treatment effects in latent classes’, Journal of Educational and Behavioral Statistics 46(3), 323–347.
Tseng, G. C. and Wong, W. H. (2005), ‘Tight clustering: a resampling-based approach for identifying stable and tight patte...
[6] VanderWeele, T. J., Li, S., Tsai, A. C. and Kawachi, I. (2016), ‘Association between religious service attendance and lower suicide rates among US women’, JAMA Psychiatry 73(8), 845–
[7] Wager, S. and Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–
[8] Zhang, W., Le, T. D., Liu, L., Zhou, Z.-H. and Li, J. (2017), ‘Mining heterogeneous causal effects for personalized cancer treatment’, Bioinformatics 33(15), 2372–2378.
Zhang, Z., Chen, Z., Troendle, J. F. and Zhang, J. (2012), ‘Causal inference on quantiles with an obstetric application’, Biometrics 68(3), 697–706.
Zheng, W. and Van Der Laan, M. J. (2010), ‘...
[9] (b) Relative gain in fit from k−1 to k, normalized by the k = 1 WCSS, indicating that the improvement at k = 6 is the last meaningful increase prior to diminishing returns. Both diagnostics jointly support selecting k = 6 as an appropriate number of causal clusters. (a) (b) Cluster-level effect heterogeneity and covariate profiles. In this analysis, we select ten ba...
[10] Case 2. $f > 0$ and $\hat f \le 0$. Then the same argument holds, and again, $\max\{|\hat f|, |f|\} \le |\hat f - f|$. In particular, $|f| \le |\hat f - f|$, and since $f > 0$, the first indicator on the RHS evaluates to 1. Hence, in either case, one of the two indicators on the right-hand side is 1, ensuring the inequality holds. The following lemma establishes that the projection error arising from perturba...
[11] implies the following bound for the first term in (A.4):
$\sup_{C \in \mathcal{C}_k} |R(C) - R_n(C)| = O_P\left(\sqrt{\frac{\log n}{n}}\right)$ (A.5)
For the second term in (A.4), we observe that
$\hat R_n(\hat C) - R_n(\hat C) = P_n\{f_{\hat C}(\hat\mu)\} - P_n\{f_{\hat C}(\mu)\} = (P_n - P)\{f_{\hat C}(\hat\mu) - f_{\hat C}(\mu)\} + P\{f_{\hat C}(\hat\mu) - f_{\hat C}(\mu)\}$.
The terms on the RHS of the last display can be bounded using techniques that will be developed in the proo...
[12] Let the VC index of $\mathcal{F}_n$ be $\nu' < \infty$. Then we have
$\sup_Q N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)) \lesssim \left(\frac{c_1}{\epsilon}\right)^{c_2 \nu'}$
for some universal constants $c_1, c_2 > 0$. Hence applying Giné and Nickl (2021, Theorem 3.5.4), we obtain that
$P\left\{\sup_{f \in \mathcal{F}_n} |G_n(f)|\right\} \lesssim \|F_n\| \sup_Q \int_0^1 \sqrt{1 + \log N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q))}\, d\epsilon \lesssim \|F_n\| \int_0^1 \sqrt{1 + \nu' \log(1/\epsilon)}\, d\epsilon$.
Taking the envelope $F_n = \sup_{C \in \mathcal{C}_k} |f_C(\hat\mu) - f_C(\mu)|$ which is boun...
[13] By the same argument used to derive (A.9) and (A.10) in the proof of Lemma A.7, this is bounded by
$\max_a \|\hat\mu_a - \mu_a\|_\infty^{\alpha+1} + \frac{1}{\kappa} \max_a \|\hat\mu_a - \mu_a\|_\infty \|\hat\mu_a - \mu_a\|_{P,1}$.
Therefore,
$P\{\varphi_{C^*}(Z; \hat\eta) - \varphi_{C^*}(Z; \eta)\} \lesssim \left( \max_a \|\hat\mu_a - \mu_a\| \|\hat\pi_a - \pi_a\| + \max_a \|\hat\mu_a - \mu_a\|_\infty^{\alpha+1} + \frac{1}{\kappa} \max_a \|\hat\mu_a - \mu_a\|_\infty \|\hat\mu_a - \mu_a\|_{P,1} \right) 1(p)$.
Finally, consider the first term in (A.19). The derivative of $C \mapsto P\{\varphi_C(Z; \eta)\}$ at $C = C^*$ is...
