Causal K-Means Clustering
Pith reviewed 2026-05-24 00:59 UTC · model grok-4.3
The pith
Causal k-Means Clustering identifies unknown subgroups with heterogeneous treatment effects by clustering on counterfactual functions, with a double machine learning estimator achieving root-n convergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Causal k-Means Clustering applies the k-means algorithm to unknown counterfactual functions to identify subgroup structure in treatment effects. A plug-in estimator is studied for its convergence rate, and a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning is shown to achieve root-n rates and asymptotic normality in large nonparametric models. The methods are useful for outcome-wide studies with multiple treatments and extensible to generic pseudo-outcomes.
What carries the argument
The bias-corrected estimator based on nonparametric efficiency theory and double machine learning applied to k-means clustering of counterfactual functions.
If this is right
- Allows subgroup discovery in studies with multiple treatment levels and many outcomes.
- Delivers valid asymptotic inference through root-n normality of the bias-corrected estimator.
- Extends to clustering with partially observed outcomes or other unknown functions.
- The plug-in version can be implemented with standard off-the-shelf clustering algorithms.
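The plug-in idea in the last bullet can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `fit_outcome_regressions` and `lloyd_kmeans` are hypothetical, the outcome regressions are assumed linear purely for brevity, and a plain Lloyd iteration stands in for whatever off-the-shelf k-means routine one would normally call.

```python
import numpy as np

def fit_outcome_regressions(X, A, Y, arms):
    """Least-squares fit of mu_a(x) = E[Y | X = x, A = a] for each arm a.

    Linear regression is an illustrative assumption; any regression
    method could be swapped in.
    """
    design = np.column_stack([np.ones(len(X)), X])
    columns = []
    for a in arms:
        mask = A == a
        beta, *_ = np.linalg.lstsq(design[mask], Y[mask], rcond=None)
        # Evaluate this arm's fitted regression at every unit's covariates.
        columns.append(design @ beta)
    return np.column_stack(columns)

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations with initial centers drawn from the data."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Empirical k-means risk: mean squared distance to the nearest center.
    risk = dists.min(axis=1).mean()
    return centers, labels, risk

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 500
    X = rng.normal(size=(n, 2))
    A = rng.integers(0, 2, size=n)          # two treatment arms (synthetic)
    Y = 1.0 + X[:, 0] + A * 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
    mu_hat = fit_outcome_regressions(X, A, Y, arms=[0, 1])
    centers, labels, risk = lloyd_kmeans(mu_hat, k=3)
    print(centers.shape, np.bincount(labels), risk)
```

Each unit is represented by its vector of estimated counterfactual means across arms, and clustering happens in that space rather than in covariate space. The sketch fits and evaluates the regressions on the same sample and applies no bias correction, so it should not be read as achieving the root-n guarantees discussed above.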
Where Pith is reading between the lines
- The same clustering machinery could be applied to other causal estimands such as conditional average treatment effects.
- Discovered subgroups might be tested for improved prediction of individual responses in held-out data.
- The method could support adaptive treatment rules that assign interventions differently across identified clusters.
Load-bearing premise
The counterfactual functions to be clustered can be estimated at rates sufficient for the k-means objective and the double machine learning correction to deliver root-n convergence.
What would settle it
A simulation study or real-data application in which the bias-corrected estimator fails to achieve root-n rates or asymptotic normality when the counterfactual functions converge at slower nonparametric rates.
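One simple way to read such a simulation is to estimate the empirical convergence exponent from errors recorded at several sample sizes. The helper below is a hypothetical diagnostic, not something taken from the paper: it fits a line to log error against log n, so a slope near -0.5 is consistent with root-n behaviour and a slope closer to zero suggests a slower rate.

```python
import numpy as np

def empirical_rate_exponent(sample_sizes, errors):
    """Slope of log(error) against log(n) from a least-squares line fit."""
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_err, deg=1)
    return slope

if __name__ == "__main__":
    ns = [250, 1000, 4000]
    # Synthetic errors that decay at exactly the root-n rate (illustration only).
    rmse = [2.0 / np.sqrt(n) for n in ns]
    print(empirical_rate_exponent(ns, rmse))
```

In a real study the `errors` would be Monte Carlo RMSEs of the estimated cluster centers or risk, and the fitted slope would be noisy. It is a descriptive check rather than a formal test of the rate claim.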
Original abstract
Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Causal K-Means Clustering to identify unknown subgroup structures in heterogeneous treatment effects by applying the k-means algorithm to estimated counterfactual functions. It introduces a simple plug-in estimator whose convergence rate is studied, along with a bias-corrected estimator derived from nonparametric efficiency theory and double machine learning (DML) that is claimed to attain root-n rates and asymptotic normality in large nonparametric models. The approach is positioned as useful for outcome-wide studies with multiple treatment levels, extensible to generic pseudo-outcomes, and is supported by simulations and an application to a study of mobile-supported self-management for chronic low back pain.
Significance. If the root-n and asymptotic normality claims are rigorously established, the work would provide a practical and extensible tool for discovering latent subgroups in causal settings where population-level summaries are insufficient. The combination of k-means with DML bias correction for counterfactuals represents a novel extension of existing clustering and efficiency techniques, with potential applicability to modern multi-outcome studies.
major comments (1)
- [Abstract] The claim that the bias-corrected estimator 'achieves fast root-n rates and asymptotic normality in large nonparametric models' is made without stating the required convergence rates (faster than n^{-1/4}) for the nuisance estimators of the counterfactual functions. Without these conditions, it is impossible to verify that the remainder in the von Mises expansion of the k-means objective is o_p(n^{-1/2}), especially given the non-smooth nature of the clustering map.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting the need for greater precision in the abstract. We address the major comment below and will revise the manuscript accordingly.
-
Referee: The claim that the bias-corrected estimator 'achieves fast root-n rates and asymptotic normality in large nonparametric models' does not state the required convergence rates (faster than n^{-1/4}) for the nuisance estimators of the counterfactual functions. Without these conditions, it is impossible to verify that the remainder in the von Mises expansion of the k-means objective is o_p(n^{-1/2}), especially given the non-smooth nature of the clustering map.
Authors: We agree that the abstract should explicitly state the required rates on the nuisance estimators. The full manuscript assumes standard DML conditions under which the nuisance estimators (for the counterfactual functions) converge faster than n^{-1/4}; these conditions ensure the remainder term in the von Mises expansion of the k-means objective is o_p(n^{-1/2}). We will revise the abstract to include this requirement. The non-smoothness of the clustering map is addressed in the theoretical development by bounding the effect of the clustering step on the expansion, which holds under the stated rates.
Revision: yes
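The role of the n^{-1/4} threshold can be illustrated with the usual product-of-errors argument from double machine learning. The display below is a sketch under an assumption: that the second-order remainder is dominated by a product of two nuisance errors, written here with \hat\mu_a for the outcome regression and \hat\pi_a for the propensity score (notation assumed for illustration). The manuscript's actual remainder may contain further terms, for example from the non-smooth clustering map.

```latex
\[
\|\hat\mu_a - \mu_a\| \, \|\hat\pi_a - \pi_a\|
  = o_P(n^{-1/4}) \cdot o_P(n^{-1/4})
  = o_P(n^{-1/2}).
\]
```

So if each nuisance estimator converges faster than n^{-1/4}, the product term is negligible at the root-n scale, which is the condition the referee asks the abstract to state.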
Circularity Check
No circularity; central claims invoke external DML theory
The paper develops a bias-corrected estimator by applying standard nonparametric efficiency theory and double machine learning to the causal k-means objective. These are established external results (not derived within the paper or via self-citation chains that reduce to the target claim). The abstract states the root-n convergence as a consequence of this application in large nonparametric models, without any equation or step that defines the rates in terms of the clustering outputs themselves or renames fitted quantities as predictions. No self-definitional, fitted-input, or uniqueness-imported patterns appear in the provided text, so the derivation chain remains independent of its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of clusters k
axioms (2)
- domain assumption: Standard causal assumptions (ignorability, positivity, consistency) for identifying counterfactual outcomes
- domain assumption: Neyman orthogonality and nuisance estimator rate conditions from double machine learning theory
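To make the second axiom concrete, here is a sketch of the standard cross-fitted AIPW (doubly robust) estimate of a single counterfactual mean E[Y^a]. It is the generic building block behind Neyman-orthogonal estimation, not the paper's bias-corrected estimator of the k-means objective. The function name `crossfit_aipw_mean` is hypothetical, and two simplifying assumptions are made for illustration only: a linear outcome regression and a constant propensity score, as in a randomized trial.

```python
import numpy as np

def crossfit_aipw_mean(X, A, Y, arm, n_folds=2, seed=0):
    """Cross-fitted AIPW estimate of E[Y^arm] with a plug-in standard error.

    Illustrative assumptions: linear outcome regression, and a constant
    propensity score estimated by the training-fold share of units in `arm`.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds       # balanced random fold labels
    design = np.column_stack([np.ones(n), X])
    pseudo = np.empty(n)
    for f in range(n_folds):
        test = folds == f
        train = ~test
        in_arm = train & (A == arm)
        # Nuisance functions are fit on the training folds only.
        beta, *_ = np.linalg.lstsq(design[in_arm], Y[in_arm], rcond=None)
        pi_hat = np.mean(A[train] == arm)
        mu_hat = design[test] @ beta
        # Uncentered influence-function values on the held-out fold.
        indicator = (A[test] == arm).astype(float)
        pseudo[test] = mu_hat + indicator / pi_hat * (Y[test] - mu_hat)
    estimate = pseudo.mean()
    std_error = pseudo.std(ddof=1) / np.sqrt(n)
    return estimate, std_error

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 2000
    X = rng.normal(size=(n, 2))
    A = rng.integers(0, 2, size=n)             # randomized treatment (synthetic)
    Y = 1.0 + X[:, 0] + 2.0 * A + rng.normal(scale=0.5, size=n)
    print(crossfit_aipw_mean(X, A, Y, arm=1))  # true E[Y^1] is 3 in this design
```

Cross-fitting keeps the nuisance fits independent of the observations they are evaluated on, which is what lets the rate conditions above replace stronger complexity restrictions on the nuisance estimators.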
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction: unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models.
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
R(C) = E‖μ − Π_C(μ)‖₂² with margin condition on Voronoi boundaries
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Arthur, D. and Vassilvitskii, S. (2007), k-means++: The advantages of careful seeding, in ‘Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms’, Society for Industrial and Applied Mathematics, pp. 1027–1035.
Athey, S. and Imbens, G. (2016), ‘Recursive partitioning for heterogeneous causal effects’, Proceedings of the National Aca...
-
[2]
Foster, J. C., Taylor, J. M. and Ruberg, S. J. (2011), ‘Subgroup identification from randomized clinical trial data’, Statistics in medicine 30(24), 2867–2880.
Giné, E. and Nickl, R. (2021), Mathematical foundations of infinite-dimensional statistical models, Cambridge university press.
Graf, S. and Luschgy, H. (2007), Foundations of quantization for pro...
-
[3]
Newey, W. K. and Robins, J. M. (2018), ‘Cross-fitting and fast remainder rates for semiparametric estimation’, arXiv preprint arXiv:1801.09138.
Nie, X. and Wager, S. (2021), ‘Quasi-oracle estimation of heterogeneous treatment effects’, Biometrika 108(2), 299–319.
Pollard, D. (1981), ‘Strong consistency of k-means clustering’, The Annals of Statistics pp. 1...
-
[4]
Serafini, A., Murphy, T. B. and Scrucca, L. (2020), ‘Handling missing data in model-based clustering’, arXiv preprint arXiv:2006.02954.
Shahn, Z. and Madigan, D. (2017), ‘Latent class mixture models of treatment effect heterogeneity’, Bayesian Analysis 12(3), 831–854.
Shalit, U., Johansson, F. D. and Sontag, D. (2017), Estimating individual treatment effec...
-
[5]
Suk, Y., Kim, J.-S. and Kang, H. (2021), ‘Hybridizing machine learning methods and finite mixture models for estimating heterogeneous treatment effects in latent classes’, Journal of Educational and Behavioral Statistics 46(3), 323–347.
Tseng, G. C. and Wong, W. H. (2005), ‘Tight clustering: a resampling-based approach for identifying stable and tight patte...
-
[6]
VanderWeele, T. J., Li, S., Tsai, A. C. and Kawachi, I. (2016), ‘Association between religious service attendance and lower suicide rates among US women’, JAMA psychiatry 73(8), 845–
-
[7]
Wager, S. and Athey, S. (2018), ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–
-
[8]
Zhang, W., Le, T. D., Liu, L., Zhou, Z.-H. and Li, J. (2017), ‘Mining heterogeneous causal effects for personalized cancer treatment’, Bioinformatics 33(15), 2372–2378.
Zhang, Z., Chen, Z., Troendle, J. F. and Zhang, J. (2012), ‘Causal inference on quantiles with an obstetric application’, Biometrics 68(3), 697–706.
Zheng, W. and Van Der Laan, M. J. (2010), ‘...
-
[9]
(b) Relative gain in fit from k−1 to k, normalized by the k = 1 WCSS, indicating that the improvement at k = 6 is the last meaningful increase prior to diminishing returns. Both diagnostics jointly support selecting k = 6 as an appropriate number of causal clusters.
Cluster-level effect heterogeneity and covariate profiles. In this analysis, we select ten ba...
-
[10]
Case 2. $f > 0$ and $\hat{f} \le 0$. Then the same argument holds, and again, $\max\{|\hat{f}|, |f|\} \le |\hat{f} - f|$. In particular, $|f| \le |\hat{f} - f|$, and since $f > 0$, the first indicator on the RHS evaluates to 1. Hence, in either case, one of the two indicators on the right-hand side is 1, ensuring the inequality holds. The following lemma establishes that the projection error arising from perturba...
-
[11]
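implies the following bound for the first term in (A.4):
\[ \sup_{C \in \mathcal{C}_k} |R(C) - R_n(C)| = O_P\left(\sqrt{\frac{\log n}{n}}\right) \tag{A.5} \]
For the second term in (A.4), we observe that
\[ \hat{R}_n(\hat{C}) - R_n(\hat{C}) = \mathbb{P}_n\{f_{\hat{C}}(\hat{\mu})\} - \mathbb{P}_n\{f_{\hat{C}}(\mu)\} = (\mathbb{P}_n - P)\{f_{\hat{C}}(\hat{\mu}) - f_{\hat{C}}(\mu)\} + P\{f_{\hat{C}}(\hat{\mu}) - f_{\hat{C}}(\mu)\}. \]
The terms on the RHS of the last display can be bounded using techniques that will be developed in the proo...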
-
[12]
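Let the VC index of $\mathcal{F}_n$ be $\nu' < \infty$. Then we have
\[ \sup_Q N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)) \lesssim \left(\frac{c_1}{\epsilon}\right)^{c_2 \nu'} \]
for some universal constants $c_1, c_2 > 0$. Hence applying Giné and Nickl (2021, Theorem 3.5.4), we obtain that
\[ P\Big\{\sup_{f \in \mathcal{F}_n} |\mathbb{G}_n(f)|\Big\} \lesssim \|F_n\| \sup_Q \int_0^1 \sqrt{1 + \log N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q))}\, d\epsilon \lesssim \|F_n\| \int_0^1 \sqrt{1 + \nu' \log(1/\epsilon)}\, d\epsilon. \]
Taking the envelope $F_n = \sup_{C \in \mathcal{C}_k} |f_C(\hat{\mu}) - f_C(\mu)|$ which is boun...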
-
[13]
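By the same argument used to derive (A.9) and (A.10) in the proof of Lemma A.7, this is bounded by
\[ \max_a \|\hat{\mu}_a - \mu_a\|_\infty^{\alpha+1} + \frac{1}{\kappa} \max_a \|\hat{\mu}_a - \mu_a\|_\infty \|\hat{\mu}_a - \mu_a\|_{P,1}. \]
Therefore,
\[ P\{\varphi_{C^*}(Z; \hat{\eta}) - \varphi_{C^*}(Z; \eta)\} \lesssim \max_a \|\hat{\mu}_a - \mu_a\| \|\hat{\pi}_a - \pi_a\| + \max_a \|\hat{\mu}_a - \mu_a\|_\infty^{\alpha+1} + \frac{1}{\kappa} \max_a \|\hat{\mu}_a - \mu_a\|_\infty \|\hat{\mu}_a - \mu_a\|_{P,1} \, 1_{(p)}. \]
Finally, consider the first term in (A.19). The derivative of $C \mapsto P\{\varphi_C(Z; \eta)\}$ at $C = C^*$ is...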