
arxiv: 2405.03083 · v5 · submitted 2024-05-05 · 📊 stat.ME · cs.LG · stat.ML

Causal K-Means Clustering

Pith reviewed 2026-05-24 00:59 UTC · model grok-4.3

classification 📊 stat.ME · cs.LG · stat.ML
keywords causal inference · k-means clustering · heterogeneous treatment effects · double machine learning · subgroup analysis · nonparametric estimation · bias correction · counterfactual functions

The pith

Causal k-Means Clustering identifies unknown subgroups with heterogeneous treatment effects by clustering on counterfactual functions, with a double machine learning estimator achieving root-n convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Causal k-Means Clustering to find hidden subgroups where treatment effects differ by applying k-means directly to estimated counterfactual functions rather than observed data. Population-level summaries often miss this heterogeneity, so recovering the subgroup structure allows more precise evaluation of how effects vary across units. A straightforward plug-in estimator is introduced and its convergence rate analyzed, followed by a bias-corrected version that uses nonparametric efficiency theory and double machine learning to reach root-n rates and asymptotic normality even in large nonparametric models. The approach handles multiple treatment levels in outcome-wide studies and extends to clustering other unknown pseudo-outcomes.

Core claim

Causal k-Means Clustering applies the k-means algorithm to unknown counterfactual functions to identify subgroup structure in treatment effects. A plug-in estimator is studied for its convergence rate, and a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning is shown to achieve root-n rates and asymptotic normality in large nonparametric models. The methods are useful for outcome-wide studies with multiple treatments and extensible to generic pseudo-outcomes.
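
As a rough illustration of the plug-in idea, the sketch below first estimates each counterfactual regression function and then runs k-means on the resulting vectors. Everything in it is an assumption made for illustration, not the authors' implementation: the binned-mean regressor (`regressogram`), the farthest-first initialisation, the function names, and the simulated data with one benefit group and one harm group. Any off-the-shelf regression learner and clustering routine could stand in for these pieces.

```python
import random


def regressogram(xs, ys, n_bins=8, lo=-1.0, hi=1.0):
    """Binned-mean regression: a simple stand-in for any regression learner."""
    width = (hi - lo) / n_bins

    def bin_of(x):
        return min(max(int((x - lo) / width), 0), n_bins - 1)

    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        sums[bin_of(x)] += y
        counts[bin_of(x)] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_of(x)]


def sq_dist(p, q):
    return sum((u - v) ** 2 for u, v in zip(p, q))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def plug_in_causal_kmeans(xs, treatments, ys, k, arms=(0, 1)):
    """Estimate mu_a(x) per treatment arm, then cluster the estimated vectors."""
    mu_hat = {}
    for a in arms:
        xa = [x for x, t in zip(xs, treatments) if t == a]
        ya = [y for y, t in zip(ys, treatments) if t == a]
        mu_hat[a] = regressogram(xa, ya)
    points = [tuple(mu_hat[a](x) for a in arms) for x in xs]
    return kmeans(points, k)


# Hypothetical data: treatment helps when x > 0 and harms when x < 0.
rng = random.Random(1)
n = 400
xs = [rng.uniform(-1, 1) for _ in range(n)]
treatments = [rng.randint(0, 1) for _ in range(n)]
ys = [t * (2.0 if x > 0 else -2.0) + rng.gauss(0, 0.5) for x, t in zip(xs, treatments)]

centers, labels = plug_in_causal_kmeans(xs, treatments, ys, k=2)
```

Note that the clustering acts on the estimated vector of counterfactual means for each unit, not on the observed outcomes, which is the distinction the summary above draws.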

What carries the argument

The bias-corrected estimator based on nonparametric efficiency theory and double machine learning applied to k-means clustering of counterfactual functions.
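
To make the bias-correction idea concrete, here is a minimal sketch of a standard doubly robust (AIPW) pseudo-outcome, which combines an outcome regression estimate with a propensity score estimate. This page does not confirm that the paper's estimator takes exactly this form, so treat the formula, the function name `aipw_pseudo_outcome`, and the toy numbers as assumptions chosen for illustration.

```python
def aipw_pseudo_outcome(y, a_obs, arm, mu_hat_arm, pi_hat_arm):
    """Doubly robust pseudo-outcome for one unit's counterfactual mean under `arm`.

    y          : observed outcome
    a_obs      : observed treatment
    mu_hat_arm : estimated outcome regression mu_arm(x) for this unit
    pi_hat_arm : estimated probability of receiving `arm` given x
    """
    indicator = 1.0 if a_obs == arm else 0.0
    return mu_hat_arm + indicator / pi_hat_arm * (y - mu_hat_arm)


# Toy data (hypothetical): the true mean outcome under arm 1 is 3.0 for everyone,
# and the true propensity of arm 1 is 0.5.
ys = [3.0, 1.0, 3.0, 1.0]
treatments = [1, 0, 1, 0]

# Case 1: outcome model badly wrong (0.0), propensity correct (0.5).
wrong_mu = [aipw_pseudo_outcome(y, a, 1, 0.0, 0.5) for y, a in zip(ys, treatments)]
mean_wrong_mu = sum(wrong_mu) / len(wrong_mu)

# Case 2: outcome model correct (3.0), propensity wrong (0.9).
wrong_pi = [aipw_pseudo_outcome(y, a, 1, 3.0, 0.9) for y, a in zip(ys, treatments)]
mean_wrong_pi = sum(wrong_pi) / len(wrong_pi)
```

In both cases the average pseudo-outcome recovers 3.0 on this toy data, which is the double robustness property that makes the correction tolerant of error in one nuisance estimate at a time.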

If this is right

  • Allows subgroup discovery in studies with multiple treatment levels and many outcomes.
  • Delivers valid asymptotic inference through root-n normality of the bias-corrected estimator.
  • Extends to clustering with partially observed outcomes or other unknown functions.
  • The plug-in version can be implemented with standard off-the-shelf clustering algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering machinery could be applied to other causal estimands such as conditional average treatment effects.
  • Discovered subgroups might be tested for improved prediction of individual responses in held-out data.
  • The method could support adaptive treatment rules that assign interventions differently across identified clusters.

Load-bearing premise

The counterfactual functions to be clustered can be estimated at rates sufficient for the k-means objective and the double machine learning correction to deliver root-n convergence.

What would settle it

A simulation study or real-data application in which the bias-corrected estimator fails to achieve root-n rates or asymptotic normality when the counterfactual functions converge at slower nonparametric rates.
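
One simple way to read such a simulation is to regress log error on log sample size and inspect the slope: a slope near -1/2 is consistent with root-n convergence, while a slope near -1/4 would indicate a slower rate. The helper below is a hypothetical sketch of that diagnostic. The error sequences are idealised values generated from exact power laws, not results from the paper.

```python
import math


def empirical_rate(sample_sizes, errors):
    """Least-squares slope of log(error) against log(n)."""
    lx = [math.log(n) for n in sample_sizes]
    ly = [math.log(e) for e in errors]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    sxx = sum((u - mx) ** 2 for u in lx)
    return sxy / sxx


sample_sizes = [250, 500, 1000, 2000, 4000]

# Idealised error curves (assumed for illustration only).
root_n_errors = [2.0 / math.sqrt(n) for n in sample_sizes]
slow_errors = [2.0 * n ** -0.25 for n in sample_sizes]

root_n_slope = empirical_rate(sample_sizes, root_n_errors)
slow_slope = empirical_rate(sample_sizes, slow_errors)
```

In a real study the errors would be averaged over many replications at each sample size before fitting the slope, since single-run errors are noisy.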

Figures

Figures reproduced from arXiv: 2405.03083 by Edward H. Kennedy, Jisu Kim, Kwangho Kim.

Figure 4: (Left) Excess risk R(Ĉ) − R(C⋆) across sample sizes. (Right) Codebook error (1/6)‖Ĉ − C⋆‖₁ across sample sizes.

Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Causal K-Means Clustering to identify unknown subgroup structures in heterogeneous treatment effects by applying the k-means algorithm to estimated counterfactual functions. It introduces a simple plug-in estimator whose convergence rate is studied, along with a bias-corrected estimator derived from nonparametric efficiency theory and double machine learning (DML) that is claimed to attain root-n rates and asymptotic normality in large nonparametric models. The approach is positioned as useful for outcome-wide studies with multiple treatment levels, extensible to generic pseudo-outcomes, and is supported by simulations and an application to a study of mobile-supported self-management for chronic low back pain.

Significance. If the root-n and asymptotic normality claims are rigorously established, the work would provide a practical and extensible tool for discovering latent subgroups in causal settings where population-level summaries are insufficient. The combination of k-means with DML bias correction for counterfactuals represents a novel extension of existing clustering and efficiency techniques, with potential applicability to modern multi-outcome studies.

major comments (1)
  1. [Abstract] The claim that the bias-corrected estimator 'achieves fast root-n rates and asymptotic normality in large nonparametric models' does not state the required convergence rates (faster than n^{-1/4}) for the nuisance estimators of the counterfactual functions. Without these conditions, it is impossible to verify that the remainder in the von Mises expansion of the k-means objective is o_p(n^{-1/2}), especially given the non-smooth nature of the clustering map.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the need for greater precision in the abstract. We address the major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: The claim that the bias-corrected estimator 'achieves fast root-n rates and asymptotic normality in large nonparametric models' does not state the required convergence rates (faster than n^{-1/4}) for the nuisance estimators of the counterfactual functions. Without these conditions, it is impossible to verify that the remainder in the von Mises expansion of the k-means objective is o_p(n^{-1/2}), especially given the non-smooth nature of the clustering map.

    Authors: We agree that the abstract should explicitly state the required rates on the nuisance estimators. The full manuscript assumes standard DML conditions under which the nuisance estimators (for the counterfactual functions) converge faster than n^{-1/4}; these conditions ensure the remainder term in the von Mises expansion of the k-means objective is o_p(n^{-1/2}). We will revise the abstract to include this requirement. The non-smoothness of the clustering map is addressed in the theoretical development by bounding the effect of the clustering step on the expansion, which holds under the stated rates.

    Revision: yes
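
The role of the n^{-1/4} threshold can be seen from a short piece of rate arithmetic. The display below is a sketch under an assumption typical of double machine learning arguments: that the second-order remainder R_n is bounded by a product of nuisance errors. Here \hat\pi_a denotes an estimated propensity score, a symbol introduced for illustration; the remainder actually analysed in the paper may contain further terms, for instance from the non-smooth clustering step.

```latex
R_n \;\lesssim\; \max_a \|\hat\mu_a - \mu_a\| \, \|\hat\pi_a - \pi_a\|
    \;=\; o_P\!\left(n^{-1/4}\right) \cdot o_P\!\left(n^{-1/4}\right)
    \;=\; o_P\!\left(n^{-1/2}\right).
```

Under this assumed bound, each nuisance estimator only needs to beat n^{-1/4} for the remainder to be negligible relative to the n^{-1/2} scale of the leading term.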

Circularity Check

0 steps flagged

No circularity; central claims invoke external DML theory


The paper develops a bias-corrected estimator by applying standard nonparametric efficiency theory and double machine learning to the causal k-means objective. These are established external results (not derived within the paper or via self-citation chains that reduce to the target claim). The abstract states the root-n convergence as a consequence of this application in large nonparametric models, without any equation or step that defines the rates in terms of the clustering outputs themselves or renames fitted quantities as predictions. No self-definitional, fitted-input, or uniqueness-imported patterns appear in the provided text, so the derivation chain remains independent of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard causal identification assumptions plus rate conditions from double machine learning theory; no new entities are introduced and the number of clusters is a user choice rather than a fitted parameter in the abstract.

free parameters (1)
  • number of clusters k
    User-specified or selected via cross-validation; central to applying k-means to the estimated counterfactuals.
axioms (2)
  • domain assumption: Standard causal assumptions (ignorability, positivity, consistency) for identifying counterfactual outcomes
    Required to define and estimate the counterfactual functions that are clustered.
  • domain assumption: Neyman orthogonality and nuisance estimator rate conditions from double machine learning theory
    Invoked to obtain root-n rates and asymptotic normality for the bias-corrected estimator.

pith-pipeline@v0.9.0 · 5733 in / 1490 out tokens · 32635 ms · 2026-05-24T00:59:44.225824+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
