Measuring Variable Importance in Heterogeneous Treatment Effects with Confidence

Angel Reyero Lobo; Bertrand Thirion; Denis A. Engemann; Joseph Paillard; Vitaliy Kolodyazhniy

arxiv: 2408.13002 · v4 · pith:EHTHB7T7new · submitted 2024-08-23 · 💻 cs.LG

Joseph Paillard, Angel Reyero Lobo, Vitaliy Kolodyazhniy, Bertrand Thirion, Denis A. Engemann

Pith reviewed 2026-05-23 21:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords variable importance · conditional average treatment effect · permutation importance · causal machine learning · heterogeneous treatment effects · finite sample analysis · biomedical applications

The pith

PermuCATE adapts conditional permutation importance to deliver lower-variance variable importance scores for conditional average treatment effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops PermuCATE to determine which input variables most influence how treatment effects vary across individuals. It modifies the conditional permutation importance procedure for use with CATE models and proves in finite samples that the resulting importance scores have lower variance than those from the leave-one-covariate-out baseline. This variance reduction directly raises the power to detect genuine drivers of treatment heterogeneity. The gain is especially relevant for biomedical studies that must work with modest sample sizes and correlated predictors, and the authors demonstrate the approach on both synthetic data and real health records.

Core claim

PermuCATE is an algorithm that applies conditional permutation importance to global variable importance assessment inside CATE estimation. Theoretical analysis of the finite-sample regime together with empirical comparisons shows that PermuCATE produces importance estimates whose variance is lower than the LOCO reference while remaining statistically reliable. The reduced variance improves detection power for variables that drive heterogeneous treatment responses, a property verified across simulated settings and real-world health datasets containing up to hundreds of correlated covariates.

What carries the argument

PermuCATE, the adaptation of conditional permutation importance that isolates each variable's contribution to CATE heterogeneity by conditional permutation while preserving finite-sample variance properties.
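The conditional permutation step can be illustrated with a minimal sketch. It assumes, as a simplification not stated on this page, that the conditional distribution of a covariate given the others is approximated by a linear regression plus shuffled residuals; the function name `conditional_permutation` and the two-variable toy data are hypothetical. The example contrasts it with a plain marginal permutation, which destroys the correlation between covariates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def conditional_permutation(X_fit, X_eval, j, rng):
    """Return a copy of X_eval whose column j is resampled given the other columns.

    Assumption: X_j | X_{-j} is modelled by a linear regression fitted on X_fit,
    and the residuals on X_eval are shuffled across samples.
    """
    rest = np.delete(np.arange(X_fit.shape[1]), j)
    cond = LinearRegression().fit(X_fit[:, rest], X_fit[:, j])
    mean_j = cond.predict(X_eval[:, rest])
    X_new = X_eval.copy()
    # Shuffling residuals keeps the part of X_j explained by the other covariates.
    X_new[:, j] = mean_j + rng.permutation(X_eval[:, j] - mean_j)
    return X_new


rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])  # toy correlated pair (hypothetical)
X = rng.multivariate_normal(np.zeros(2), cov, size=5000)

X_cond = conditional_permutation(X[:2500], X[2500:], j=0, rng=rng)
X_marg = X[2500:].copy()
X_marg[:, 0] = rng.permutation(X_marg[:, 0])  # marginal permutation for contrast

corr_cond = np.corrcoef(X_cond[:, 0], X_cond[:, 1])[0, 1]
corr_marg = np.corrcoef(X_marg[:, 0], X_marg[:, 1])[0, 1]
print(f"correlation after conditional permutation: {corr_cond:.2f}")
print(f"correlation after marginal permutation:    {corr_marg:.2f}")
```

In this sketch the conditionally permuted column stays correlated with its partner, so a downstream CATE model is evaluated on realistic covariate combinations; any richer conditional sampler could be swapped in for the linear regression.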

If this is right

Higher statistical power becomes available for identifying treatment-effect drivers when sample sizes are small.
Reliable rankings remain feasible even when predictors are numerous and correlated.
The method supplies a practical tool for biomedical causal analyses that must operate in limited-data regimes.
Empirical gains appear consistently in both controlled simulations and real health records.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditional-permutation construction could be applied to other causal functionals such as individual treatment effects or policy value functions.
Pairing PermuCATE with different base CATE learners might further tighten variance bounds in high-dimensional regimes.
The variance advantage could translate into smaller required sample sizes for achieving a target power level in study design.

Load-bearing premise

Adapting conditional permutation importance to the CATE setting preserves its statistical properties and produces strictly lower variance than LOCO without introducing adaptation-specific bias.

What would settle it

A repeated finite-sample simulation in which PermuCATE exhibits equal or higher variance than LOCO on the same CATE estimates across multiple data-generating processes would falsify the central claim.
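A falsification check of this kind could be set up roughly as follows. This is a sketch under strong assumptions, not the authors' protocol: the toy data-generating process, the use of an oracle-like pseudo-outcome (true effect plus noise), linear CATE models, the residual-shuffling sampler, and the factor 1/2 applied to the permutation loss difference (so that both scores target the same quantity) are all choices made here for illustration. The helper names `simulate` and `importances` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def simulate(n, rng, rho=0.5):
    """Hypothetical toy DGP: three correlated pairs, effect driven by X0, X1, X2."""
    cov = np.kron(np.eye(3), np.array([[1.0, rho], [rho, 1.0]]))
    X = rng.multivariate_normal(np.zeros(6), cov, size=n)
    tau = X[:, 0] + 2 * X[:, 1] + X[:, 2]
    phi = tau + rng.normal(size=n)  # stand-in for a pseudo-outcome (assumption)
    return X, phi


def importances(X, phi, j, rng):
    """Return (LOCO, CPI) importance of feature j, evaluated on a held-out half."""
    half = len(X) // 2
    X_tr, X_te, p_tr, p_te = X[:half], X[half:], phi[:half], phi[half:]
    rest = np.delete(np.arange(X.shape[1]), j)
    full = LinearRegression().fit(X_tr, p_tr)
    base = (p_te - full.predict(X_te)) ** 2
    # LOCO: refit the CATE model without feature j.
    sub = LinearRegression().fit(X_tr[:, rest], p_tr)
    loco = np.mean((p_te - sub.predict(X_te[:, rest])) ** 2 - base)
    # CPI: keep the fitted model and conditionally permute feature j.
    cond = LinearRegression().fit(X_tr[:, rest], X_tr[:, j])
    mean_j = cond.predict(X_te[:, rest])
    X_perm = X_te.copy()
    X_perm[:, j] = mean_j + rng.permutation(X_te[:, j] - mean_j)
    cpi = 0.5 * np.mean((p_te - full.predict(X_perm)) ** 2 - base)
    return loco, cpi


rng = np.random.default_rng(0)
scores = np.array(
    [importances(*simulate(400, rng), j=1, rng=rng) for _ in range(200)]
)
var_loco, var_cpi = scores.var(axis=0, ddof=1)
print(f"mean importance  LOCO={scores[:, 0].mean():.2f}  CPI={scores[:, 1].mean():.2f}")
print(f"variance         LOCO={var_loco:.4f}  CPI={var_cpi:.4f}")
```

Repeating this over several data-generating processes, sample sizes, and CATE learners, and replacing the oracle-like pseudo-outcome with an estimated one, would be needed before drawing any conclusion about the central claim.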

Figures

Figures reproduced from arXiv: 2408.13002 by Angel Reyero Lobo, Bertrand Thirion, Denis A. Engemann, Joseph Paillard, Vitaliy Kolodyazhniy.

**Figure 1.** PermuCATE detects important variables with greater statistical power. Using the data-generating process proposed by Hines et al. (2022), we compared the estimates of variable importance using LOCO (top subplot), the baseline proposed by the authors, and PermuCATE. By computing p-values, we observed at different sample sizes whether each of the three important variables was correctly identified (true posi…

**Figure 3.** The PermuCATE method identified more important variables in high-dimensional, linear, and complex scenarios. (a) Statistical power for detecting important variables as a function of sample size on the HP dataset (non-linear scenario) with PermuCATE and LOCO methods. The CATE was estimated with a DR-learner using super-learners stacking gradient boosting trees and regularized linear models to estimate the …

**Figure 4.** Comparison of variable importance methods with three different learners on the IHDP benchmark. The CATE was estimated using a Causal Forest (CF, Athey & Wager 2019), a deep neural network (CATENet, Curth & Schaar 2021), and a pre-trained tabular foundation model (TabPFN, Hollmann et al. 2025). For each learner, variable importance was estimated with LOCO and PermuCATE. (a) Displays the negative Precision i…

Original abstract

Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PermuCATE adapts CPI to CATE variable importance and reports lower finite-sample variance than LOCO, but the theory needs checking on whether it survives estimated CATE.

The paper's main contribution is PermuCATE, which takes the conditional permutation importance framework and applies it to global variable importance for CATE models. The authors position it as a distinct algorithm and show through finite-sample analysis and experiments that it has lower variance than the LOCO baseline while remaining reliable. This is presented as useful for biomedical settings where data is limited and identifying drivers of treatment heterogeneity matters for power.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes PermuCATE, an adaptation of the Conditional Permutation Importance (CPI) framework for global variable importance assessment in Conditional Average Treatment Effect (CATE) estimation. It claims that finite-sample theoretical analysis and empirical studies demonstrate lower variance than the Leave-One-Covariate-Out (LOCO) baseline while remaining reliable, thereby increasing statistical power in limited-data biomedical settings, with demonstrations on simulated data and real-world health datasets involving up to hundreds of correlated variables.

Significance. If the finite-sample variance reduction holds when the CATE is estimated from data rather than treated as an oracle, the result would supply a higher-powered alternative to LOCO for identifying drivers of treatment heterogeneity, addressing a practical need in causal machine learning for biomedical applications.

major comments (1)

[Theoretical analysis] Theoretical analysis section: the finite-sample variance comparison between PermuCATE and LOCO is presented as holding while remaining unbiased, but the derivation appears to condition on the true (oracle) CATE function; it is unclear whether the claimed strict variance reduction survives replacement by a data-driven CATE estimator whose estimation error may be correlated with the permutation step, potentially introducing additional finite-sample bias or variance inflation absent from the LOCO baseline.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address the major theoretical concern point-by-point below and will incorporate clarifications into the revised version.

Referee: Theoretical analysis section: the finite-sample variance comparison between PermuCATE and LOCO is presented as holding while remaining unbiased, but the derivation appears to condition on the true (oracle) CATE function; it is unclear whether the claimed strict variance reduction survives replacement by a data-driven CATE estimator whose estimation error may be correlated with the permutation step, potentially introducing additional finite-sample bias or variance inflation absent from the LOCO baseline.

Authors: The finite-sample variance analysis in Section 3 is derived under the oracle CATE assumption to obtain exact expressions and prove the strict variance reduction of PermuCATE relative to LOCO while preserving unbiasedness. This establishes the core statistical advantage of conditional permutation in the ideal case. For data-driven CATE estimators, the manuscript relies on the extensive empirical evaluation in Sections 4 and 5 across multiple estimators (random forests, neural networks) and datasets, which consistently demonstrate lower variance for PermuCATE without evidence of additional bias or inflation from estimation error. We agree that explicitly stating the oracle assumption and discussing the transition to estimated CATE would improve clarity. We will revise the theoretical section to highlight the assumption and add a dedicated paragraph on the empirical robustness to estimation error.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The provided abstract and description present PermuCATE as an adaptation of the established Conditional Permutation Importance (CPI) framework, with theoretical finite-sample analysis and empirical comparisons to the LOCO baseline. No equations, self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations are quoted or evident. The variance and power claims rest on independent theoretical analysis and external empirical benchmarks rather than reducing to the method's own inputs by construction. This is the most common honest finding for papers that build on prior methods with separate validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

Domain assumption: Conditional permutation importance retains its validity and variance properties when applied to CATE estimation.
The method's claimed advantages rest on this transfer of CPI properties to the heterogeneous treatment effect setting.

pith-pipeline@v0.9.0 · 5694 in / 1217 out tokens · 23176 ms · 2026-05-23T21:44:41.863527+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Candes, E., Fan, Y., Janson, L., and Lv, J

doi: 10.48550/arxiv.2308.03369. Candes, E., Fan, Y., Janson, L., and Lv, J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80,

work page doi:10.48550/arxiv.2308.03369
[2]

Chamma, A., Thirion, B., and Engemann, D

doi: 10.48550/arxiv.2309.07593. Chamma, A., Thirion, B., and Engemann, D. Variable importance in high-dimensional settings requires grouping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 11195–11203,

work page doi:10.48550/arxiv.2309.07593
[3]

Holland, P

doi: 10.48550/arxiv.2204.06030. Holland, P. W. Statistics and Causal Inference. Journal of the American Statistical Association, (396):945,

work page doi:10.48550/arxiv.2204.06030
[4]

Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S

doi: 10.2307/2289064. Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., Schirrmeister, R. T., and Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326,

work page doi:10.2307/2289064
[5]

Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, 2025

doi: 10.1038/s41586-024-08328-6. Imbens, G. W. and Rubin, D. B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press,

work page doi:10.1038/s41586-024-08328-6
[6]

Kong, J., Ha, D., Lee, J., Kim, I., Park, M., Im, S.-H., Shin, K., and Kim, S

doi: 10.1214/23-ejs2157. Kong, J., Ha, D., Lee, J., Kim, I., Park, M., Im, S.-H., Shin, K., and Kim, S. Network-based machine learning approach to predict immunotherapy response in cancer patients. Nature communications, 13:3703,

work page doi:10.1214/23-ejs2157
[7]

Le Goallec, A., Diai, S., Collin, S., Prost, J.-B., Vincent, T., and Patel, C

doi: 10.2202/1544-6115.1309. Le Goallec, A., Diai, S., Collin, S., Prost, J.-B., Vincent, T., and Patel, C. J. Using deep learning to predict abdominal age from liver and pancreas magnetic resonance images. Nature Communications, 13(1):1979,

work page doi:10.2202/1544-6115.1309 1979
[8]

Parrish, R

doi: 10.1002/alz.13431. Parrish, R. L., Buchman, A. S., Tasaki, S., Wang, Y., Avey, D., Xu, J., De Jager, P. L., Bennett, D. A., Epstein, M. P., and Yang, J. Sr-twas: leveraging multiple reference panels to improve transcriptome-wide association study power by ensemble machine learning. Nature Communications, 15(1):6646,

work page doi:10.1002/alz.13431
[9]

Sanchez, P., Voisey, J

doi: 10.48550/arXiv.2501.17520. Sanchez, P., Voisey, J. P., Xia, T., Watson, H. I., O’Neil, A. Q., and Tsaftaris, S. A. Causal machine learning for healthcare and precision medicine. Royal Society Open Science, 9,

work page doi:10.48550/arxiv.2501.17520
[10]

Watson, D

doi: 10.1214/24-STS937. Watson, D. S. and Wright, M. N. Testing conditional independence in supervised learning algorithms. Machine Learning, 110(8):2107–2129,

work page doi:10.1214/24-sts937
[11]

… $\frac{(Y - \mu(A, X))(A - \pi(X))}{\pi(X)(1 - \pi(X))} + \tau(X) - \hat\tau(X)\Big)^2\Big] = \underbrace{\mathbb{E}\big[(\tau(X) - \hat\tau(X))^2\big]}_{\tau\text{-risk}} + \mathbb{E}$ …

A.2. R- and pseudo-outcome-risk decomposition. In a scenario where the outcome $y$ is not deterministic, $y = m(x) + (a - \pi(x))\tau(x) + \epsilon(a, x)$, given a CATE estimate $\hat\tau$, the expectation of the pseudo-outcome risk can be formulated as $\mathbb{E}[R_{PO}(\hat\tau, X, A, Y)] = \mathbb{E}\big[(\varphi(Z) - \hat\tau(X))^2\big] = \mathbb{E}\Big[\Big(\frac{(Y - \mu(A, X))(A - \pi(X))}{\pi(X)(1 - \pi(X))} + \mu(1, X) - \mu(0, X) - \hat\tau(X)\Big)^2\Big] = \mathbb{E}\big[(Y -$ ...

work page 2025
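The pseudo-outcome and its risk, as they appear in the excerpt above, can be written out in a few lines. This is a minimal sketch: the function names `dr_pseudo_outcome` and `pseudo_outcome_risk` are hypothetical, the nuisance functions are passed in as arrays of oracle values, and the toy data (noise-free outcomes, logistic propensity) are assumptions chosen only so that the result is easy to check by hand.

```python
import numpy as np


def dr_pseudo_outcome(y, a, mu0, mu1, pi):
    """phi(Z) = (Y - mu(A, X)) (A - pi(X)) / (pi(X) (1 - pi(X))) + mu(1, X) - mu(0, X)."""
    mu_a = np.where(a == 1, mu1, mu0)
    return (y - mu_a) * (a - pi) / (pi * (1 - pi)) + mu1 - mu0


def pseudo_outcome_risk(phi, tau_hat):
    """Empirical pseudo-outcome risk: mean squared gap between phi and the CATE estimate."""
    return np.mean((phi - tau_hat) ** 2)


rng = np.random.default_rng(0)
x = rng.normal(size=1000)
pi = 1 / (1 + np.exp(-x))          # toy propensity (assumption)
mu0, tau = x, 2 * x                # toy baseline outcome and effect (assumption)
mu1 = mu0 + tau
a = rng.binomial(1, pi)
y = np.where(a == 1, mu1, mu0)     # noise-free outcome, for checking only

phi = dr_pseudo_outcome(y, a, mu0, mu1, pi)
print(pseudo_outcome_risk(phi, tau))        # essentially zero with oracle nuisances and no noise
print(pseudo_outcome_risk(phi, 0.0 * tau))  # risk of a constant-zero CATE estimate
```

With noisy outcomes the first term no longer vanishes, and dividing by $\pi(X)(1 - \pi(X))$ amplifies the noise where the propensity is close to 0 or 1.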
[12]

However, the term $\mathbb{E}\Big[\Big(\frac{\varepsilon(X, A)}{\pi(X)(1 - \pi(X))}\Big)^2\Big]$ will likely take extreme values and lead to inconsistencies between the $R_{PO}$-risk and the oracle $\tau$-MSE

a DR-learner might still provide reliable estimation of the CATE. However, the term $\mathbb{E}\Big[\Big(\frac{\varepsilon(X, A)}{\pi(X)(1 - \pi(X))}\Big)^2\Big]$ will likely take extreme values and lead to inconsistencies between the $R_{PO}$-risk and the oracle $\tau$-MSE. While this consideration holds for model selection of CATE estimators and is consistent with the findings from Doutreligne & Varoquaux 2025, as...

work page 2025
[13]

… $\frac{\varepsilon(X, A)}{\pi(X)(1 - \pi(X))}\Big)^2\Big] - \frac{1}{n_{test}} \sum_{n_{test}} \big[(\tau(X) - \hat\tau(X))^2\big] - \frac{1}{n_{test}} \sum_{n_{test}}$ …

In addition, the same experiment was performed using the R-risk (right panel). A.3. Proof of Proposition 3.2. Proof. The causal risk considered in this proof is the pseudo-outcome risk described in subsection A.2. Here again, $X^{P,j}$ denotes the features matrix with the $j$th feature sampled from the conditional distribution $[X^{P,j}]_j \sim X^j \mid X^{(-j)}$. $2\widehat{\Psi}^{j}_{CPI} =$ 1 nt...

work page 2025
[14]

The correlation structure from the LD dataset has been removed for clarity. A.10. Datasets. Low Dimensional (LD) dataset, taken from the work of Hines et al. 2022:
$(X_1, X_2), (X_3, X_4), (X_5, X_6) \sim \mathcal{N}\Big(0, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\Big)$
$\tau(X) = X_1 + 2X_2 + X_3$
$\pi(X) = \mathrm{expit}(-0.4X_1 + 0.1X_1X_2 + 0.25X_5)$
$A \sim \mathrm{Bernoulli}(\pi(X))$
$\mu_0(X) = X_3 - X_6$
$Y \sim \mu_0(X) + A\tau(X) + \mathcal{N}(0,$

work page 2022
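The LD data-generating process quoted in entry [14] can be simulated with a short sketch. The noise variance is cut off in the excerpt, so `noise_sd` is exposed as a parameter and its default of 1.0 is an assumption, not a value taken from the source; the function name `simulate_ld` is likewise hypothetical.

```python
import numpy as np


def simulate_ld(n, rng, noise_sd=1.0):
    """Draw n samples from the LD process; noise_sd is an assumed value."""
    pair_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    cov = np.kron(np.eye(3), pair_cov)  # three independent correlated pairs
    X = rng.multivariate_normal(np.zeros(6), cov, size=n)
    x1, x2, x3, x4, x5, x6 = X.T
    tau = x1 + 2 * x2 + x3
    logit = -0.4 * x1 + 0.1 * x1 * x2 + 0.25 * x5
    pi = 1 / (1 + np.exp(-logit))       # expit
    a = rng.binomial(1, pi)
    mu0 = x3 - x6
    y = mu0 + a * tau + rng.normal(scale=noise_sd, size=n)
    return X, a, y, tau, pi


rng = np.random.default_rng(0)
X, a, y, tau, pi = simulate_ld(5000, rng)
print("treated fraction:", a.mean())
print("corr(X1, X2):", np.corrcoef(X[:, 0], X[:, 1])[0, 1])
```

Only $X_1$, $X_2$, and $X_3$ enter $\tau(X)$, which matches the three important variables mentioned in the caption of Figure 1; $X_4$ is correlated with $X_3$ but does not drive the effect.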
[15]

2017; Curth & Schaar 2021, we used 100 repetitions of the simulations

Similar to Shalit et al. 2017; Curth & Schaar 2021, we used 100 repetitions of the simulations. A.11. Hyper-parameter search. All hyper-parameters were optimized using a nested cross-validation loop. For linear models, we used the scikit-learn implementation RidgeCV for regression and LogisticRegressionCV with a range of penalization strength from $10^{-3}$ to...

work page 2017
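A nested cross-validation loop with the two scikit-learn estimators named in entry [15] might look like the sketch below. The excerpt is truncated after the lower bound of the grid, so the upper bound of $10^{3}$, the number of grid points, the fold counts, and the synthetic data are all assumptions. Note also that `LogisticRegressionCV` takes `Cs`, the inverse of the regularization strength, so reusing the same grid for both models is a simplification.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegressionCV, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Assumed grid: only the lower bound 10^-3 is given in the source.
grid = np.logspace(-3, 3, 13)

# Synthetic stand-ins for an outcome model and a propensity model (hypothetical data).
X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
X_clf, y_clf = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: each estimator selects its own penalty by cross-validation.
ridge = RidgeCV(alphas=grid)
logit = LogisticRegressionCV(Cs=grid, cv=5, max_iter=1000)

# Outer loop: performance is scored on folds never seen by the inner selection.
outer = KFold(n_splits=5, shuffle=True, random_state=0)
ridge_scores = cross_val_score(ridge, X_reg, y_reg, cv=outer)   # R^2 per outer fold
logit_scores = cross_val_score(logit, X_clf, y_clf, cv=outer)   # accuracy per outer fold

ridge.fit(X_reg, y_reg)
print("selected ridge alpha:", ridge.alpha_)
print("outer-fold R^2:", ridge_scores.round(3))
print("outer-fold accuracy:", logit_scores.round(3))
```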
[16]

Our interpretation is that the complexity of covariates distributions also affects LOCO, by making the CATE harder to estimate. [Figure panels a–d: negative PEHE, power (α = 0.05), AUC, type-I error (α = 0.05); x-axis ticks 26, 52, 78, 104, 130] ...

work page 2019
