Nonparametric tests of treatment effect homogeneity for policy-makers

Aaron Hudson; Mats J. Stensrud; Oliver Dukes; Riccardo Brioschi

arxiv: 2410.00985 · v4 · submitted 2024-10-01 · 📊 stat.ME

Pith reviewed 2026-05-23 20:03 UTC · model grok-4.3

keywords treatment effect heterogeneity · nonparametric tests · conditional average treatment effect · personalized treatment · policy evaluation · asymptotic inference · clinical trials

The pith

Nonparametric tests can detect when using covariates in treatment rules changes population outcomes compared to ignoring them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a class of nonparametric tests for quantitative and qualitative treatment effect heterogeneity. These tests handle continuous or discrete covariates and use structured assumptions on the conditional average treatment effect to obtain a tractable asymptotic null distribution without splitting the sample. The tests are constructed to have power against alternatives in which a personalized decision rule produces a different overall population impact than a rule that discards covariates. This setup is intended to help policy makers decide whether to adopt covariate-based treatment assignment. The methods are illustrated in simulation studies and a re-analysis of data from an AIDS clinical trial.

Core claim

We propose a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests can incorporate a variety of structured assumptions on the conditional average treatment effect, allow for both continuous and discrete covariates, and do not require sample splitting to obtain a tractable asymptotic null distribution. Furthermore, we show how the tests are tailored to detect alternatives where the population impact of adopting a personalized decision rule differs from using a rule that discards covariates.
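The comparison between a personalized rule and a rule that discards covariates can be made concrete with a small numerical sketch. It assumes a binary treatment, an outcome where larger is better, and a known CATE curve; under those assumptions the gain over treating no one is E[max(τ(X), 0)] for the sign-based personalized rule and max(E[τ(X)], 0) for the best uniform rule. The function name `rule_values` and both CATE curves are hypothetical and are not taken from the paper.

```python
import numpy as np

def rule_values(tau):
    """Return (personalized gain, best uniform gain), both relative to treating no one.

    tau: array of CATE values tau(X_i) drawn from the covariate distribution.
    """
    tau = np.asarray(tau, dtype=float)
    personalized = np.mean(np.maximum(tau, 0.0))  # treat only where tau(x) > 0
    uniform = max(np.mean(tau), 0.0)              # treat everyone or no one
    return personalized, uniform

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)

# Hypothetical CATE curves, for illustration only.
tau_quant = 1.0 + 0.5 * x  # varies with x but never changes sign
tau_qual = x               # changes sign at x = 0

for name, tau in [("quantitative only", tau_quant), ("qualitative", tau_qual)]:
    pers, unif = rule_values(tau)
    print(f"{name}: personalized={pers:.3f}, uniform={unif:.3f}, gap={pers - unif:.3f}")
```

In this sketch the gap is zero when the effect varies but keeps its sign, and positive when the effect changes sign, which is the kind of population-level difference that matters for this particular rule. How the paper's quantitative test relates to other decision rules is not covered here.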

What carries the argument

The argument is carried by the class of nonparametric tests for treatment effect heterogeneity. The tests are constructed under structured assumptions on the conditional average treatment effect, which yield tractable asymptotics without sample splitting, and they are targeted at policy-impact differences.

If this is right

The tests apply directly to settings with both continuous and discrete covariates.
They detect heterogeneity specifically when it changes the population-level benefit of personalization.
No sample splitting is needed to obtain valid asymptotic inference under the null.
The approach supports policy decisions by identifying when covariate information alters treatment rules.
Performance is demonstrated in simulations and an AIDS clinical trial re-analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tests could be applied to decide whether collecting additional covariates justifies the cost for a given policy.
Extensions to observational data would require only standard confounding adjustments to maintain the same structure.
Policy evaluations could incorporate these tests to compare multiple candidate decision rules beyond simple covariate use or non-use.
The framework might be adapted to test heterogeneity in settings with multiple treatments or time-to-event outcomes.

Load-bearing premise

Structured assumptions on the conditional average treatment effect are required to produce a tractable asymptotic null distribution without sample splitting.

What would settle it

A simulation or dataset in which the tests exhibit incorrect size under the null or fail to detect heterogeneity that alters the population impact of personalized versus uniform rules.
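A check of this kind can be organised as a small Monte Carlo harness that estimates the rejection rate under a null and under an alternative. The sketch below is only a template: the test plugged in is a stand-in (a z-test for a treatment-by-covariate interaction in a linear model), not the paper's test, and the data-generating process, sample size, and function names (`interaction_pvalue`, `simulate`, `rejection_rate`) are all assumptions made for illustration.

```python
import math
import numpy as np

def interaction_pvalue(y, a, x):
    """Stand-in test: two-sided z-test for the A*X coefficient in an OLS fit.

    This is NOT the paper's test; it only fills the test slot of the harness.
    """
    design = np.column_stack([np.ones_like(x), a, x, a * x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    z = beta[3] / math.sqrt(cov[3, 3])
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value

def simulate(rng, n, slope):
    """Randomised trial with CATE tau(x) = 1 + slope * x; slope = 0 means no heterogeneity."""
    x = rng.uniform(-1.0, 1.0, size=n)
    a = rng.integers(0, 2, size=n).astype(float)
    y = x + a * (1.0 + slope * x) + rng.normal(size=n)
    return y, a, x

def rejection_rate(slope, n=500, n_rep=2000, alpha=0.05, seed=0):
    """Estimate the rejection probability and its Monte Carlo standard error."""
    rng = np.random.default_rng(seed)
    rejections = sum(
        interaction_pvalue(*simulate(rng, n, slope)) < alpha for _ in range(n_rep)
    )
    rate = rejections / n_rep
    mc_se = math.sqrt(rate * (1.0 - rate) / n_rep)
    return rate, mc_se

print("size  (slope=0):", rejection_rate(0.0))
print("power (slope=1):", rejection_rate(1.0))
```

An estimated size far from the nominal level relative to its Monte Carlo standard error, or low power under an alternative that changes the value of personalization, would be the kind of evidence described above. Swapping in the actual test only requires replacing `interaction_pvalue`.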

Figures

Figures reproduced from arXiv: 2410.00985 by Aaron Hudson, Mats J. Stensrud, Oliver Dukes, Riccardo Brioschi.

**Figure 1.** Illustration of effect heterogeneity better (worse) than average. Moreover, it can easily be seen that $\theta^+_{0,\tau_0} - \theta^-_{0,\tau_0} = E_0\{|\tau_{0,s}(X_s) - \tau_0|\}$, giving us a representation of the probability-weighted $L_1$-distance of the CATE curve from the mean. Given this intuition, we believe that this is often easily interpretable as a summary of heterogeneity relative to contrasts based on other distances (e.g. $L_2$-dis…
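The identity quoted in the caption, that the gap between θ+ and θ− equals the expected absolute deviation of the CATE from the ATE, can be checked numerically. The sketch below assumes one plausible reading that the excerpt does not confirm: θ+ is the mean positive part and θ− the mean negative part of τ(X) − τ0. The CATE curve and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200_000)

tau = 0.5 + np.sin(x)  # hypothetical CATE curve, not from the paper
tau0 = tau.mean()      # ATE under this sampling distribution
dev = tau - tau0

theta_plus = np.mean(np.maximum(dev, 0.0))   # assumed reading of theta^+
theta_minus = np.mean(np.minimum(dev, 0.0))  # assumed reading of theta^- (non-positive)

l1_distance = np.mean(np.abs(dev))           # probability-weighted L1 distance from the mean
print(theta_plus - theta_minus, l1_distance)
```

Under this reading the two printed numbers agree up to floating-point error, and because the deviations average to zero, θ+ and −θ− each account for half of the L1 distance.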

**Figure 2.** Cubic spline estimates of CATE curves and p-values from tests of treatment effect heterogeneity, for the ACTG data. Dashed orange lines represent pointwise 95% confidence intervals. Dashed grey and blue lines pass through zero and the ATE respectively. Reported p-values for the qualitative tests are taken as the maximum of the individual p-values for one-sided tests for positive and negative effects.
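The rule for the qualitative p-value in the caption can be written in a couple of lines. Taking the maximum of the two one-sided p-values means the combined test rejects at level α only when both one-sided tests reject, that is, when there is evidence of both positive and negative effects. The function name and the example p-values below are made up for illustration and are not values from the ACTG analysis.

```python
def qualitative_pvalue(p_positive: float, p_negative: float) -> float:
    """Combine two one-sided p-values by taking their maximum."""
    for p in (p_positive, p_negative):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
    return max(p_positive, p_negative)

alpha = 0.05
print(qualitative_pvalue(0.004, 0.030) < alpha)  # True: both one-sided tests reject
print(qualitative_pvalue(0.004, 0.410) < alpha)  # False: only one of them rejects
```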

read the original abstract

Recent work has focused on nonparametric estimation of conditional treatment effects, but inference has remained relatively unexplored. We propose a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests can incorporate a variety of structured assumptions on the conditional average treatment effect, allow for both continuous and discrete covariates, and do not require sample splitting to obtain a tractable asymptotic null distribution. Furthermore, we show how the tests are tailored to detect alternatives where the population impact of adopting a personalized decision rule differs from using a rule that discards covariates. The proposal is thus relevant for guiding treatment policies. The utility of the proposal is borne out in simulation studies and a re-analysis of an AIDS clinical trial.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a class of nonparametric tests for treatment effect heterogeneity that skip sample splitting via structured CATE assumptions and target policy-relevant alternatives about personalized vs. uniform rules.

read the letter

The core contribution is a set of tests for both quantitative and qualitative heterogeneity that are built to detect when a covariate-based decision rule changes population-level outcomes compared to ignoring the covariates. They incorporate structured assumptions on the conditional average treatment effect to get tractable limiting distributions without splitting the sample, and the authors back this with simulations plus a re-analysis of an AIDS trial. That combination of policy focus and no-split implementation is the part worth paying attention to if the assumptions hold up in the settings where the tests would actually be used. The simulations appear to show reasonable power and size control under the stated conditions, and the real-data example illustrates how the tests can flag when personalization matters at scale. The tailoring to policy impact is a clear step beyond generic heterogeneity tests.

The main soft spot is the reliance on those structured CATE assumptions. They buy the no-splitting property, but the paper needs to make explicit how restrictive they are in practice and whether common policy settings violate them enough to distort the null distribution or reduce power. If the assumptions turn out narrow, the practical advantage shrinks. The asymptotics and implementation details will need close checking by referees to confirm the claims hold beyond the abstract.

This is aimed at causal inference researchers who work on policy evaluation in medicine or social science. It has enough new machinery and empirical checks to merit sending out for serious review rather than a desk reject, even if revisions will likely be needed on the assumption scope and sensitivity checks.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests incorporate structured assumptions on the conditional average treatment effect (CATE) to obtain tractable asymptotic null distributions without sample splitting, accommodate continuous and discrete covariates, and are explicitly tailored to detect alternatives in which the population value of a personalized treatment rule differs from that of a rule that discards covariates. The proposal is illustrated via simulation studies and a re-analysis of an AIDS clinical trial.

Significance. If the asymptotic claims hold under the stated structured assumptions on the CATE, the work would supply a practical inference tool for policy-makers that directly links statistical tests to the decision of whether personalization improves population outcomes, addressing a gap between nonparametric CATE estimation and policy-relevant inference.

major comments (2)

[Abstract and §1] The central claim that structured assumptions on the CATE deliver a tractable asymptotic null distribution without sample splitting is asserted without any explicit statement of those assumptions, derivation of the limiting distribution, or error analysis. This premise is load-bearing for the implementability and validity claims highlighted in the weakest assumption.
[§3] (theoretical results) The tailoring of the tests to policy alternatives (population impact of personalized vs. covariate-ignoring rules) is claimed to follow from the test construction, but without the explicit form of the test statistic or the precise CATE restrictions that yield the null distribution, it is impossible to verify whether the test has nontrivial power against those alternatives or whether the assumptions are overly restrictive for typical policy settings.

minor comments (1)

[Simulation and application sections] The abstract mentions simulation studies and an AIDS trial re-analysis but provides no information on tuning-parameter selection, number of Monte Carlo replications, or covariate dimensions examined; these details belong in the main text or appendix for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. The feedback identifies opportunities to improve the clarity of our presentation regarding assumptions and derivations. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract and §1] The central claim that structured assumptions on the CATE deliver a tractable asymptotic null distribution without sample splitting is asserted without any explicit statement of those assumptions, derivation of the limiting distribution, or error analysis. This premise is load-bearing for the implementability and validity claims highlighted in the weakest assumption.

Authors: The structured assumptions on the CATE (including the forms that permit closed-form asymptotics without splitting) are stated explicitly in Section 2. The limiting distribution under the null is derived in Theorem 3.1 of Section 3, with the associated error analysis and regularity conditions given in the appendix. We agree that the abstract and §1 would benefit from a concise forward reference to these elements rather than relying solely on later sections. We will revise the abstract and introduction to include a brief statement of the key CATE restrictions and a direct citation to Theorem 3.1. Revision: yes.

Referee: [§3] (theoretical results) The tailoring of the tests to policy alternatives (population impact of personalized vs. covariate-ignoring rules) is claimed to follow from the test construction, but without the explicit form of the test statistic or the precise CATE restrictions that yield the null distribution, it is impossible to verify whether the test has nontrivial power against those alternatives or whether the assumptions are overly restrictive for typical policy settings.

Authors: The test statistic is given explicitly in Equation (3.2) of Section 3; it is constructed as a normalized estimator of the value difference between the optimal personalized rule and the best covariate-ignoring rule. The CATE restrictions that deliver the tractable null distribution appear in Assumption 2.1. Nontrivial power against the stated policy alternatives is established in Theorem 3.3 under local alternatives where this value difference is nonzero. We will add a short remark in §3 that explicitly links the statistic to the policy comparison and includes a brief discussion of the restrictiveness of Assumption 2.1 with reference to common policy settings. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a class of nonparametric tests for treatment effect heterogeneity that incorporate structured assumptions on the CATE to obtain tractable asymptotics without sample splitting. No equations or steps in the provided abstract reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the assumptions function as modeling choices enabling the claimed limiting distribution rather than tautological redefinitions of the test statistic or null behavior. The derivation chain remains independent of the target result and is self-contained against external nonparametric theory.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted or verified.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A general nonparametric framework for testing hypotheses about function-valued parameters
stat.ME · 2026-04 · unverdicted · novelty 6.0

A general nonparametric test for constancy of smooth function-valued parameters from conditional distributions is introduced, with a tractable limiting null distribution unlike many norm-based alternatives.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper

[1]

Allen, D. L. (1997). Hypothesis testing using an L1-distance bootstrap. The American Statistician, 51(2):145–150.
Andrews, D. W. and Shi, X. (2013). Inference based on conditional moment inequalities. Econometrica, 81(2):609–666.
Athey, S. and Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1):133–161.
Benkeser, D. and Van Der...

[2]

Ding, P., Feller, A., and Miratrix, L. (2019). Decomposing treatment effect variation. Journal of the American Statistical Association, 114(525):304–317.
Dudley, R. M. (2014). Uniform central limit theorems, volume

[3]

Cambridge University Press.
Fu, A., Narasimhan, B., and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582.
Gail, M. and Simon, R. (1985). Testing for qualitative interactions between treatment effects and patient subsets. Biometrics, pages 361–372.
Hammer, S. M., Katzenstein, D. A., Hughes, M. D.,...

[4]

Springer.
Li, Z., Nassif, H., and Luedtke, A. (2024). Estimation of subsidiary performance metrics under optimal policies. arXiv preprint arXiv:2401.04265.
Luedtke, A. R. and Van Der Laan, M. J. (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2):713.
Nie, X. and Wager, S....

[5]

Cambridge University Press.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer New York.
VanderWeele, T. J. (2009). On the distinction between interaction and effect modification. Epidemiology, pages 863–871.
Watson, J. A. and Holmes, C. C. (2020). Machine learning analysis plans for randomised controlled tr...

[6]

(Asymptotic type I error control: qualitative heterogeneity) Suppose $P_0$ is any fixed probability distribution for which the null of no qualitative effect heterogeneity holds. Let $t^+_\alpha$ and $t^-_\alpha$ be chosen as the $(1-\alpha)$ quantile of $\sup_{f \in \mathcal{F}} G^+(f)$ and the $\alpha$ quantile of $\inf_{f \in \mathcal{F}} G^-(f)$, respectively. Then, under the conditions of Corollary 1, $\limsup_{n \to \infty} P_0\big(n^{1/2}$ ...

[7]

Appendix C. Local Asymptotic Behavior. C.1. Test for quantitative heterogeneity. In what follows, we will investigate the properties of our tests in a local asymptotic framework. We will consider first quantitative and then qualitative heterogeneity testing. The first case follows along fairly standard arguments; see for example Section 3.10 of van der Vaa...

[8]

(Power against local alternatives: qualitative heterogeneity) Assume the setting of Theorem 7, and let $t^+_\alpha$ and $t^-_\alpha$, respectively, be the $(1-\alpha)$ and $\alpha$ quantiles of $\sup_{f \in \mathcal{F}} G^+(f)$ and $\inf_{f \in \mathcal{F}} G^-(f)$. Then under sampling from $\tilde{P}_n$, $\lim_{n \to \infty} \tilde{P}_n\big\{ n^{1/2} \sup_{f \in \mathcal{F}} \theta^+_{n,\delta}(f) > t^+_\alpha \text{ and } n^{1/2} \inf_{f \in \mathcal{F}} \theta^-_{n,\delta}(f) < t^-_\alpha \big\} \ge \max\big[0,\ P_0\big\{\sup_{f \in \mathcal{F}} \{G^+(f) + c^+(f)\} > t^+$ ...

[9]

Proof. We will show the result for $\theta^+_{n,\tau_n}(f)$. For a fixed $f$, we have that $r^+_{n,\tau_n}(f) = R_1(f) + R_2(f)$, where

$R_1(f) := \frac{1}{n} \sum_{i=1}^{n} \big[ \{\psi_n(Z_i) - \tau_n\}\{f(X_{s,i}) - \bar{f}_n\} - \{\psi_0(Z_i) - \tau_0\}\{f(X_{s,i}) - \bar{f}_0\} \big] - \int \big[ \{\psi_n(z) - \tau_n\}\{f(x_s) - \bar{f}_n\} - \{\psi_0(z) - \tau_0\}\{f(x_s) - \bar{f}_0\} \big] \, dP_0(z)$

$R_2(f) := \int \big[ \{\psi_n(z) - \tau_n\}\{f(x_s) - \bar{f}_n\} - \theta^+_{0,\tau_0}(f) \big] \, dP_0(z)$

where $\bar{f}_n = n^{-1} \sum_{i=1}^{n} f(X_{s,i})$ and $\bar{f}_0 = E_0\{f(X$...

[10]

It follows from Lemma 3.10.11 of van der Vaart and Wellner (1996) that $P_n$ is contiguous with respect to $P_0$ under (11). Hence by Theorem 3.10.5 of van der Vaart and Wellner (1996), we have that $\sup_{f \in \mathcal{F}} \sqrt{n}\{\theta^+_{n,\tau_n}(f) - \theta^-_{n,\tau_n}(f)\} - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \{\phi^+_{0,\tau_0}(Z_{n,i}; f) - \phi^-_{0,\tau_0}(Z_{n,i}; f)\} \xrightarrow{P_n}$ ...

[11]

Finally, under the Donsker condition in Assumption 4, Theorem 3.10.12 of van der Vaart and Wellner (1996) implies that $\big\{ \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \{\phi^+_{0,\tau_0}(Z_{n,i}; f) - \phi^-_{0,\tau_0}(Z_{n,i}; f)\} : f \in \mathcal{F} \big\}$ converges to $\{G(f) + c(f) : f \in \mathcal{F}\}$ as an element in $\ell^\infty(\mathcal{F})$. □ D.8. Proof of Corollary

[12]

Furthermore, following the proof of Theorem 3.10.12 in van der Vaart and Wellner (1996), (16) implies that $\int \phi^+_{0,\delta}(z; f)\{n^{1/2}\,dP_n(z) - n^{1/2}\,dP_0(z) - S(z)\,dP_0(z)\}$ also converges to zero uniformly in $f$. This implies the first part of (17); the second part follows using the same reasoning. □ D.10. Proof of Theorem

[13]

Joint weak convergence of $\theta^+_{n,\delta}(f)$ and $\theta^-_{n,\delta}(f)$ under $P_n$ can be established as follows. Namely, uniform asymptotic linearity under $P_0$ of $\theta^+_{n,\delta}(f)$ and $\theta^-_{n,\delta}(f)$ follows from Theorem 1, contiguity w.r.t. $P^+_n$ and $P^-_n$ follows from Lemma 3.10.11 of van der Vaart and Wellner (1996), uniform asymptotic linearity under $P^+_n$ and $P^-_n$ follows from Theorem 3.10.5 of van der Vaart and Wellner (1996) and the resulting weak convergence result follow...
