Nonparametric inference for sublevel-set probabilities of conditional average treatment effect functions
Pith reviewed 2026-05-19 15:22 UTC · model grok-4.3
pith:WGRJXBIQ
The pith
The probability that a conditional average treatment effect falls below a given threshold produces a monotone curve summarizing treatment heterogeneity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize the curve of sublevel-set probabilities of a CATE function as a target parameter. This curve is not pathwise differentiable under a nonparametric model. To address this, we leverage advances in monotone function estimation and develop a Grenander-type estimator that incorporates machine learning. We also show that the best piecewise linear approximation to the curve is pathwise differentiable and develop a debiased machine learning estimator for it. The methods are studied in numerical experiments based on data synthesized from randomized trials and illustrated on a diabetes medication trial.
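Grenander-type estimators are commonly computed as slopes of the greatest convex minorant of a cumulative sum diagram, which for a non-decreasing fit coincides with isotonic regression by pool-adjacent-violators. The sketch below shows only that generic monotone projection step on a hypothetical noisy curve. It is not the paper's estimator; the function name `pava_increasing` and the input values are illustrative assumptions.

```python
import numpy as np

def pava_increasing(y, w=None):
    """Weighted least-squares projection of y onto non-decreasing sequences.

    Assumes strictly positive weights. Illustrative helper, not from the paper.
    """
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block stores [block mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

# Hypothetical noisy estimates of a curve that should be non-decreasing.
noisy = np.array([0.10, 0.05, 0.30, 0.25, 0.60, 0.90, 0.85])
fit = pava_increasing(noisy)
```

With unit weights the projection replaces each run of violating points by its average, so the fitted values keep the same total as the input.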
What carries the argument
The sublevel-set probability of the CATE function, defined as the probability that CATE(X) does not exceed a prespecified threshold, which traces a univariate monotone curve as the threshold varies.
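As a minimal sketch of the definition, assuming a vector of CATE values is already available, the curve can be traced by computing the empirical fraction of values at or below each threshold. This is a naive plug-in shown for intuition only, not the paper's Grenander-type or debiased estimator; `sublevel_curve`, `tau_hat`, and `grid` are hypothetical names and values.

```python
import numpy as np

def sublevel_curve(cate_values, thresholds):
    """Empirical fraction of CATE values at or below each threshold."""
    tau = np.sort(np.asarray(cate_values, dtype=float))
    t = np.asarray(thresholds, dtype=float)
    # searchsorted with side="right" counts the entries that are <= t.
    return np.searchsorted(tau, t, side="right") / tau.size

# Hypothetical CATE values for eight individuals and a grid of thresholds.
tau_hat = np.array([-1.2, -0.4, 0.0, 0.3, 0.3, 1.1, 2.5, 3.0])
grid = np.array([-2.0, 0.0, 0.3, 1.0, 3.0])
theta_hat = sublevel_curve(tau_hat, grid)
```

The result is non-decreasing in the threshold by construction, which is the monotonicity the summary refers to.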
If this is right
- Varying the threshold produces a univariate monotone curve that visualizes the overall type and degree of heterogeneity in a population.
- The curve can be targeted and estimated via monotone function techniques combined with machine learning.
- The best piecewise linear approximation to the curve is pathwise differentiable and admits a debiased machine learning estimator.
- Finite-sample performance of the estimators can be assessed in numerical studies based on synthesized randomized trial data.
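The piecewise linear approximation mentioned in the list can be illustrated with an ordinary least-squares fit in a hat-function basis. The sketch assumes fixed knots and an unweighted squared-error criterion on a uniform grid; the norm and knot choice used in the paper are not stated on this page, and the curve below is a hypothetical stand-in rather than an estimate.

```python
import numpy as np

def hat_basis(t, knots):
    """Linear spline (hat function) basis evaluated at the points t."""
    eye = np.eye(len(knots))
    # Interpolating the j-th unit vector over the knots gives the j-th hat function.
    return np.column_stack([np.interp(t, knots, eye[j]) for j in range(len(knots))])

def best_piecewise_linear(t, curve, knots):
    """Least-squares piecewise linear fit on the grid t; returns values at the knots."""
    B = hat_basis(t, knots)
    coef, *_ = np.linalg.lstsq(B, curve, rcond=None)
    return coef

t = np.linspace(-2.0, 2.0, 401)
curve = 1.0 / (1.0 + np.exp(-2.0 * t))  # hypothetical smooth monotone curve
knots = np.linspace(-2.0, 2.0, 5)
coef = best_piecewise_linear(t, curve, knots)
approx = hat_basis(t, knots) @ coef
```

Because the fit minimizes squared error over all piecewise linear functions with these knots, it is at least as close to the curve (in that criterion) as simply connecting the curve's values at the knots.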
Where Pith is reading between the lines
- The same sublevel-probability construction could be applied to other causal functionals that lack pathwise differentiability once monotonicity is imposed.
- The resulting curve offers a direct way to communicate the fraction of a population expected to benefit or be harmed by treatment at any chosen effect size.
- Extensions to observational data would require only that the identification assumptions for the CATE remain valid and that the monotonicity structure is preserved.
Load-bearing premise
The conditional average treatment effect function is identifiable from observed data under randomized treatment assignment.
What would settle it
A simulation or randomized trial in which the true proportion of individuals whose CATE lies below each threshold is known from the data-generating process and the proposed Grenander-type estimator fails to recover that proportion at the rates predicted by the theory.
Original abstract
The average treatment effect can obscure important heterogeneity when individuals respond differently to a treatment. While the conditional average treatment effect (CATE) function captures such heterogeneity, it is difficult to communicate when it depends on many covariates. Sublevel sets of a multivariate CATE function are equally complicated objects, but the probability of a sublevel set of a CATE function is a single number with a simple interpretation as the proportion of individuals whose expected treatment effect does not exceed a prespecified threshold. By varying the threshold, a univariate monotone curve appears which can be used to visualize the overall type and degree of heterogeneity in a population. We formalize this curve as a target parameter and show that it is not pathwise differentiable under a nonparametric model. To address this nonstandard estimation problem, we leverage recent advances in monotone function estimation and develop a Grenander-type estimator that incorporates machine learning. We also show that the best piecewise linear approximation to the curve of interest is a pathwise differentiable parameter, and we develop a debiased machine learning estimator of this approximation. We investigate our proposed estimators' finite sample performance in a sequence of numerical studies based on data synthesized from a randomized trial. The methods are illustrated in data from a randomized trial on diabetes medication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formalizes the sublevel-set probability curve θ(t) = P(τ(X) ≤ t) as a univariate monotone summary of treatment-effect heterogeneity for the conditional average treatment effect function τ. It proves that θ(t) is not pathwise differentiable under a nonparametric model, develops a Grenander-type estimator that incorporates machine learning, and constructs a debiased machine-learning estimator for the best piecewise-linear approximation to the curve. Finite-sample performance is examined in numerical studies on synthesized randomized-trial data, and the methods are illustrated on data from a diabetes-medication trial.
Significance. If the non-differentiability result holds, the work supplies a simple, interpretable univariate curve for visualizing the overall type and degree of heterogeneity that is otherwise difficult to communicate from a high-dimensional CATE. The integration of recent monotone-function estimation techniques with modern machine learning is a methodological contribution, and the dual-estimator strategy (Grenander-type for the original parameter and debiased ML for the differentiable approximation) is a pragmatic response to the non-regularity. The numerical studies on synthesized data and the real-data illustration provide concrete evidence of applicability in randomized-trial settings.
major comments (3)
- [Section on target parameter and non-differentiability] Section formalizing the target parameter and the non-differentiability result: the tangent-space argument for non-pathwise differentiability should explicitly treat the case in which the distribution of τ(X) places positive mass at the level t. When P(τ(X)=t)>0, a directional derivative may still exist, which would undermine the claim that the functional is non-differentiable and therefore the justification for abandoning standard debiased ML in favor of the Grenander-type estimator.
- [Estimator construction] Description of the Grenander-type estimator (presumably §3 or §4): the precise regularity conditions under which the machine-learning plug-in for the CATE is inserted into the monotone estimator, and the resulting asymptotic distribution, are not fully stated. Without these conditions it is difficult to verify that the proposed estimator attains the expected cube-root rate or that the confidence bands are valid.
- [Numerical studies] Numerical studies section: the manuscript states that finite-sample performance is investigated, yet no quantitative summaries (bias, MSE, coverage rates, or comparison to oracle estimators) are provided in the text or tables. This omission prevents assessment of whether the estimators behave as predicted by the theory under the synthesized randomized-trial designs.
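The quantitative summaries requested in the last comment could be computed along the following lines. This is a sketch under stated assumptions: estimates and interval endpoints are stored as arrays with one row per Monte Carlo replication and one column per threshold, coverage is pointwise rather than simultaneous, and the function name and toy numbers are hypothetical, not results from the paper.

```python
import numpy as np

def summarize_replications(estimates, lower, upper, truth):
    """Bias, MSE, and pointwise interval coverage across replications.

    estimates, lower, upper: shape (n_replications, n_thresholds).
    truth: shape (n_thresholds,), known from the data-generating process.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    err = estimates - truth
    covered = (np.asarray(lower) <= truth) & (truth <= np.asarray(upper))
    return {
        "bias": err.mean(axis=0),
        "mse": (err ** 2).mean(axis=0),
        "coverage": covered.mean(axis=0),
    }

# Toy input: four replications, two thresholds, fixed-width intervals.
truth = np.array([0.25, 0.50])
est = np.array([[0.20, 0.55], [0.30, 0.50], [0.25, 0.60], [0.35, 0.45]])
half_width = 0.08
out = summarize_replications(est, est - half_width, est + half_width, truth)
```

Running the same summary on an oracle estimator that uses the true CATE would give the comparison the referee mentions.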
minor comments (2)
- [Abstract] The abstract could explicitly name the non-differentiability result and the two proposed estimators to give readers an immediate overview of the technical contribution.
- [Throughout] Notation for the CATE function and the sublevel probability should be introduced once and used consistently; occasional switches between τ(X) and other symbols for the same object reduce readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each of the three major comments in turn below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Section on target parameter and non-differentiability] Section formalizing the target parameter and the non-differentiability result: the tangent-space argument for non-pathwise differentiability should explicitly treat the case in which the distribution of τ(X) places positive mass at the level t. When P(τ(X)=t)>0, a directional derivative may still exist, which would undermine the claim that the functional is non-differentiable and therefore the justification for abandoning standard debiased ML in favor of the Grenander-type estimator.
  Authors: We appreciate the referee highlighting the need to treat atoms explicitly. The manuscript's non-differentiability argument relies on the fact that θ(t) is defined via the distribution function of the random variable τ(X), and the tangent-space calculation shows that no linear representation exists for arbitrary perturbations of the law of (X, Y(0), Y(1)). When P(τ(X)=t)>0 the functional has a discontinuity in t, but this does not restore pathwise differentiability under the nonparametric model; suitable score functions can still produce second-order changes that prevent a first-order representation. To make this transparent, we will revise the relevant section to include a separate paragraph (or lemma) that directly addresses the atomic case and confirms that the directional derivative fails to exist in the full nonparametric tangent space.
  revision: yes
- Referee: [Estimator construction] Description of the Grenander-type estimator (presumably §3 or §4): the precise regularity conditions under which the machine-learning plug-in for the CATE is inserted into the monotone estimator, and the resulting asymptotic distribution, are not fully stated. Without these conditions it is difficult to verify that the proposed estimator attains the expected cube-root rate or that the confidence bands are valid.
  Authors: The referee correctly notes that the regularity conditions and limiting distribution for the Grenander-type estimator are stated at a high level. In the revision we will add an explicit theorem (with numbered assumptions) that lists the required convergence rates for the machine-learning estimator of τ, the smoothness conditions on the density of τ(X), and the resulting cube-root-n asymptotic distribution of the estimator and its associated confidence bands. This will make verification of the cube-root rate and band validity straightforward.
  revision: yes
- Referee: [Numerical studies] Numerical studies section: the manuscript states that finite-sample performance is investigated, yet no quantitative summaries (bias, MSE, coverage rates, or comparison to oracle estimators) are provided in the text or tables. This omission prevents assessment of whether the estimators behave as predicted by the theory under the synthesized randomized-trial designs.
  Authors: We agree that the numerical studies would be more informative with explicit quantitative summaries. The current version emphasizes visual diagnostics; the revised manuscript will include a table (or set of tables) reporting bias, MSE, coverage probabilities of the confidence bands, and comparisons against oracle estimators that use the true CATE, for each of the simulation designs described in the section.
  revision: yes
Circularity Check
No circularity: target parameter and non-differentiability shown via external tangent-space methods
full rationale
The paper defines the sublevel-set probability θ(t) = P(τ(X) ≤ t) directly from the identifiable CATE function under randomized assignment, then invokes standard nonparametric tangent-space arguments to establish lack of pathwise differentiability. Estimation proceeds by importing Grenander-type monotone estimators and debiased ML from the cited literature on monotone functions and double ML, without any reduction of θ(t) to a fitted parameter or self-referential construction. No self-citation is load-bearing for the core non-differentiability claim, and the piecewise-linear approximation is handled separately with its own influence function. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The conditional average treatment effect function is identifiable from the observed data under randomized treatment assignment.
- standard math: The sublevel-set probability curve is monotone non-decreasing in the threshold value.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: We formalize this curve as a target parameter and show that it is not pathwise differentiable under a nonparametric model.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: the efficient influence function of Γ(α) is υ_α(P)(O) = 1{τ(P)(W)≤α}(α−φ(P)(O))−Γ(P)(α)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.