Semi-Supervised Treatment Effect Estimation with Unlabeled Covariates for Prediction-Powered Causal Inference
Pith reviewed 2026-05-17 23:44 UTC · model grok-4.3
The pith
Incorporating auxiliary covariates lowers the efficiency bound for treatment effect estimation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In both the one-sample (censoring) and two-sample (case-control) settings, incorporating auxiliary unlabeled covariates lowers the efficiency bound and yields estimators whose asymptotic variance is smaller than that of estimators that use only the labeled triple of covariates, treatment, and outcome.
What carries the argument
Efficiency bounds derived separately for the one-sample and two-sample semi-supervised data-generating processes, together with the corresponding efficient estimators that attain those bounds.
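For orientation, the labeled-only benchmark that these bounds are compared against can be written out. The block below states the classical efficient influence function and efficiency bound for the average treatment effect under unconfoundedness and overlap (Hahn, 1998). The symbols \mu_d, e, \sigma_d^2 and the superscript "lab" are notation assumed here for illustration, not taken from the paper, and the paper's own semi-supervised bounds are not reproduced.

```latex
% Assumed notation: \mu_d(X) = E[Y | D = d, X], e(X) = P(D = 1 | X),
% \sigma_d^2(X) = Var(Y | D = d, X), \tau_0 = true average treatment effect.
\psi^{\mathrm{lab}}(X, D, Y)
  = \mu_1(X) - \mu_0(X)
  + \frac{D\,\bigl(Y - \mu_1(X)\bigr)}{e(X)}
  - \frac{(1 - D)\,\bigl(Y - \mu_0(X)\bigr)}{1 - e(X)}
  - \tau_0,
\qquad
V^{\mathrm{lab}}
  = \mathbb{E}\bigl[\psi^{\mathrm{lab}}(X, D, Y)^2\bigr]
  = \mathbb{E}\!\left[
      \frac{\sigma_1^2(X)}{e(X)}
    + \frac{\sigma_0^2(X)}{1 - e(X)}
    + \bigl(\mu_1(X) - \mu_0(X) - \tau_0\bigr)^2
    \right].
```

The last term is the variance of the conditional effect \mu_1(X) - \mu_0(X), which depends on the covariate distribution alone. That is the component one would expect additional covariate-only observations to help estimate more precisely.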
If this is right
- The efficiency bound is strictly lower once auxiliary covariates enter the problem.
- Efficient estimators exist whose asymptotic variance matches the improved bound in each setting.
- The variance reduction holds without requiring labels on the auxiliary covariates.
- The same improvement appears in both the one-sample and two-sample designs.
Where Pith is reading between the lines
- Practitioners could collect inexpensive auxiliary covariates to tighten causal estimates whenever full labeling is costly.
- The efficiency argument may carry over to other causal functionals beyond the average treatment effect.
- Finite-sample behavior and robustness to model misspecification remain open questions suggested by the asymptotic results.
Load-bearing premise
The one-sample and two-sample data-generating processes permit derivation of achievable efficiency bounds under standard regularity conditions for asymptotic analysis of estimators.
What would settle it
An estimator that incorporates the auxiliary covariates yet exhibits the same or larger asymptotic variance as the estimator that ignores them, in either the one-sample or two-sample regime, would falsify the central claim.
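A check of this kind can be sketched as a small Monte Carlo experiment. The code below is a toy illustration, not the paper's estimator: it assumes a one-sample design in which treatment and outcome are observed completely at random for a fraction of units, a randomized treatment with known propensity 0.5, a scalar covariate, and correctly specified linear outcome models. The function names `simulate_once` and `monte_carlo` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_once(rng, n=2000, label_frac=0.2, sigma=0.5):
    """One draw of a toy one-sample design.

    Returns (labeled-only estimate, semi-supervised estimate) of the ATE.
    """
    x = rng.normal(size=n)                    # covariate, observed for every unit
    d = rng.integers(0, 2, size=n)            # randomized treatment, propensity 0.5
    y = 2.0 * x + d * (1.0 + x) + sigma * rng.normal(size=n)  # true ATE is 1
    labeled = rng.random(n) < label_frac      # (D, Y) are used only where True

    xl, dl, yl = x[labeled], d[labeled], y[labeled]
    # Per-arm linear outcome regressions fitted on labeled units only.
    b1 = np.polyfit(xl[dl == 1], yl[dl == 1], deg=1)
    b0 = np.polyfit(xl[dl == 0], yl[dl == 0], deg=1)

    def cate(v):
        return np.polyval(b1, v) - np.polyval(b0, v)

    # Inverse-propensity residual correction on labeled units. With per-arm
    # OLS including an intercept this is numerically close to zero; it is
    # kept so the structure carries over to other nuisance estimators.
    resid1 = yl - np.polyval(b1, xl)
    resid0 = yl - np.polyval(b0, xl)
    correction = np.mean(dl * resid1 / 0.5 - (1 - dl) * resid0 / 0.5)

    tau_labeled = np.mean(cate(xl)) + correction  # averages over labeled X only
    tau_semi = np.mean(cate(x)) + correction      # averages over all observed X
    return tau_labeled, tau_semi

def monte_carlo(reps=300, seed=0):
    """Repeat the toy experiment and return both arrays of estimates."""
    rng = np.random.default_rng(seed)
    draws = np.array([simulate_once(rng) for _ in range(reps)])
    return draws[:, 0], draws[:, 1]

if __name__ == "__main__":
    lab, semi = monte_carlo()
    print("labeled-only variance:   ", lab.var(ddof=1))
    print("semi-supervised variance:", semi.var(ddof=1))
```

In this toy design the two estimators differ only in which covariates the fitted conditional effect is averaged over, so any variance gap comes from the covariate-only units. A falsifying result in the sense above would be a design satisfying the paper's assumptions in which the second variance is not smaller.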
Original abstract
This study investigates treatment effect estimation in the semi-supervised setting, which can also be interpreted as prediction-powered inference. In our setting, we can use not only the standard triple of covariates, treatment indicator, and outcome, but also unlabeled auxiliary covariates. For this problem, we develop efficiency bounds and efficient estimators whose asymptotic variance attains the efficiency bound. In the analysis, we introduce two different data-generating processes: the one-sample setting and the two-sample setting. The one-sample setting considers the case where we can observe treatment indicators and outcomes for only part of the dataset, which is also called the censoring setting. In contrast, the two-sample setting considers two independent datasets, one labeled and one unlabeled, which is also called the case-control setting or the stratified setting. In both settings, we find that by incorporating auxiliary covariates, we can lower the efficiency bound and obtain an estimator with an asymptotic variance smaller than that obtained without such auxiliary covariates. We refer to this framework as prediction-powered causal inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops semiparametric efficiency bounds and matching estimators for the average treatment effect in two semi-supervised regimes (one-sample censoring and two-sample case-control) that incorporate unlabeled auxiliary covariates. It derives the bounds under standard regularity conditions, constructs estimators whose asymptotic variance matches the bound, and shows that the auxiliary covariates strictly lower the bound relative to the labeled-only case, framing the approach as prediction-powered causal inference.
Significance. If the bounds are correctly derived and the estimators attain them, the work supplies a rigorous efficiency theory for using abundant unlabeled covariates in causal estimation, which is practically relevant when labeled outcomes are expensive. The explicit comparison of one-sample versus two-sample settings and the demonstration of variance reduction constitute a clear theoretical contribution.
major comments (3)
- [§4.1, Theorem 1] The efficiency bound for the one-sample setting is stated to be strictly smaller when auxiliary covariates are included, yet the proof sketch does not explicitly verify that the additional covariates enter the efficient influence function in a way that reduces the variance term without introducing new bias; a direct comparison of the two influence functions (with and without auxiliaries) is needed to confirm the reduction is not an artifact of the censoring mechanism.
- [§5.2, Eq. (18)] The claim that the proposed estimator attains the efficiency bound relies on nuisance estimators (propensity and outcome regression) trained on the pooled labeled+unlabeled sample satisfying product-rate conditions o_p(n^{-1/2}). The manuscript provides no entropy or Donsker-class arguments for the function classes that now include the auxiliary covariates, leaving open whether the semi-supervised nuisance rates are sufficient for asymptotic efficiency.
- [§6, simulation design] The reported variance reduction is shown only for correctly specified parametric nuisances; it is unclear whether the same reduction persists under nonparametric nuisance estimation with the auxiliary covariates, which is the regime where the efficiency-bound claim is most relevant.
minor comments (2)
- Notation for the auxiliary covariate vector is introduced inconsistently between the one-sample and two-sample sections; a single global definition would improve readability.
- The abstract states that the estimators are 'efficient' but the main text should explicitly reference the theorem number that establishes asymptotic normality and efficiency.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.
- Referee: [§4.1, Theorem 1] The efficiency bound for the one-sample setting is stated to be strictly smaller when auxiliary covariates are included, yet the proof sketch does not explicitly verify that the additional covariates enter the efficient influence function in a way that reduces the variance term without introducing new bias; a direct comparison of the two influence functions (with and without auxiliaries) is needed to confirm the reduction is not an artifact of the censoring mechanism.
  Authors: We agree that an explicit side-by-side comparison of the efficient influence functions would clarify the source of the variance reduction. In the revised manuscript we will insert a direct comparison of the EIFs (with and without auxiliary covariates) for the one-sample setting. This comparison will show that the auxiliary covariates enter only through an additional variance-reduction term in the EIF while leaving the bias term unchanged, confirming that the efficiency gain is not an artifact of the censoring mechanism.
  Revision: yes
- Referee: [§5.2, Eq. (18)] The claim that the proposed estimator attains the efficiency bound relies on nuisance estimators (propensity and outcome regression) trained on the pooled labeled+unlabeled sample satisfying product-rate conditions o_p(n^{-1/2}). The manuscript provides no entropy or Donsker-class arguments for the function classes that now include the auxiliary covariates, leaving open whether the semi-supervised nuisance rates are sufficient for asymptotic efficiency.
  Authors: The referee correctly identifies that the current text assumes the product-rate conditions without supplying supporting entropy or Donsker arguments for the enlarged function classes. We will revise Section 5.2 to include explicit entropy-integral bounds (or Donsker-class assumptions) that cover the semi-supervised nuisance estimators trained on the pooled sample, thereby rigorously justifying that the required o_p(n^{-1/2}) rates are attainable.
  Revision: yes
- Referee: [§6, simulation design] The reported variance reduction is shown only for correctly specified parametric nuisances; it is unclear whether the same reduction persists under nonparametric nuisance estimation with the auxiliary covariates, which is the regime where the efficiency-bound claim is most relevant.
  Authors: We acknowledge that the present simulations are limited to correctly specified parametric nuisances. To address this gap we will expand the simulation study to include nonparametric nuisance estimators (e.g., random forests and neural networks) trained on the pooled labeled-plus-unlabeled data. The new experiments will report the realized variance reduction under these nonparametric regimes, directly supporting the efficiency-bound claims.
  Revision: yes
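The second and third exchanges both turn on nonparametric nuisance estimation. In the double/debiased machine learning literature, cross-fitting (sample splitting) is a common way to avoid Donsker-class conditions, and it is one route the promised experiments could take. The sketch below is a minimal illustration under stated assumptions, not the authors' procedure: a one-sample design with labels missing completely at random, a scalar covariate, and Nadaraya-Watson kernel regressions for the nuisances. The names `nw_predict` and `cross_fit_semi_ate`, the bandwidth, and the clipping thresholds are hypothetical choices, and the code does not verify any rate condition.

```python
import numpy as np

def nw_predict(x_train, t_train, x_eval, bandwidth=0.3):
    """Nadaraya-Watson regression of t on x with a Gaussian kernel."""
    z = (x_eval[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * z**2)
    return (w @ t_train) / np.clip(w.sum(axis=1), 1e-12, None)

def cross_fit_semi_ate(x_lab, d_lab, y_lab, x_unl, n_folds=2, seed=0):
    """Cross-fitted AIPW-style ATE estimate from labeled and covariate-only data."""
    rng = np.random.default_rng(seed)
    fold_lab = rng.permutation(x_lab.size) % n_folds   # balanced fold labels
    fold_unl = rng.permutation(x_unl.size) % n_folds
    reg_sum, corr_sum = 0.0, 0.0
    for k in range(n_folds):
        train, test = fold_lab != k, fold_lab == k
        x_tr, d_tr, y_tr = x_lab[train], d_lab[train], y_lab[train]
        x_te, d_te, y_te = x_lab[test], d_lab[test], y_lab[test]

        # Nuisances are fitted on the other folds only.
        def mu1(v):
            return nw_predict(x_tr[d_tr == 1], y_tr[d_tr == 1], v)

        def mu0(v):
            return nw_predict(x_tr[d_tr == 0], y_tr[d_tr == 0], v)

        # Estimated propensity on held-out labeled units, clipped away from 0 and 1.
        e_te = np.clip(nw_predict(x_tr, d_tr.astype(float), x_te), 0.05, 0.95)

        # Residual correction: held-out labeled units only.
        corr_sum += np.sum(d_te * (y_te - mu1(x_te)) / e_te
                           - (1 - d_te) * (y_te - mu0(x_te)) / (1 - e_te))
        # Regression term: held-out labeled units plus this fold's unlabeled units.
        x_eval = np.concatenate([x_te, x_unl[fold_unl == k]])
        reg_sum += np.sum(mu1(x_eval) - mu0(x_eval))
    n_total = x_lab.size + x_unl.size
    return reg_sum / n_total + corr_sum / x_lab.size
```

The design choice worth noting is that unlabeled units never enter nuisance training here, because treatment and outcome are unobserved for them; they contribute only to the average of the fitted conditional effect.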
Circularity Check
No circularity: efficiency bounds derived independently via semiparametric theory
The paper derives efficiency bounds for treatment effect estimation in one-sample (censoring) and two-sample settings by incorporating auxiliary covariates into the data-generating process, then constructs estimators whose asymptotic variance matches the bound under standard regularity conditions. This follows conventional influence-function and semiparametric efficiency arguments without reducing to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The claim that auxiliary covariates lower the bound is a direct consequence of the expanded model class rather than an input-output equivalence by construction. The derivations remain self-contained against external benchmarks in semiparametric statistics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We develop efficiency bounds and efficient estimators whose asymptotic variance aligns with the efficiency bound... using generalized Riesz regression... Neyman orthogonal scores
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: efficiency bound... V^OS := E[ψ_OS(...)^2] ... asymptotic normality √n(τ̂ - τ0) → N(0, V^OS)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.