pith. sign in

arxiv: 2511.08303 · v2 · submitted 2025-11-11 · 📊 stat.ML · cs.LG· econ.EM· math.ST· stat.ME· stat.TH

Semi-Supervised Treatment Effect Estimation with Unlabeled Covariates for Prediction-Powered Causal Inference

Pith reviewed 2026-05-17 23:44 UTC · model grok-4.3

classification 📊 stat.ML cs.LGecon.EMmath.STstat.MEstat.TH
keywords semi-supervised learningtreatment effect estimationcausal inferenceefficiency boundsunlabeled covariatesprediction-powered inferenceasymptotic variance
0
0 comments X

The pith

Incorporating auxiliary covariates lowers the efficiency bound for treatment effect estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines treatment effect estimation in a semi-supervised regime where only some observations include the treatment indicator and outcome, but additional unlabeled covariates are available for all units. It analyzes this under two data-generating processes: a one-sample setting in which labels appear on a subset of a single dataset, and a two-sample setting with separate labeled and unlabeled collections. Efficiency bounds are derived for both cases, and the central result is that the auxiliary covariates tighten the bound, so that efficient estimators attain strictly smaller asymptotic variance than estimators that ignore the extra covariates. The work frames the procedure as prediction-powered causal inference.

Core claim

In both the one-sample (censoring) and two-sample (case-control) settings, incorporating auxiliary unlabeled covariates lowers the efficiency bound and yields estimators whose asymptotic variance is smaller than that of estimators that use only the labeled triple of covariates, treatment, and outcome.

What carries the argument

Efficiency bounds derived separately for the one-sample and two-sample semi-supervised data-generating processes, together with the corresponding efficient estimators that attain those bounds.

If this is right

  • The efficiency bound is strictly lower once auxiliary covariates enter the problem.
  • Efficient estimators exist whose asymptotic variance matches the improved bound in each setting.
  • The variance reduction holds without requiring labels on the auxiliary covariates.
  • The same improvement appears in both the one-sample and two-sample designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could collect inexpensive auxiliary covariates to tighten causal estimates whenever full labeling is costly.
  • The efficiency argument may carry over to other causal functionals beyond the average treatment effect.
  • Finite-sample behavior and robustness to model misspecification remain open questions suggested by the asymptotic results.

Load-bearing premise

The one-sample and two-sample data-generating processes permit derivation of achievable efficiency bounds under standard regularity conditions for asymptotic analysis of estimators.

What would settle it

An estimator that incorporates the auxiliary covariates yet exhibits the same or larger asymptotic variance as the estimator that ignores them, in either the one-sample or two-sample regime, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2511.08303 by Masahiro Kato.

Figure 1
Figure 1. Figure 1: Illustration of the one-sample and two-sample scenarios. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
read the original abstract

This study investigates treatment effect estimation in the semi-supervised setting, also can be interpreted as prediction-powered inference. In our setting, we can use not only the standard triple of covariates, treatment indicator, and outcome, but also unlabeled auxiliary covariates. For this problem, we develop efficiency bounds and efficient estimators whose asymptotic variance aligns with the efficiency bound. In the analysis, we introduce two different data-generating processes: the one-sample setting and the two-sample setting. The one-sample setting considers the case where we can observe treatment indicators and outcomes for a part of the dataset, which is also called the censoring setting. In contrast, the two-sample setting considers two independent datasets with labeled and unlabeled data, which is also called the case-control setting or the stratified setting. In both settings, we find that by incorporating auxiliary covariates, we can lower the efficiency bound and obtain an estimator with an asymptotic variance smaller than that without such auxiliary covariates. We frame our framework as prediction-powered causal inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops semiparametric efficiency bounds and matching estimators for the average treatment effect in two semi-supervised regimes (one-sample censoring and two-sample case-control) that incorporate unlabeled auxiliary covariates. It derives the bounds under standard regularity conditions, constructs estimators whose asymptotic variance matches the bound, and shows that the auxiliary covariates strictly lower the bound relative to the labeled-only case, framing the approach as prediction-powered causal inference.

Significance. If the bounds are correctly derived and the estimators attain them, the work supplies a rigorous efficiency theory for using abundant unlabeled covariates in causal estimation, which is practically relevant when labeled outcomes are expensive. The explicit comparison of one-sample versus two-sample settings and the demonstration of variance reduction constitute a clear theoretical contribution.

major comments (3)
  1. [§4.1, Theorem 1] §4.1, Theorem 1: the efficiency bound for the one-sample setting is stated to be strictly smaller when auxiliary covariates are included, yet the proof sketch does not explicitly verify that the additional covariates enter the efficient influence function in a way that reduces the variance term without introducing new bias; a direct comparison of the two influence functions (with and without auxiliaries) is needed to confirm the reduction is not an artifact of the censoring mechanism.
  2. [§5.2, Eq. (18)] §5.2, Eq. (18): the claim that the proposed estimator attains the efficiency bound relies on nuisance estimators (propensity and outcome regression) trained on the pooled labeled+unlabeled sample satisfying product-rate conditions o_p(n^{-1/2}). The manuscript provides no entropy or Donsker-class arguments for the function classes that now include the auxiliary covariates, leaving open whether the semi-supervised nuisance rates are sufficient for asymptotic efficiency.
  3. [§6] §6, simulation design: the reported variance reduction is shown only for correctly specified parametric nuisances; it is unclear whether the same reduction persists under nonparametric nuisance estimation with the auxiliary covariates, which is the regime where the efficiency-bound claim is most relevant.
minor comments (2)
  1. Notation for the auxiliary covariate vector is introduced inconsistently between the one-sample and two-sample sections; a single global definition would improve readability.
  2. The abstract states that the estimators are 'efficient' but the main text should explicitly reference the theorem number that establishes asymptotic normality and efficiency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.1, Theorem 1] §4.1, Theorem 1: the efficiency bound for the one-sample setting is stated to be strictly smaller when auxiliary covariates are included, yet the proof sketch does not explicitly verify that the additional covariates enter the efficient influence function in a way that reduces the variance term without introducing new bias; a direct comparison of the two influence functions (with and without auxiliaries) is needed to confirm the reduction is not an artifact of the censoring mechanism.

    Authors: We agree that an explicit side-by-side comparison of the efficient influence functions would clarify the source of the variance reduction. In the revised manuscript we will insert a direct comparison of the EIFs (with and without auxiliary covariates) for the one-sample setting. This comparison will show that the auxiliary covariates enter only through an additional variance-reduction term in the EIF while leaving the bias term unchanged, confirming that the efficiency gain is not an artifact of the censoring mechanism. revision: yes

  2. Referee: [§5.2, Eq. (18)] §5.2, Eq. (18): the claim that the proposed estimator attains the efficiency bound relies on nuisance estimators (propensity and outcome regression) trained on the pooled labeled+unlabeled sample satisfying product-rate conditions o_p(n^{-1/2}). The manuscript provides no entropy or Donsker-class arguments for the function classes that now include the auxiliary covariates, leaving open whether the semi-supervised nuisance rates are sufficient for asymptotic efficiency.

    Authors: The referee correctly identifies that the current text assumes the product-rate conditions without supplying supporting entropy or Donsker arguments for the enlarged function classes. We will revise Section 5.2 to include explicit entropy-integral bounds (or Donsker-class assumptions) that cover the semi-supervised nuisance estimators trained on the pooled sample, thereby rigorously justifying that the required o_p(n^{-1/2}) rates are attainable. revision: yes

  3. Referee: [§6] §6, simulation design: the reported variance reduction is shown only for correctly specified parametric nuisances; it is unclear whether the same reduction persists under nonparametric nuisance estimation with the auxiliary covariates, which is the regime where the efficiency-bound claim is most relevant.

    Authors: We acknowledge that the present simulations are limited to correctly specified parametric nuisances. To address this gap we will expand the simulation study to include nonparametric nuisance estimators (e.g., random forests and neural networks) trained on the pooled labeled-plus-unlabeled data. The new experiments will report the realized variance reduction under these nonparametric regimes, directly supporting the efficiency-bound claims. revision: yes

Circularity Check

0 steps flagged

No circularity: efficiency bounds derived independently via semiparametric theory

full rationale

The paper derives efficiency bounds for treatment effect estimation in one-sample (censoring) and two-sample settings by incorporating auxiliary covariates into the data-generating process, then constructs estimators whose asymptotic variance matches the bound under standard regularity conditions. This follows conventional influence-function and semiparametric efficiency arguments without reducing to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The claim that auxiliary covariates lower the bound is a direct consequence of the expanded model class rather than an input-output equivalence by construction. The derivations remain self-contained against external benchmarks in semiparametric statistics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the claims rest on unspecified standard statistical regularity conditions for efficiency bounds.

pith-pipeline@v0.9.0 · 5481 in / 1058 out tokens · 34769 ms · 2026-05-17T23:44:32.786338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

  1. [1]

    Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao

    David Azriel, Lawrence D. Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117 0 (540): 0 2238--2251, 2022

  2. [2]

    Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61 0 (4): 0 962--973, 2005

  3. [3]

    Augmented balancing weights as linear regression

    David Bruns-Smith, Oliver Dukes, Avi Feller, and Elizabeth L Ogburn. Augmented balancing weights as linear regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 04 2025

  4. [4]

    Semi-Supervised Learning

    Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006

  5. [5]

    Double/debiased machine learning for treatment and structural parameters

    Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 2018

  6. [6]

    arXiv preprint arXiv:2104.14737 , year=

    Victor Chernozhukov, Whitney K. Newey, Victor Quintas-Martinez, and Vasilis Syrgkanis. Automatic debiased machine learning via riesz regression, 2021. a rXiv:2104.14737

  7. [7]

    R iesz N et and F orest R iesz: Automatic debiased machine learning with neural nets and random forests

    Victor Chernozhukov, Whitney Newey, V\' ctor M Quintas-Mart\' nez, and Vasilis Syrgkanis. R iesz N et and F orest R iesz: Automatic debiased machine learning with neural nets and random forests. In International Conference on Machine Learning (ICML), 2022 a

  8. [8]

    Newey, and Rahul Singh

    Victor Chernozhukov, Whitney K. Newey, and Rahul Singh. Automatic debiased machine learning of causal and structural effects. Econometrica, 90 0 (3): 0 967--1027, 2022 b

  9. [9]

    Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms

    Alicia Curth and Mihaela van der Schaar. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021

  10. [10]

    Niu, and Masashi Sugiyama

    Marthinus Christoffel du Plessis, Gang. Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In International Conference on Machine Learning (ICML), 2015

  11. [11]

    Learning classifiers from only positive and unlabeled data

    Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In International Conference on Knowledge Discovery and Data Mining (KDD), 2008

  12. [12]

    Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey

    Vitor Hadad, David A. Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences (PNAS), 118 0 (15), 2021

  13. [13]

    On the role of the propensity score in efficient semiparametric estimation of average treatment effects

    Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66 0 (2): 0 315--331, 1998

  14. [14]

    Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies

    Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20 0 (1): 0 25--46, 2012

  15. [15]

    Shadow prices, market wages, and labor supply

    James Heckman. Shadow prices, market wages, and labor supply. Econometrica, 42 0 (4): 0 679--694, 1974

  16. [16]

    Horvitz and Donovan J

    Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47 0 (260): 0 663--685, 1952

  17. [17]

    Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Sch \"o lkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In NeurIPS, pp.\ 601--608. MIT Press, 2007

  18. [18]

    Estimation of heterogeneous treatment effects from randomized experiments, with application to the optimal planning of the get-out-the-vote campaign

    Kosuke Imai and Aaron Strauss. Estimation of heterogeneous treatment effects from randomized experiments, with application to the optimal planning of the get-out-the-vote campaign. Political Analysis, 19 0 (1): 0 1--19, 2011

  19. [19]

    Imbens and Tony Lancaster

    Guido W. Imbens and Tony Lancaster. Efficient estimation and stratified sampling. Journal of Econometrics, 74 0 (2): 0 289--318, 1996

  20. [20]

    Imbens and Donald B

    Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015

  21. [21]

    A least-squares approach to direct importance estimation

    Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10 0 (Jul.): 0 1391--1445, 2009

  22. [22]

    Direct bias-correction term estimation for propensity scores and average treatment effect estimation, 2025 a

    Masahiro Kato. Direct bias-correction term estimation for propensity scores and average treatment effect estimation, 2025 a . a rXiv: 2509.22122

  23. [23]

    Direct debiased machine learning via bregman divergence minimization, 2025 b

    Masahiro Kato. Direct debiased machine learning via bregman divergence minimization, 2025 b . a Xiv: 2510.23534

  24. [24]

    Nearest neighbor matching as least squares density ratio estimation and riesz regression, 2025 c

    Masahiro Kato. Nearest neighbor matching as least squares density ratio estimation and riesz regression, 2025 c . a rXiv: 2510.24433

  25. [25]

    A unified theory for causal inference: Direct debiased machine learning via bregman-riesz regression, 2025 d

    Masahiro Kato. A unified theory for causal inference: Direct debiased machine learning via bregman-riesz regression, 2025 d

  26. [26]

    Non-negative bregman divergence minimization for deep direct density ratio estimation

    Masahiro Kato and Takeshi Teshima. Non-negative bregman divergence minimization for deep direct density ratio estimation. In International Conference on Machine Learning (ICML), 2021

  27. [27]

    Active adaptive experimental design for treatment effect estimation with covariate choice

    Masahiro Kato, Akihiro Oga, Wataru Komatsubara, and Ryo Inokuchi. Active adaptive experimental design for treatment effect estimation with covariate choice. In International Conference on Machine Learning (ICML), 2024

  28. [28]

    Puate: Semiparametric efficient average treatment effect estimation from treated (positive) and unlabeled units, 2025

    Masahiro Kato, Fumiaki Kozai, and Ryo Inokuchi. Puate: Semiparametric efficient average treatment effect estimation from treated (positive) and unlabeled units, 2025. a rXiv:2501.19345

  29. [29]

    Semi-supervised learning with density-ratio estimation

    Masanori Kawakita and Takafumi Kanamori. Semi-supervised learning with density-ratio estimation. Machine Learning, 91 0 (2): 0 189--209, 2013

  30. [30]

    Edward H. Kennedy. Efficient nonparametric causal inference with missing exposure information. The International Journal of Biostatistics, 16 0 (1), 2020

  31. [31]

    Kennedy, Sivaraman Balakrishnan, James M

    Edward H. Kennedy, Sivaraman Balakrishnan, James M. Robins, and Larry Wasserman. Minimax rates for heterogeneous causal effect estimation. The Annals of Statistics, 52 0 (2): 0 793 -- 816, 2024

  32. [32]

    Positive-unlabeled learning with non-negative risk estimator

    Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  33. [33]

    Chris A. J. Klaassen. Consistent estimation of the influence function of locally asymptotically linear estimators. Annals of Statistics, 15, 1987

  34. [34]

    Lee and Alejandro Schuler

    Kaitlyn J. Lee and Alejandro Schuler. Rieszboost: Gradient boosting for riesz regression, 2025. a rXiv: 2501.04871

  35. [35]

    Estimation based on nearest neighbor matching: from density ratio to average treatment effect

    Zhexiao Lin, Peng Ding, and Fang Han. Estimation based on nearest neighbor matching: from density ratio to average treatment effect. Econometrica, 91 0 (6): 0 2187--2217, 2023

  36. [36]

    Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes

    Jerzy Neyman. Sur les applications de la theorie des probabilites aux experiences agricoles: Essai des principes. Statistical Science, 5: 0 463--472, 1923

  37. [37]

    Theoretical comparisons of positive-unlabeled learning against positive-negative learning

    Gang Niu, Marthinus Christoffel du Plessis, Tomoya Sakai, Yao Ma, and Masashi Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016

  38. [38]

    Benjamin Rhodes, Kai Xu, and Michael U. Gutmann. Telescoping density-ratio estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  39. [39]

    J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89: 0 846--866, 1994

  40. [40]

    Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66: 0 688--701, 1974

  41. [41]

    Nonparametric regression using deep neural networks with ReLU activation function

    Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48 0 (4): 0 1875--1897, 2020

  42. [42]

    Density ratio matching under the bregman divergence: A unified framework of density ratio estimation

    Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio matching under the bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 10 2011

  43. [43]

    Density Ratio Estimation in Machine Learning

    Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012

  44. [44]

    Tsybakov

    Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008

  45. [45]

    Off-policy evaluation and learning for external validity under a covariate shift

    Masatoshi Uehara, Masahiro Kato, and Shota Yasui. Off-policy evaluation and learning for external validity under a covariate shift. In Conference on Neural Information Processing Systems (NeurIPS), 2020

  46. [46]

    van der Laan and S

    M.J. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics. Springer New York, 2011

  47. [47]

    van der Vaart

    Aad W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998

  48. [48]

    Estimation and inference of heterogeneous treatment effects using random forests

    Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113 0 (523): 0 1228--1242, 2018

  49. [49]

    Wooldridge

    Jeffrey M. Wooldridge. Asymptotic properties of weighted m-estimation for standard stratified samples. Econometric Theory, 2001

  50. [50]

    Relative density-ratio estimation for robust distribution comparison

    Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems (NeurIPS), volume 24. Curran Associates, Inc., 2011

  51. [51]

    Policy learning with adaptively collected data

    Ruohan Zhan, Zhimei Ren, Susan Athey, and Zhengyuan Zhou. Policy learning with adaptively collected data. Management Science, 70 0 (8): 0 5270--5297, 2024

  52. [52]

    Covariate balancing propensity score by tailored loss functions

    Qingyuan Zhao. Covariate balancing propensity score by tailored loss functions. The Annals of Statistics, 47 0 (2): 0 965 -- 993, 2019

  53. [53]

    Zubizarreta

    Jos \'e R. Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110 0 (511): 0 910--922, 2015

  54. [54]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...