Robust Causal Inference for EHR-based Studies of Point Exposures with Missingness in Eligibility Criteria
Pith reviewed 2026-05-22 18:16 UTC · model grok-4.3
The pith
A new estimator recovers the average treatment effect on the treated from EHR data even when eligibility covariates are missing at random.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference.
What carries the argument
A robust estimator of the average treatment effect on the treated that adjusts for missing-at-random eligibility covariates and permits machine-learning estimation of nuisance functions at rates that preserve asymptotic normality.
If this is right
- Patients with incomplete eligibility information can be retained in the analysis without selection bias.
- Machine-learning methods can be plugged in for nuisance functions without invalidating inference.
- Valid confidence intervals are available for the treatment effect defined in the eligible population.
- The same framework applies directly to observational comparisons of bariatric surgeries for long-term weight and glycemic control.
Where Pith is reading between the lines
- Similar missing-data adjustments could be useful in other observational studies that rely on baseline covariates recorded in electronic records.
- The approach might be extended to handle missingness in time-varying eligibility criteria or in survival outcomes.
- Testing the method on data sets with deliberately introduced missingness patterns would quantify finite-sample performance.
Load-bearing premise
Eligibility-defining covariates are missing at random given the observed data, and the machine-learning estimators for the nuisance functions converge fast enough to keep the overall estimator asymptotically normal.
What would settle it
Simulate data with a known true average treatment effect on the treated, impose missingness at random on eligibility covariates, and check whether the estimator recovers the true effect within sampling variability.
Figures
read the original abstract
Missingness in variables that define study eligibility criteria is a seldom addressed challenge in electronic health record (EHR)-based settings. It is typically the case that patients with incomplete eligibility information are excluded from analysis without consideration of (implicit) assumptions that are being made, leaving study conclusions subject to potential selection bias. In an effort to ascertain eligibility for more patients, researchers may look back further in time prior to study baseline, and in using outdated values of eligibility-defining covariates may inappropriately be including individuals who, unbeknownst to the researcher, fail to meet eligibility at baseline. To the best of our knowledge, however, very little work has been done to mitigate these concerns. We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference. This method is directly motivated by, and applied throughout to EHR data from Kaiser Permanente to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a robust and efficient estimator for the causal average treatment effect on the treated (ATT) defined in the study-eligible population for cohort studies using electronic health record (EHR) data, where eligibility-defining covariates are missing at random. The method allows for flexible machine learning estimation of nuisance functions while ensuring appropriate convergence rates for valid asymptotic inference. It is motivated and applied to data from Kaiser Permanente on bariatric surgical interventions for obese patients with type II diabetes.
Significance. If the theoretical results and supporting derivations hold, this addresses a practical gap in EHR-based causal inference by enabling inclusion of patients with incomplete eligibility data without selection bias under MAR. The integration of semiparametric efficiency theory with modern ML nuisance estimation is a clear strength, as is the direct application to real Kaiser Permanente bariatric surgery data for weight and glycemic outcomes. This could support more inclusive observational analyses in health services research.
major comments (2)
- [Identification and Assumptions] The identification of the eligible-population ATT relies on the MAR assumption for eligibility covariates conditional on observed data (as noted in the abstract and motivating setup). The paper should include a dedicated discussion or sensitivity analysis for plausible violations in EHR contexts, where missingness may correlate with unmeasured factors influencing treatment assignment or outcomes; this is load-bearing for recovering the correct target subpopulation even if nuisance convergence rates hold.
- [Theoretical Properties] The abstract states that the estimator maintains appropriate convergence rates for asymptotic inference under flexible ML strategies, but the full manuscript should provide explicit rate conditions (e.g., n^{-1/4} or faster) and verification steps for the specific nuisance functions involved in the reweighting or imputation for missing eligibility covariates.
minor comments (2)
- [Application to Kaiser Permanente Data] In the application section, report the proportion of patients with missing eligibility covariates and describe how the observed-data likelihood is constructed to recover eligible-population quantities.
- [Notation and Setup] Clarify notation for the observed versus full-data quantities when eligibility covariates are missing to avoid ambiguity in the estimator definition.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We appreciate the recognition of the practical relevance of our doubly robust estimator for the eligible-population ATT under missing eligibility covariates in EHR data. We address the major comments below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Identification and Assumptions] The identification of the eligible-population ATT relies on the MAR assumption for eligibility covariates conditional on observed data (as noted in the abstract and motivating setup). The paper should include a dedicated discussion or sensitivity analysis for plausible violations in EHR contexts, where missingness may correlate with unmeasured factors influencing treatment assignment or outcomes; this is load-bearing for recovering the correct target subpopulation even if nuisance convergence rates hold.
Authors: We agree that the MAR assumption is foundational to identification and merits explicit discussion in EHR settings. In the revised manuscript, we will add a new subsection in the Discussion that elaborates on the plausibility of MAR for eligibility covariates (e.g., missingness due to irregular care-seeking patterns that may be independent of unmeasured confounders given observed data). We will also include a sensitivity analysis, implemented via a simulation study that introduces controlled violations of MAR and reports the resulting bias in ATT estimates, to illustrate robustness and limitations. revision: yes
-
Referee: [Theoretical Properties] The abstract states that the estimator maintains appropriate convergence rates for asymptotic inference under flexible ML strategies, but the full manuscript should provide explicit rate conditions (e.g., n^{-1/4} or faster) and verification steps for the specific nuisance functions involved in the reweighting or imputation for missing eligibility covariates.
Authors: We thank the referee for this suggestion. The current manuscript states the general rate conditions required for the asymptotic normality result in Theorem 1 (nuisance estimators converging faster than n^{-1/4} so that the cross-term remainder is o_p(n^{-1/2})), but we will make these conditions more explicit in the revised version. We will add a dedicated remark specifying the required rates for each nuisance function (propensity score, outcome regression, and missingness model) and include brief verification guidance, such as references to known convergence rates for common ML methods (e.g., random forests or neural networks) under standard regularity conditions. revision: yes
Circularity Check
Derivation follows standard semiparametric efficiency theory under MAR; no reduction of target functional to fitted inputs or self-citation chain.
full rationale
The paper constructs its estimator for the ATT in the eligible subpopulation by combining the missing-at-random assumption on eligibility covariates with standard doubly robust or efficient influence function techniques from semiparametric theory. The target parameter is identified directly from the observed-data likelihood under the stated MAR condition, and nuisance functions are estimated at the required rates without the estimator being defined in terms of itself or a post-hoc fit. No load-bearing step reduces by the paper's own equations to a self-citation or to a quantity that is tautologically equal to its inputs. The approach is therefore self-contained against external benchmarks and receives a zero circularity score.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Eligibility-defining covariates are missing at random conditional on observed data.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosurereality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Density ratio estimation in machine learning
Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning . Cambridge University Press, 2012
work page 2012
-
[2]
Iv´ an D ´ ıaz, Nicholas Williams, and Edward J. Hoffman Katherine L., Schenck. Non- parametric causal effects based on longitudinal modified tre atment policies. Journal of the American Statistical Association , 118(542):846–857, 2023
work page 2023
-
[3]
E. H. Kennedy, A. Sj¨ olander, and D. S. Small. Semiparamet ric causal inference in matched cohort studies. Biometrika, 102(3):739–746, September 2015
work page 2015
-
[4]
Andrea Mercatanti and Fan Li. Do debit cards increase hous ehold spending? evi- dence from a semiparametric causal analysis of a survey. Annals of Applied Statistics , 8(4):2485–2508, 2014
work page 2014
-
[5]
A dou bly robust weighting estimator of the average treatment effect on the treated
Erica EM Moodie, Olli Saarela, and David A Stephens. A dou bly robust weighting estimator of the average treatment effect on the treated. Stat, 7(1):e205, 2018
work page 2018
-
[6]
Semiparam etric counterfactual density estimation
EH Kennedy, S Balakrishnan, and LA Wasserman. Semiparam etric counterfactual density estimation. Biometrika, page asad017, 2023
work page 2023
-
[7]
Andrea Rotnitzky and Ezequiel Smucler. Efficient adjustme nt sets for population average causal treatment effect estimation in graphical mod els. Journal of Machine Learning Research, 21:1–86, 2020
work page 2020
-
[8]
Alexander W. Levis, Edward H. Kennedy, and Luke Keele. Nonpar ametric identifica- tion and efficient estimation of causal effects with instrumen tal variables, 2024
work page 2024
-
[9]
Efficient nonparametric causal inferen ce with missing exposure information
Edward H Kennedy. Efficient nonparametric causal inferen ce with missing exposure information. The International Journal of Biostatistics , 16(1), 2020
work page 2020
-
[10]
Li, Lisa Liu, David Arterburn, et al
Ron A. Li, Lisa Liu, David Arterburn, et al. Five-year longi tudinal cohort study of reinterventions after sleeve gastrectomy and roux-en-y gastric bypass. Annals of surgery, 273(4):758–765, 2021
work page 2021
-
[11]
We ight outcomes of sleeve gastrectomy and gastric bypass compared to nonsurgical tre atment
David E Arterburn, Eric Johnson, Karen J Coleman, et al. We ight outcomes of sleeve gastrectomy and gastric bypass compared to nonsurgical tre atment. Annals of Surgery, 274(6):e1269–e1276, 2020
work page 2020
-
[12]
Mark J van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical applications in genetics and molecular biology , 6:Article25, 2007
work page 2007
-
[13]
Superlearner: Super learner prediction
Eric Polley, Erin LeDell, Chris Kennedy, and Mark van de r Laan. Superlearner: Super learner prediction. https://CRAN.R-project.org/package=SuperLearner, 2023. R package version 2.0-28.1
work page 2023
- [14]
-
[15]
McTigue, Rebecca Wellman, Eric Nauman, et al
Kathleen M. McTigue, Rebecca Wellman, Eric Nauman, et al . Comparing the 5-year diabetes outcomes of sleeve gastrectomy and gastric bypass : The pcornet bariatric study. JAMA Surgery , 155(10):1–9, 2020. 35
work page 2020
-
[16]
National Kidney Foundatation. CKD-EPI Creatinine Equa tion. https://www.kidney.org/content/ckd-epi-creatinine-equation-2021, 2021
work page 2021
-
[17]
Adjustin g for Selection Bias due to Missing Eligibility Criteria in Emulated Target Trials
Luke Benz, Rajarshi Mukjerkee, Rui Wang, et al. Adjustin g for Selection Bias due to Missing Eligibility Criteria in Emulated Target Trials. American Journal of Epidemi- ology, 2024
work page 2024
-
[18]
Tanayott Thaweethai, David E. Arterburn, Karen J. Colema n, and Sebastien Haneuse. Robust inference when combining inverse-probability weig hting and multiple imputa- tion to address missing data with application to an electron ic health records-based study of bariatric surgery. Ann. Appl. Stat. , 15(1):126–147, 2021. 36
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.