arXiv: 2504.16230 · v3 · submitted 2025-04-22 · 📊 stat.ME · stat.AP

Robust Causal Inference for EHR-based Studies of Point Exposures with Missingness in Eligibility Criteria

Pith reviewed 2026-05-22 18:16 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords causal inference · missing data · electronic health records · average treatment effect · machine learning · eligibility criteria · bariatric surgery

The pith

A new estimator recovers the average treatment effect on the treated from EHR data even when eligibility covariates are missing at random.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical estimator for causal effects in cohort studies that use electronic health records. It targets the average treatment effect among patients who actually meet the eligibility criteria. Standard practice excludes patients with incomplete eligibility data, which can create selection bias. The proposed method incorporates those patients under a missing-at-random assumption while still supporting machine-learning fits for the required nuisance functions. It maintains the convergence rates needed for valid asymptotic inference and is demonstrated on Kaiser Permanente data comparing two bariatric procedures for weight and blood-sugar outcomes in patients with type II diabetes.

Core claim

We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference.

What carries the argument

A robust estimator of the average treatment effect on the treated that adjusts for missing-at-random eligibility covariates and permits machine-learning estimation of nuisance functions at rates that preserve asymptotic normality.

If this is right

  • Patients with incomplete eligibility information can be retained in the analysis without selection bias.
  • Machine-learning methods can be plugged in for nuisance functions without invalidating inference.
  • Valid confidence intervals are available for the treatment effect defined in the eligible population.
  • The same framework applies directly to observational comparisons of bariatric surgeries for long-term weight and glycemic control.
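
The confidence-interval point can be made concrete with a minimal sketch. It assumes the analyst already holds one estimated pseudo-outcome per patient (an uncentered influence-function contribution whose sample mean is the point estimate); the function name `wald_ci` and that input convention are illustrative assumptions, not the paper's code.

```python
import numpy as np
from statistics import NormalDist


def wald_ci(pseudo_outcomes, level=0.95):
    """Wald interval from per-subject pseudo-outcomes (hypothetical helper).

    Assumes the point estimate is the sample mean of the pseudo-outcomes and
    that the estimator is asymptotically linear, so the standard error is the
    sample standard deviation divided by sqrt(n).
    """
    phi = np.asarray(pseudo_outcomes, dtype=float)
    n = phi.size
    estimate = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)
    # Two-sided standard normal quantile for the requested coverage level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate, estimate - z * se, estimate + z * se
```

The interval is only as good as the asymptotic normality behind it, which is why the nuisance convergence rates matter.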

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar missing-data adjustments could be useful in other observational studies that rely on baseline covariates recorded in electronic records.
  • The approach might be extended to handle missingness in time-varying eligibility criteria or in survival outcomes.
  • Testing the method on data sets with deliberately introduced missingness patterns would quantify finite-sample performance.

Load-bearing premise

Eligibility-defining covariates are missing at random given the observed data, and the machine-learning estimators for the nuisance functions converge fast enough to keep the overall estimator asymptotically normal.
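
In symbols, the premise can be sketched as follows. The notation is an assumption made for this page: R and E follow the figure captions (ascertainment and eligibility status), while A, Y(a), and O are placeholders, and the paper's exact conditioning set may differ.

```latex
% Assumed notation (not taken verbatim from the paper):
%   A = treatment indicator, Y(a) = potential outcome under treatment a,
%   E = eligibility status, R = 1 if eligibility is ascertained,
%   O = the data observed on every patient regardless of R.
\theta = \mathbb{E}\left[\, Y(1) - Y(0) \mid A = 1,\ E = 1 \,\right]
\qquad\text{(ATT in the eligible population)}

R \perp\!\!\!\perp E \mid O
\qquad\text{(eligibility missing at random)}
```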

What would settle it

Simulate data with a known true average treatment effect on the treated, impose missingness at random on eligibility covariates, and check whether the estimator recovers the true effect within sampling variability.
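
A minimal sketch of such a check, under loudly stated assumptions: the data-generating process, the function name `run_check`, and the use of known (oracle) treatment and ascertainment probabilities are all invented for illustration. The weighting estimator below is a simple inverse-probability stand-in, not the paper's influence-function-based estimator.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_check(n=200_000, seed=0):
    """Compare complete-case and MAR-adjusted ATT estimates to a known truth."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                  # always-observed covariate
    e = rng.binomial(1, expit(0.5 + x))     # true eligibility status E
    p_a = expit(-0.2 + 0.8 * x)             # treatment propensity (known here)
    a = rng.binomial(1, p_a)
    tau = 1.0 + x                           # heterogeneous treatment effect
    y = x + a * tau + rng.normal(size=n)
    p_r = expit(-0.5 + 1.5 * x)             # ascertainment depends on x only: MAR
    r = rng.binomial(1, p_r)

    # Known truth: mean effect among treated, eligible subjects.
    truth = tau[(a == 1) & (e == 1)].mean()

    odds = p_a / (1.0 - p_a)                # ATT weights for controls

    def att(weights):
        w1 = weights * a
        w0 = weights * (1 - a) * odds
        return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

    # r * e is nonzero only when eligibility was ascertained, so neither
    # estimator uses E for subjects with R = 0.
    complete_case = att(r * e)              # drop subjects with missing E
    mar_adjusted = att(r * e / p_r)         # reweight by 1 / P(R = 1 | x)
    return {"truth": truth, "complete_case": complete_case,
            "mar_adjusted": mar_adjusted}
```

In this design ascertainment favors patients with larger effects, so the complete-case estimate should drift above the truth while the reweighted estimate should stay close to it. A fuller check would replace the oracle probabilities with fitted ones and repeat over many seeds.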

Figures

Figures reproduced from arXiv: 2504.16230 by Alexander W. Levis, Catherine Lee, David Arterburn, Heidi Fischer, Luke Benz, Rajarshi Mukherjee, Rui Wang, Sebastien Haneuse, Susan M. Shortreed.

Figure 1. Hemoglobin A1c % (dark red dots) and diabetic medication usage for six patients undergoing bariatric surgery, information that establishes T2DM status, and thus study eligibility, in the 24 months prior to surgery. A1c measurements are shown in relation to a cutoff of 6.5%, the typical clinical cutoff for T2DM. For medications, points indicate the start of a prescription while shaded bars indicate duration. …
Figure 2. A) Joint distribution of eligibility ascertainment (R) and status (E) across 40 different possible ways to operationalize the study eligibility criteria. B) Distribution of (n_{11}, n_{10}), where n_{re} denotes the number of ways of operationalizing the study eligibility criteria under which a subject has R = r, E = e. A similar figure for the remission outcome is available in the Supplementary Materials.
Figure 3. Distributions of time relative to the date of surgery at which the most recent BMI or A1c measure was collected.
Figure 4. Distributions of select nuisance function estimates from \hat{\theta}_{EIF} related to ascertainment (\hat{\eta}) and eligibility (\hat{\varepsilon}) for the relative weight change outcome. An analogous figure for T2DM remission is available in the Supplementary Materials. Results for BMI lookback of 3 months are similar to those of 1 month BMI lookback, and omitted for space considerations.
Figure 5. Point estimates and 95% confidence intervals for four estimators of the average treatment effect of bariatric surgery type on eligible RYGB patients. Estimates are presented for difference in % weight change and diabetes remission rate 3 years post surgery. Results for BMI lookback of 3 months are similar to those of 1 month BMI lookback, and omitted for space considerations.

Original abstract

Missingness in variables that define study eligibility criteria is a seldom addressed challenge in electronic health record (EHR)-based settings. It is typically the case that patients with incomplete eligibility information are excluded from analysis without consideration of (implicit) assumptions that are being made, leaving study conclusions subject to potential selection bias. In an effort to ascertain eligibility for more patients, researchers may look back further in time prior to study baseline, and in using outdated values of eligibility-defining covariates may inappropriately be including individuals who, unbeknownst to the researcher, fail to meet eligibility at baseline. To the best of our knowledge, however, very little work has been done to mitigate these concerns. We propose a robust and efficient estimator of the causal average treatment effect on the treated, defined in the study eligible population, in cohort studies where eligibility-defining covariates are missing at random. The approach facilitates the use of flexible machine-learning strategies for component nuisance functions while maintaining appropriate convergence rates for valid asymptotic inference. This method is directly motivated by, and applied throughout to EHR data from Kaiser Permanente to analyze differences between two common bariatric surgical interventions for long-term weight and glycemic outcomes among a cohort of severely obese patients with type II diabetes mellitus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a robust and efficient estimator for the causal average treatment effect on the treated (ATT), defined in the study-eligible population, for cohort studies using electronic health record (EHR) data in which eligibility-defining covariates are missing at random. The method allows flexible machine-learning estimation of nuisance functions while ensuring the convergence rates needed for valid asymptotic inference. It is motivated by, and applied to, Kaiser Permanente data comparing two bariatric surgical interventions in severely obese patients with type II diabetes.

Significance. If the theoretical results and supporting derivations hold, this addresses a practical gap in EHR-based causal inference by enabling inclusion of patients with incomplete eligibility data without selection bias under MAR. The integration of semiparametric efficiency theory with modern ML nuisance estimation is a clear strength, as is the direct application to real Kaiser Permanente bariatric surgery data for weight and glycemic outcomes. This could support more inclusive observational analyses in health services research.

major comments (2)
  1. [Identification and Assumptions] The identification of the eligible-population ATT relies on the MAR assumption for eligibility covariates conditional on observed data (as noted in the abstract and motivating setup). The paper should include a dedicated discussion or sensitivity analysis for plausible violations in EHR contexts, where missingness may correlate with unmeasured factors influencing treatment assignment or outcomes; this is load-bearing for recovering the correct target subpopulation even if nuisance convergence rates hold.
  2. [Theoretical Properties] The abstract states that the estimator maintains appropriate convergence rates for asymptotic inference under flexible ML strategies, but the full manuscript should provide explicit rate conditions (e.g., n^{-1/4} or faster) and verification steps for the specific nuisance functions involved in the reweighting or imputation for missing eligibility covariates.
minor comments (2)
  1. [Application to Kaiser Permanente Data] In the application section, report the proportion of patients with missing eligibility covariates and describe how the observed-data likelihood is constructed to recover eligible-population quantities.
  2. [Notation and Setup] Clarify notation for the observed versus full-data quantities when eligibility covariates are missing to avoid ambiguity in the estimator definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We appreciate the recognition of the practical relevance of our doubly robust estimator for the eligible-population ATT under missing eligibility covariates in EHR data. We address the major comments below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Identification and Assumptions] The identification of the eligible-population ATT relies on the MAR assumption for eligibility covariates conditional on observed data (as noted in the abstract and motivating setup). The paper should include a dedicated discussion or sensitivity analysis for plausible violations in EHR contexts, where missingness may correlate with unmeasured factors influencing treatment assignment or outcomes; this is load-bearing for recovering the correct target subpopulation even if nuisance convergence rates hold.

    Authors: We agree that the MAR assumption is foundational to identification and merits explicit discussion in EHR settings. In the revised manuscript, we will add a new subsection in the Discussion that elaborates on the plausibility of MAR for eligibility covariates (e.g., missingness due to irregular care-seeking patterns that may be independent of unmeasured confounders given observed data). We will also include a sensitivity analysis, implemented via a simulation study that introduces controlled violations of MAR and reports the resulting bias in ATT estimates, to illustrate robustness and limitations. revision: yes

  2. Referee: [Theoretical Properties] The abstract states that the estimator maintains appropriate convergence rates for asymptotic inference under flexible ML strategies, but the full manuscript should provide explicit rate conditions (e.g., n^{-1/4} or faster) and verification steps for the specific nuisance functions involved in the reweighting or imputation for missing eligibility covariates.

    Authors: We thank the referee for this suggestion. The current manuscript states the general rate conditions required for the asymptotic normality result in Theorem 1 (nuisance estimators converging faster than n^{-1/4} so that the cross-term remainder is o_p(n^{-1/2})), but we will make these conditions more explicit in the revised version. We will add a dedicated remark specifying the required rates for each nuisance function (propensity score, outcome regression, and missingness model) and include brief verification guidance, such as references to known convergence rates for common ML methods (e.g., random forests or neural networks) under standard regularity conditions. revision: yes
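
The rate condition discussed in this exchange can be sketched in generic form. This is the standard second-order remainder argument for doubly robust type estimators, written with placeholder nuisance functions \eta_1 and \eta_2 and a placeholder influence function \varphi; it is not a restatement of the paper's Theorem 1.

```latex
% Generic form; \eta_1, \eta_2, and \varphi are placeholders, not the paper's notation.
\hat\theta - \theta
  = \frac{1}{n}\sum_{i=1}^{n} \varphi(O_i) + R_n + o_p(n^{-1/2}),
\qquad
|R_n| \lesssim \|\hat\eta_1 - \eta_1\|_{2}\,\|\hat\eta_2 - \eta_2\|_{2}

% If each factor is o_p(n^{-1/4}), the product is o_p(n^{-1/2}), so
\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d}
  N\bigl(0,\ \operatorname{Var}\{\varphi(O)\}\bigr)
```

The product structure is what lets slower-than-parametric machine-learning fits still deliver root-n inference: neither nuisance estimate has to converge at rate n^{-1/2} on its own.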

Circularity Check

0 steps flagged

Derivation follows standard semiparametric efficiency theory under MAR; no reduction of target functional to fitted inputs or self-citation chain.

The paper constructs its estimator for the ATT in the eligible subpopulation by combining the missing-at-random assumption on eligibility covariates with standard doubly robust or efficient influence function techniques from semiparametric theory. The target parameter is identified directly from the observed-data likelihood under the stated MAR condition, and nuisance functions are estimated at the required rates without the estimator being defined in terms of itself or a post-hoc fit. No load-bearing step reduces by the paper's own equations to a self-citation or to a quantity that is tautologically equal to its inputs. The approach is therefore self-contained against external benchmarks and receives a zero circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the missing-at-random assumption for eligibility covariates and standard regularity conditions for semiparametric estimators; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Eligibility-defining covariates are missing at random conditional on observed data.
    Invoked to allow unbiased estimation by incorporating patients with incomplete eligibility information rather than excluding them.
