Causal Inference with Missing Exposures and Missing Outcomes
Pith reviewed 2026-05-19 10:26 UTC · model grok-4.3
The pith
Causal effects with missing exposures and missing baseline outcomes can be identified using counterfactual strata effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that causal estimands can be defined on counterfactual strata to accommodate missing exposures and missingness on the baseline outcome, where the latter determines who belongs to the population of interest. These estimands are identified under standard missing-at-random and no-unmeasured-confounding assumptions, and estimation is demonstrated with TMLE and Super Learner in the alcohol-TB setting.
What carries the argument
Counterfactual Strata Effects: causal estimands in which the focus population is defined by potential values of the exposure and outcome that are themselves subject to missingness.
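One hedged way to write such an estimand in symbols is sketched below. The notation is assumed here for illustration and is not taken from the paper: $Y_0$ is baseline TB status, $A$ is alcohol use, $Y_1$ is TB status at follow-up, and $\delta_0 = 1$, $\delta_1 = 1$ denote hypothetical interventions ensuring that the baseline and follow-up outcomes are measured.

```latex
\psi(a) = \mathbb{E}\left[ Y_1^{(a,\,\delta_1 = 1)} \,\middle|\, Y_0^{(\delta_0 = 1)} = 0 \right],
\qquad
\text{effect} = \psi(1) - \psi(0)
```

In this sketch the conditioning event is itself a counterfactual quantity (being free of TB at baseline, had baseline status been measured), which is what separates it from conditioning on an observed subgroup.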
If this is right
- Under the stated assumptions, the effect of alcohol consumption on TB risk can be estimated without bias from missing exposure data, missing baseline risk status, or missing follow-up infection status.
- Causal models can be identified when missingness on the baseline outcome changes which individuals belong to the population of interest.
- Targeted maximum likelihood estimation combined with Super Learner yields practical estimates under the extended identification results.
- The framework directly addresses the combination of confounding, missing exposure, and dual missing outcomes observed in the Uganda TB study.
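To make the role of missing outcomes concrete, here is a minimal sketch in Python. It is not TMLE and not the authors' code: it uses plain stratified standardization (a substitution g-computation estimator) with one binary covariate, a fully observed exposure, and hypothetical counts. The function names and the data are assumptions made for illustration.

```python
from collections import defaultdict


def make_rows(w, a, n_event, n_nonevent, n_missing):
    """Build (covariate, exposure, observed_flag, outcome) tuples for one cell."""
    return (
        [(w, a, 1, 1)] * n_event
        + [(w, a, 1, 0)] * n_nonevent
        + [(w, a, 0, None)] * n_missing  # outcome not measured
    )


def complete_case_risk(rows, a):
    """Naive risk among people with exposure a and a measured outcome."""
    ys = [y for _, exposure, observed, y in rows if exposure == a and observed == 1]
    return sum(ys) / len(ys)


def standardized_risk(rows, a):
    """Average stratum-specific observed risks over the full covariate distribution."""
    n_stratum = defaultdict(int)   # everyone in stratum w, measured or not
    n_measured = defaultdict(int)  # exposure a with a measured outcome
    n_events = defaultdict(int)
    for w, exposure, observed, y in rows:
        n_stratum[w] += 1
        if exposure == a and observed == 1:
            n_measured[w] += 1
            n_events[w] += y
    total = sum(n_stratum.values())
    risk = 0.0
    for w, count in n_stratum.items():
        if n_measured[w] == 0:
            raise ValueError(f"positivity fails in stratum {w!r} for exposure {a!r}")
        risk += (count / total) * (n_events[w] / n_measured[w])
    return risk


# Hypothetical counts: outcomes go missing more often when w == 1.
rows = (
    make_rows(0, 1, 2, 6, 2) + make_rows(0, 0, 1, 9, 0)
    + make_rows(1, 1, 2, 2, 6) + make_rows(1, 0, 1, 4, 5)
)
rd_naive = complete_case_risk(rows, 1) - complete_case_risk(rows, 0)
rd_adjusted = standardized_risk(rows, 1) - standardized_risk(rows, 0)
print(f"complete-case RD: {rd_naive:.3f}, standardized RD: {rd_adjusted:.3f}")
```

In this toy data the complete-case risk difference (0.200) differs from the standardized one (0.225) because missingness depends on the covariate. A TMLE would add a targeting step and flexible nuisance estimation on top of this basic plug-in idea.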
Where Pith is reading between the lines
- The same strata-based approach could be adapted to other cohort studies where missing behavioral data and incomplete outcome ascertainment occur together.
- Extensions to time-varying exposures and outcomes with intermittent missingness would follow naturally from the identification strategy.
- Sensitivity analyses that vary the missingness model could quantify how much the conclusions depend on the missing-at-random assumption.
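A sensitivity analysis of the kind mentioned in the last point could look like the following sketch. It assumes a simple pattern-mixture style shift, in which the odds of the outcome among people with missing outcomes are a chosen multiple of the odds among those observed. The shift model, function names, and input numbers are all hypothetical and are not drawn from the paper.

```python
def shifted_risk(p_obs, odds_ratio):
    """Risk among the unobserved when their odds are odds_ratio times the observed odds."""
    odds = odds_ratio * p_obs / (1.0 - p_obs)
    return odds / (1.0 + odds)


def overall_risk(p_obs, frac_missing, odds_ratio):
    """Mix observed and assumed-unobserved risks by the fraction missing."""
    p_missing = shifted_risk(p_obs, odds_ratio)
    return (1.0 - frac_missing) * p_obs + frac_missing * p_missing


def sensitivity_sweep(arm_exposed, arm_unexposed, odds_ratios):
    """Risk difference for each assumed odds ratio, applied to the exposed arm only."""
    p1, m1 = arm_exposed
    p0, m0 = arm_unexposed
    baseline = overall_risk(p0, m0, 1.0)  # unexposed arm kept at missing-at-random
    return {orr: overall_risk(p1, m1, orr) - baseline for orr in odds_ratios}


# Hypothetical inputs: (observed risk, fraction with missing outcome) per arm.
sweep = sensitivity_sweep((0.2, 0.3), (0.1, 0.1), [0.5, 1.0, 2.0, 4.0])
for odds_ratio, rd in sweep.items():
    print(f"odds ratio {odds_ratio}: risk difference {rd:.3f}")
```

An odds ratio of 1 reproduces the missing-at-random answer, so the spread of risk differences across the sweep indicates how strongly a conclusion leans on that assumption.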
Load-bearing premise
Data are missing at random and there is no unmeasured confounding given the observed covariates.
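For orientation, these premises can be written out for the simpler and well-studied case in which only the follow-up outcome is subject to missingness. The notation is assumed for illustration ($W$ covariates, $A$ exposure, $\Delta_1$ the indicator that the outcome $Y_1$ was measured) and may differ from the paper's.

```latex
\begin{aligned}
&\text{No unmeasured confounding:} && Y_1^{(a,\,\delta_1=1)} \perp A \mid W \\
&\text{Missing at random:} && Y_1^{(a,\,\delta_1=1)} \perp \Delta_1 \mid A = a,\, W \\
&\text{Positivity:} && P(A = a,\, \Delta_1 = 1 \mid W) > 0 \\
&\text{Then:} && \mathbb{E}\left[ Y_1^{(a,\,\delta_1=1)} \right]
  = \mathbb{E}_W\left[ \mathbb{E}\left( Y_1 \mid A = a,\, \Delta_1 = 1,\, W \right) \right]
\end{aligned}
```

The positivity line is listed because the identification formula is only defined where the conditioning event has non-zero probability.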
What would settle it
Re-estimating the alcohol-TB effect in the SEARCH-TB data after altering the missingness mechanism to violate missing-at-random and observing whether the point estimate and confidence interval change beyond what sampling variability would explain.
Original abstract
Missing data are ubiquitous in public health research. When estimating causal effects, there are well-established methods to address bias due to missing outcomes. Commonly, causal estimands are defined under hypothetical interventions to "set" the exposure and to prevent missingness. We demonstrate how this framework can be extended to missing exposures. We further extend this framework to incorporate missingness on the baseline outcome, which induces missingness on the population of interest. To do so, we highlight the use of Counterfactual Strata Effects: causal estimands where the focus population is subject to missingness and/or impacted by the exposure. Our work is motivated by SEARCH-TB's investigation of the effect of alcohol consumption on the risk of incident tuberculosis (TB) infection in rural Uganda. This study posed several real-world challenges: confounding, missingness on the exposure (alcohol use), missingness on the baseline outcome (defining who was at-risk of TB and, thus, in the focus population), and missingness on the outcome at follow-up (capturing who acquired TB). We present a series of causal models and identification results to demonstrate the handling of missingness in these settings. We highlight the use of TMLE with Super Learner and the real-world consequences of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends causal inference methods to settings with missing exposures and missingness on baseline outcomes that define the population of interest. It introduces Counterfactual Strata Effects as the target estimands and provides identification results under MAR and conditional exchangeability assumptions. The framework is applied to the SEARCH-TB study on alcohol use and incident TB risk using TMLE with Super Learner, with emphasis on the real-world consequences of properly accounting for these missingness patterns.
Significance. If the identification results hold and the positivity conditions are satisfied, the work provides a coherent way to define and estimate causal effects when missingness affects both the exposure and the very definition of the target population. The use of TMLE with Super Learner and the concrete SEARCH-TB application are strengths that could make the approach useful for other public-health studies with similar missing-data structures.
major comments (2)
- §4 (Identification results for Counterfactual Strata Effects): The identification of the strata-specific effects under baseline-outcome missingness requires stratum-specific positivity (P(baseline observed, exposure level, outcome observed | covariates) > 0 within each observed-covariate pattern). The manuscript invokes standard MAR and no-unmeasured-confounding assumptions but does not report any empirical check, trimming, or sensitivity analysis for this condition in the SEARCH-TB data or the simulations; violation would render the TMLE targeting step unstable or biased even when the stated assumptions hold.
- §5 (Application and estimation): The paper claims that the approach correctly handles missingness on the population of interest, yet the reported results do not include diagnostics for effective sample size after stratification or for the performance of the Super Learner under the induced missingness mechanism; without these, it is difficult to assess whether the estimated effects are driven by extrapolation in sparse strata.
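The positivity check raised in the first comment can be sketched for discrete covariate patterns as follows. This is an empirical-proportion diagnostic under assumed names and hypothetical data, not the authors' procedure; with continuous covariates the probabilities would instead come from fitted models.

```python
from collections import Counter


def positivity_table(rows, levels=(0, 1)):
    """Empirical P(baseline observed, exposure = a, outcome observed | pattern)."""
    n_pattern = Counter()
    n_complete = Counter()
    for pattern, base_obs, exposure, out_obs in rows:
        n_pattern[pattern] += 1
        # A missing exposure is coded as None and never counts as complete.
        if base_obs == 1 and out_obs == 1 and exposure in levels:
            n_complete[(pattern, exposure)] += 1
    return {
        (pattern, a): n_complete[(pattern, a)] / n
        for pattern, n in n_pattern.items()
        for a in levels
    }


def flag_violations(table, bound):
    """Return the (pattern, exposure) cells whose probability falls below bound."""
    return sorted(key for key, prob in table.items() if prob < bound)


# Hypothetical rows: (covariate pattern, baseline observed, exposure, outcome observed).
rows = (
    [("young", 1, 1, 1)] * 8 + [("young", 1, 0, 1)] * 8 + [("young", 1, None, 1)] * 4
    + [("old", 1, 1, 1)] * 1 + [("old", 1, 0, 1)] * 10 + [("old", 1, 1, 0)] * 9
)
table = positivity_table(rows)
flagged = flag_violations(table, bound=0.1)
print("minimum probability:", min(table.values()))
print("cells below bound:", flagged)
```

Here only the exposed cell in the "old" pattern falls below the chosen bound of 0.1, which is the kind of sparse cell where inverse weights become large and a targeting step can become unstable.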
minor comments (2)
- The notation for the three missingness indicators and the counterfactual strata is introduced without a consolidated table; adding one would improve readability when comparing the different estimands.
- Several sentences in the introduction repeat the motivation from the abstract; tightening this overlap would reduce redundancy.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate additional diagnostics and discussion where appropriate.
- Referee: §4 (Identification results for Counterfactual Strata Effects): The identification of the strata-specific effects under baseline-outcome missingness requires stratum-specific positivity (P(baseline observed, exposure level, outcome observed | covariates) > 0 within each observed-covariate pattern). The manuscript invokes standard MAR and no-unmeasured-confounding assumptions but does not report any empirical check, trimming, or sensitivity analysis for this condition in the SEARCH-TB data or the simulations; violation would render the TMLE targeting step unstable or biased even when the stated assumptions hold.
  Authors: We thank the referee for emphasizing the critical role of the stratum-specific positivity assumption in the identification of Counterfactual Strata Effects. The manuscript explicitly lists the required positivity conditions alongside the MAR and conditional exchangeability assumptions. However, we did not include empirical assessments such as propensity score distributions within strata, trimming procedures, or sensitivity analyses for the SEARCH-TB data or the simulation studies. In the revision we will add a dedicated subsection on practical positivity diagnostics, including reporting of minimum estimated probabilities within observed covariate patterns and a brief sensitivity analysis exploring the impact of near-violations.
  Revision: yes
- Referee: §5 (Application and estimation): The paper claims that the approach correctly handles missingness on the population of interest, yet the reported results do not include diagnostics for effective sample size after stratification or for the performance of the Super Learner under the induced missingness mechanism; without these, it is difficult to assess whether the estimated effects are driven by extrapolation in sparse strata.
  Authors: We agree that reporting effective sample size after stratification and Super Learner performance metrics would strengthen the application section. The current manuscript presents the TMLE estimates with Super Learner but omits these specific diagnostics. We will add tables or text reporting the effective sample sizes for each counterfactual stratum in the SEARCH-TB analysis and include summaries of the cross-validated performance of the nuisance estimators (e.g., risk or R-squared) under the observed missingness patterns to help readers evaluate potential extrapolation.
  Revision: yes
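One common way to report an effective sample size is Kish's approximation, computed from inverse-probability weights. The sketch below assumes that definition, which the source does not confirm as the one the authors will use; the stratum labels and weights are hypothetical.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def ess_by_stratum(weights_by_stratum):
    """Apply the approximation separately within each stratum."""
    return {name: effective_sample_size(ws) for name, ws in weights_by_stratum.items()}


# Hypothetical inverse-probability weights for complete cases in two strata.
weights = {
    "exposed": [1.0, 1.0, 1.0, 9.0],    # one dominant weight
    "unexposed": [2.0, 2.0, 2.0, 2.0],  # equal weights
}
ess = ess_by_stratum(weights)
for name, value in ess.items():
    print(f"{name}: n = {len(weights[name])}, effective n = {value:.2f}")
```

With equal weights the effective sample size equals the number of complete cases (4 here), while a single dominant weight pulls it well below that (about 1.71 of 4), signalling that the estimate rests on very few observations.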
Circularity Check
No circularity: standard causal identification extended to missing data without self-referential reductions
The paper defines counterfactual strata effects as an extension of existing causal frameworks to handle missing exposures, baseline outcomes, and follow-up outcomes. Identification relies on standard MAR assumptions and conditional exchangeability given observed covariates, which are invoked explicitly rather than derived from the paper's own fitted quantities or equations. TMLE with Super Learner is applied as an established estimation procedure to the SEARCH-TB data; no central claim reduces by construction to a fitted parameter renamed as a prediction, a self-citation chain, or an ansatz smuggled in via prior work. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Missing at random conditional on observed covariates for exposures and outcomes
- domain assumption: No unmeasured confounding for the exposure-outcome relationship
invented entities (1)
- Counterfactual Strata Effects: no independent evidence