
arxiv: 2506.03336 · v4 · pith:MJA5W4CWnew · submitted 2025-06-03 · 📊 stat.ME

Causal Inference with Missing Exposures and Missing Outcomes

Pith reviewed 2026-05-19 10:26 UTC · model grok-4.3

classification 📊 stat.ME
keywords: causal inference · missing data · counterfactual strata effects · targeted maximum likelihood estimation · tuberculosis · alcohol consumption · missing at random

The pith

Causal effects with missing exposures and baseline outcomes can be identified using counterfactual strata effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper extends causal inference methods to handle missing data on both exposures like alcohol consumption and outcomes like TB infection, while also addressing missing baseline information that defines the at-risk population. It introduces counterfactual strata effects as a way to define causal questions focused on groups affected by missingness or exposure. The approach is motivated by real challenges in the SEARCH-TB study in rural Uganda, where confounding and multiple layers of missing data complicate estimating alcohol's impact on incident TB. Under missing-at-random assumptions and no unmeasured confounding, identification results allow consistent estimation via targeted maximum likelihood estimation. This matters for public health research because incomplete data on behaviors and health events is routine, and the method provides a structured way to proceed without discarding cases or biasing results.

Core claim

The authors show that causal estimands can be defined on counterfactual strata to incorporate missing exposures and missingness on the baseline outcome that restricts the population of interest, yielding identification results under standard missing-at-random and no-unmeasured-confounding assumptions, with practical estimation demonstrated via TMLE and Super Learner in the alcohol-TB setting.

What carries the argument

Counterfactual Strata Effects: causal estimands in which the focus population is defined by potential values of the exposure and outcome that are themselves subject to missingness.
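
To make the idea of conditioning on a counterfactual variable concrete, one estimand of this kind can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: $A$ is the exposure, $Y_0$ and $Y_1$ are the baseline and follow-up outcomes, $\Delta_0$ and $\Delta_1$ indicate that each outcome was measured, and superscripts denote hypothetical interventions.

```latex
\psi \;=\; \mathbb{E}\!\left[\, Y_1^{a=1,\,\Delta_1=1} - Y_1^{a=0,\,\Delta_1=1} \;\middle|\; Y_0^{\Delta_0=1} = 0 \,\right]
```

The conditioning event $Y_0^{\Delta_0=1} = 0$ selects persons who would be outcome-free at baseline had the baseline outcome been measured for everyone, so the focus population is itself defined by a counterfactual. How the paper treats intervention on exposure measurement is not reproduced in this sketch.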

If this is right

  • The effect of alcohol consumption on TB risk can be estimated without bias from missing exposure data, missing baseline risk status, or missing follow-up infection status.
  • Causal models can be identified when missingness on the baseline outcome changes which individuals belong to the population of interest.
  • Targeted maximum likelihood estimation combined with Super Learner yields practical estimates under the extended identification results.
  • The framework directly addresses the combination of confounding, missing exposure, and dual missing outcomes observed in the Uganda TB study.
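
The estimation idea in the bullets above can be sketched on simulated data. This is a minimal toy, not the paper's analysis: the data-generating process, the function names `simulate` and `tmle_mean_y1`, and the use of simple stratum proportions in place of Super Learner are all assumptions made for illustration. It targets the mean outcome under exposure with complete outcome measurement, and starts from a deliberately misspecified initial fit so that the targeting step has something to correct.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def simulate(n, seed=0):
    """Toy data: W confounder, A exposure, delta = 1 if outcome Y was measured."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < expit(-0.5 + w))
        y = int(rng.random() < expit(-1.0 + a + w))
        # Missing at random: measurement depends on (A, W) only, not on Y.
        delta = int(rng.random() < expit(1.0 - 0.5 * a + w))
        rows.append((w, a, delta, y if delta else None))
    return rows


def tmle_mean_y1(rows):
    """Return (initial estimate, targeted estimate) of E[Y] under a=1, delta=1."""
    # Only rows with A=1 and a measured outcome inform the outcome regression.
    obs = [(w, y) for (w, a, d, y) in rows if a == 1 and d == 1]
    # Initial outcome fit: ignores W on purpose (misspecified).
    q0 = sum(y for _, y in obs) / len(obs)
    # g(w) = P(A=1, delta=1 | W=w), estimated by stratum proportions.
    g = {}
    for w in (0, 1):
        n_w = sum(1 for r in rows if r[0] == w)
        g[w] = sum(1 for (w_obs, _) in obs if w_obs == w) / n_w
    # Targeting step: logistic fluctuation with offset logit(q0) and
    # clever covariate H = 1 / g(W), solved for epsilon by Newton's method.
    eps = 0.0
    for _ in range(50):
        score = 0.0
        info = 0.0
        for w, y in obs:
            h = 1.0 / g[w]
            p = expit(logit(q0) + eps * h)
            score += h * (y - p)
            info += h * h * p * (1.0 - p)
        eps += score / info
    # Updated predictions under a=1, averaged over everyone's W.
    q_star = {w: expit(logit(q0) + eps / g[w]) for w in (0, 1)}
    psi = sum(q_star[r[0]] for r in rows) / len(rows)
    return q0, psi


# True value implied by the simulation: average of P(Y=1 | A=1, W) over W.
TRUTH = 0.5 * expit(0.0) + 0.5 * expit(1.0)
initial, targeted = tmle_mean_y1(simulate(20000))
print(f"truth={TRUTH:.3f}  initial={initial:.3f}  targeted={targeted:.3f}")
```

In a real analysis the stratum proportions would be replaced by flexible learners, and the same targeting step would be applied to each exposure level before taking a contrast.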

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same strata-based approach could be adapted to other cohort studies where missing behavioral data and incomplete outcome ascertainment occur together.
  • Extensions to time-varying exposures and outcomes with intermittent missingness would follow naturally from the identification strategy.
  • Sensitivity analyses that vary the missingness model could quantify how much the conclusions depend on the missing-at-random assumption.

Load-bearing premise

Data are missing at random and there is no unmeasured confounding given the observed covariates.
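
For orientation, the simplest version of this premise, with missingness on the follow-up outcome only, can be written out. This is the standard textbook form, given as a sketch in assumed notation ($W$ covariates, $A$ exposure, $\Delta$ an indicator that the outcome $Y$ was measured); the paper's results for missing exposures and missing baseline outcomes are not reproduced here.

```latex
\begin{align*}
&\text{No unmeasured confounding:} && Y^{a,\Delta=1} \perp A \mid W \\
&\text{Missing at random:} && Y^{a,\Delta=1} \perp \Delta \mid A = a,\, W \\
&\text{Positivity:} && P(A = a,\, \Delta = 1 \mid W) > 0 \\
&\text{Identification:} && \mathbb{E}\big[Y^{a,\Delta=1}\big] = \mathbb{E}_W\big[\, \mathbb{E}(Y \mid A = a,\, \Delta = 1,\, W) \,\big]
\end{align*}
```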

What would settle it

Re-estimating the alcohol-TB effect in the SEARCH-TB data after altering the missingness mechanism to violate missing-at-random and observing whether the point estimate and confidence interval change beyond what sampling variability would explain.
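
A toy analogue of this check can be run on simulated data, where the truth is known. The sketch below is an assumption-laden illustration, not the SEARCH-TB analysis: the data-generating process, the function name `adjusted_mean_y1`, and the sensitivity parameter `gamma` are all hypothetical. Setting `gamma = 0` gives missing at random; nonzero values let the chance of measuring the outcome depend on the outcome itself.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def adjusted_mean_y1(n, gamma, seed=1):
    """Covariate-adjusted estimate of E[Y] under a=1 from observed outcomes.

    Computes sum over w of P(W=w) * mean(Y | A=1, observed, W=w).
    gamma controls how strongly outcome measurement depends on Y itself.
    """
    rng = random.Random(seed)
    n_w = {0: 0, 1: 0}      # stratum sizes
    n_obs = {0: 0, 1: 0}    # exposed rows with a measured outcome
    sum_y = {0: 0, 1: 0}    # outcome totals among those rows
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < expit(-0.5 + w))
        y = int(rng.random() < expit(-1.0 + a + w))
        observed = rng.random() < expit(1.0 + gamma * y)
        n_w[w] += 1
        if a == 1 and observed:
            n_obs[w] += 1
            sum_y[w] += y
    return sum((n_w[w] / n) * (sum_y[w] / n_obs[w]) for w in (0, 1))


# True value implied by the simulation: average of P(Y=1 | A=1, W) over W.
TRUTH = 0.5 * expit(0.0) + 0.5 * expit(1.0)
for gamma in (0.0, -1.0, -2.0):
    est = adjusted_mean_y1(20000, gamma)
    print(f"gamma={gamma:+.1f}  estimate={est:.3f}  bias={est - TRUTH:+.3f}")
```

With real data the truth is unknown, so the informative quantity is how far the estimate moves across a plausible range of `gamma` relative to its confidence interval.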

Figures

Figures reproduced from arXiv: 2506.03336 by Atukunda Mucunguzi, Carina Marquez, Edwin D. Charlebois, Elijah Kakande, Florence Mwangwa, Kirsten E. Landsiedel, Laura B. Balzer, Moses R. Kamya, Rachel Abbott.

Figure 1: Directed acyclic graph (DAG) for a classic point-treatment problem with complete measurement of
Figure 2: To define causal effects when the exposure is subject to missingness, we now consider
Figure 3: DAG with missingness on the exposure and outcome:
Figure 4: DAG with missingness on the exposure, the baseline outcome, and the follow-up outcome:
Figure 5: Results from SEARCH-TB for the association of alcohol use on incident tuberculosis (TB) in
Original abstract

Missing data are ubiquitous in public health research. When estimating causal effects, there are well-established methods to address bias due to missing outcomes. Commonly, causal estimands are defined under hypothetical interventions to "set" the exposure and to prevent missingness. We demonstrate how this framework can be extended to missing exposures. We further extend this framework to incorporate missingness on the baseline outcome, which induces missingness on the population of interest (e.g., persons at-risk). To do so, we highlight Counterfactual Strata Effects, a general class of causal estimands where the focus population is subject to missingness and/or impacted by the exposure. They are termed such because the estimand involves conditioning on a counterfactual variable. For each setting, we present the causal model, relevant counterfactuals, causal estimand, and identification result. We demonstrate with a real-data example to investigate the effect of alcohol consumption on the risk of incident tuberculosis (TB) infection in rural Uganda. We highlight the use of TMLE with Super Learner for estimation and inference and discuss the practical consequences of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends causal inference methods to settings with missing exposures and missingness on baseline outcomes that define the population of interest. It introduces Counterfactual Strata Effects as the target estimands and provides identification results under MAR and conditional exchangeability assumptions. The framework is applied to the SEARCH-TB study on alcohol use and incident TB risk using TMLE with Super Learner, with emphasis on the real-world consequences of properly accounting for these missingness patterns.

Significance. If the identification results hold and the positivity conditions are satisfied, the work provides a coherent way to define and estimate causal effects when missingness affects both the exposure and the very definition of the target population. The use of TMLE with Super Learner and the concrete SEARCH-TB application are strengths that could make the approach useful for other public-health studies with similar missing-data structures.

major comments (2)
  1. [§4] §4 (Identification results for Counterfactual Strata Effects): The identification of the strata-specific effects under baseline-outcome missingness requires stratum-specific positivity (P(baseline observed, exposure level, outcome observed | covariates) > 0 within each observed-covariate pattern). The manuscript invokes standard MAR and no-unmeasured-confounding assumptions but does not report any empirical check, trimming, or sensitivity analysis for this condition in the SEARCH-TB data or the simulations; violation would render the TMLE targeting step unstable or biased even when the stated assumptions hold.
  2. [§5] §5 (Application and estimation): The paper claims that the approach correctly handles missingness on the population of interest, yet the reported results do not include diagnostics for effective sample size after stratification or for the performance of the Super Learner under the induced missingness mechanism; without these, it is difficult to assess whether the estimated effects are driven by extrapolation in sparse strata.
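
The kind of positivity check requested in the first major comment can be sketched as below. This is an illustrative helper under stated assumptions, not something the manuscript reports: the function name `positivity_summary`, the lower bound of 0.01, and the example probabilities are all hypothetical. The input stands for each person's estimated probability of having the baseline outcome observed, the exposure level of interest, and the follow-up outcome observed, given covariates.

```python
def positivity_summary(probs, lower=0.01):
    """Summarise estimated probabilities and truncate them from below.

    probs: estimated P(baseline observed, exposure level, outcome observed | W),
           one value per person.
    lower: hypothetical truncation bound; small values flag near-violations.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    below = [p for p in probs if p < lower]
    return {
        "min": min(probs),
        "n_below": len(below),
        "share_below": len(below) / len(probs),
        "truncated": [max(p, lower) for p in probs],
    }


# Made-up probabilities, one of which falls under the bound.
example = positivity_summary([0.50, 0.20, 0.004, 0.03])
print(example["min"], example["n_below"], example["share_below"])
```

Reporting the minimum and the share below the bound shows how much of the estimate rests on heavily up-weighted observations; the truncated values are what would feed the targeting step if truncation were used.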
minor comments (2)
  1. The notation for the three missingness indicators and the counterfactual strata is introduced without a consolidated table; adding one would improve readability when comparing the different estimands.
  2. Several sentences in the introduction repeat the motivation from the abstract; tightening this overlap would reduce redundancy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate additional diagnostics and discussion where appropriate.

Point-by-point responses
  1. Referee: [§4] §4 (Identification results for Counterfactual Strata Effects): The identification of the strata-specific effects under baseline-outcome missingness requires stratum-specific positivity (P(baseline observed, exposure level, outcome observed | covariates) > 0 within each observed-covariate pattern). The manuscript invokes standard MAR and no-unmeasured-confounding assumptions but does not report any empirical check, trimming, or sensitivity analysis for this condition in the SEARCH-TB data or the simulations; violation would render the TMLE targeting step unstable or biased even when the stated assumptions hold.

    Authors: We thank the referee for emphasizing the critical role of the stratum-specific positivity assumption in the identification of Counterfactual Strata Effects. The manuscript explicitly lists the required positivity conditions alongside the MAR and conditional exchangeability assumptions. However, we did not include empirical assessments such as propensity score distributions within strata, trimming procedures, or sensitivity analyses for the SEARCH-TB data or the simulation studies. In the revision we will add a dedicated subsection on practical positivity diagnostics, including reporting of minimum estimated probabilities within observed covariate patterns and a brief sensitivity analysis exploring the impact of near-violations. revision: yes

  2. Referee: [§5] §5 (Application and estimation): The paper claims that the approach correctly handles missingness on the population of interest, yet the reported results do not include diagnostics for effective sample size after stratification or for the performance of the Super Learner under the induced missingness mechanism; without these, it is difficult to assess whether the estimated effects are driven by extrapolation in sparse strata.

    Authors: We agree that reporting effective sample size after stratification and Super Learner performance metrics would strengthen the application section. The current manuscript presents the TMLE estimates with Super Learner but omits these specific diagnostics. We will add tables or text reporting the effective sample sizes for each counterfactual stratum in the SEARCH-TB analysis and include summaries of the cross-validated performance of the nuisance estimators (e.g., risk or R-squared) under the observed missingness patterns to help readers evaluate potential extrapolation. revision: yes
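
One common way to report an effective sample size for weighted data is Kish's formula, the squared sum of the weights divided by the sum of squared weights. The sketch below assumes inverse-probability-style weights and hypothetical stratum labels; the function names `effective_sample_size` and `ess_by_stratum` are illustrative, and the source does not state which diagnostic the authors would use.

```python
from collections import defaultdict


def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("weights must contain a nonzero value")
    return total * total / total_sq


def ess_by_stratum(weights, strata):
    """Effective sample size computed separately within each stratum label."""
    grouped = defaultdict(list)
    for w, s in zip(weights, strata):
        grouped[s].append(w)
    return {s: effective_sample_size(ws) for s, ws in grouped.items()}


# Made-up weights: stratum "b" contains one dominant weight.
print(ess_by_stratum([1.0, 1.0, 1.0, 3.0], ["a", "a", "b", "b"]))
```

An effective sample size far below the nominal stratum size indicates that a few observations dominate, which is the extrapolation concern raised in the referee's second comment.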

Circularity Check

0 steps flagged

No circularity: standard causal identification extended to missing data without self-referential reductions

full rationale

The paper defines counterfactual strata effects as an extension of existing causal frameworks to handle missing exposures, baseline outcomes, and follow-up outcomes. Identification relies on standard MAR assumptions and conditional exchangeability given observed covariates, which are invoked explicitly rather than derived from the paper's own fitted quantities or equations. TMLE with Super Learner is applied as an established estimation procedure to the SEARCH-TB data; no central claim reduces by construction to a fitted parameter renamed as a prediction, a self-citation chain, or an ansatz smuggled in via prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard causal assumptions and the new term Counterfactual Strata Effects.

axioms (2)
  • domain assumption: Missing at random conditional on observed covariates for exposures and outcomes
    Invoked when extending the framework to missing exposures and baseline missingness
  • domain assumption: No unmeasured confounding for the exposure-outcome relationship
    Standard assumption for causal identification in observational data
invented entities (1)
  • Counterfactual Strata Effects (no independent evidence)
    purpose: Causal estimands focused on populations subject to missingness or impacted by the exposure
    Introduced to handle missing baseline outcome that defines the focus population

