pith. sign in

arxiv: 2606.17232 · v1 · pith:EPWXCW6Inew · submitted 2026-06-15 · 📊 stat.ME

Semiparametric Mediation Analysis with Separately Observed Mediator and Outcome under Unmeasured Confounding

Pith reviewed 2026-06-27 02:16 UTC · model grok-4.3

classification 📊 stat.ME
keywords mediation analysisdata fusioninstrumental variablesunmeasured confoundingsemiparametric estimationnatural direct effectsnatural indirect effects
0
0 comments X

The pith

Shared instrumental variables identify natural direct and indirect effects from separate mediator and outcome samples under unmeasured confounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mediation analysis requires identifying how an exposure affects an outcome through a mediator, but standard methods fail when the mediator and outcome are observed in different samples. This paper introduces a data fusion approach that combines one dataset measuring the mediator and another measuring the outcome by using shared instrumental variables. The method remains valid even with unmeasured confounding between variables thanks to a no-interaction condition and handles differences across datasets with a latent alignment condition. It develops semiparametric estimators that are multiply robust and can achieve the efficiency bound. The framework is illustrated by estimating how much a genetic variant affects dementia risk through immune gene expression.

Core claim

Natural direct and indirect effects can be identified and estimated semiparametrically by fusing two data sources—one with the mediator and the other with the outcome—using shared instrumental variables, under a no-interaction condition that validates the approach despite unmeasured confounding and a latent alignment condition that accommodates shifts in covariates and exposure between sources. Two identification strategies are established, one assuming known valid IVs and another that learns them, along with influence-function-based estimators that possess multiple robustness properties.

What carries the argument

The data fusion framework leveraging shared instrumental variables to identify natural direct and indirect effects from separately observed mediator and outcome data, under no-interaction and latent alignment conditions.

If this is right

  • Natural direct and indirect effects become identifiable without joint observation of mediator and outcome.
  • Estimators remain consistent under unmeasured confounding if the no-interaction condition holds.
  • Multiple robustness allows consistent estimation even if some nuisance models are misspecified.
  • The approach accommodates covariate and exposure shifts across data sources.
  • Semiparametric efficiency can be attained under suitable conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be extended to settings with more than two data sources or additional missingness patterns.
  • Sensitivity analyses for the no-interaction assumption would help assess robustness in applications.
  • Learning valid IVs from data opens possibilities for automated causal discovery in mediation contexts.
  • This framework might apply to other causal parameters like controlled direct effects in fragmented data environments.

Load-bearing premise

The no-interaction condition, meaning the mediator-outcome relationship does not vary with the exposure level, combined with latent alignment of the data sources.

What would settle it

Jointly observing mediator and outcome in an auxiliary sample and comparing the proposed estimator to the standard complete-data estimator; substantial discrepancy would indicate failure of the identifying assumptions.

Figures

Figures reproduced from arXiv: 2606.17232 by Ruoyu Wang, Sijia Li.

Figure 1
Figure 1. Figure 1: An illustration of the data generating process. Dotted polygons indicate the variables jointly [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Estimated natural indirect effects mediated by selected gene expressions. The estimated [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗
read the original abstract

Mediation analysis is widely used to disentangle causal pathways, yet in many real-world studies the mediator M and outcome Y are never jointly observed. This incompleteness breaks the standard identification strategy for natural direct and indirect effects. We introduce a novel data fusion framework that restores the identification by combining two incomplete data sources, one measuring $M$ and the other measuring Y. Our approach leverages shared instrumental variables (IVs) to circumvent the need to observe (M,Y) jointly, remains valid under unmeasured confounding via a no-interaction condition, and accommodates covariate and exposure shifts across data sources under a latent alignment condition. We establish two identification strategies, one for settings with a known set of valid IVs, and another for settings where valid IVs must be learned. We further develop semiparametric, influence-function-based estimators with multiple robustness properties, and propose an estimator that attains the semiparametric efficiency bound under appropriate conditions. We apply our framework to quantify the extent to which the effect of SNP rs610932 on dementia risk is mediated through immune-related gene-expression pathways.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a data fusion framework for identifying natural direct and indirect effects in mediation analysis when the mediator M and outcome Y are observed in separate datasets. It combines two incomplete data sources using shared instrumental variables, establishes identification under a no-interaction condition (to handle unmeasured confounding) and a latent alignment condition (to handle covariate/exposure shifts), presents two strategies (known IVs and learned IVs), derives multiple-robust influence-function estimators that attain the semiparametric efficiency bound under suitable conditions, and applies the method to assess mediation of SNP rs610932 effects on dementia risk via immune gene-expression pathways.

Significance. If the identification results and estimator properties hold, the work addresses a common practical barrier in mediation studies where joint (M,Y) observation is infeasible, such as in genetics or epidemiology. The multiple-robustness properties and efficiency bound attainment, together with explicit handling of data shifts and unmeasured confounding via the stated conditions, represent substantive contributions to semiparametric causal inference and data fusion methods.

major comments (2)
  1. [§3.1] §3.1 (known-IV identification): the no-interaction condition is load-bearing for validity under unmeasured confounding; the manuscript should provide a concrete sensitivity analysis or bounding strategy for violations, as this directly affects whether the natural effects remain identified when the condition fails even mildly.
  2. [§4] §4 (estimators): while multiple robustness is claimed, the explicit form of the influence functions and the precise conditions under which they achieve the efficiency bound are not fully cross-referenced to the identification results; this makes it difficult to verify that the estimators remain consistent when either the outcome or mediator model is misspecified.
minor comments (2)
  1. Notation for the two data sources (e.g., superscripts or subscripts on (M,Y)) should be introduced earlier and used consistently to avoid ambiguity when discussing shifts.
  2. The application section would benefit from reporting the estimated proportions mediated with confidence intervals and a brief discussion of how the learned-IV strategy was implemented in the genetic data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the identification and estimation strategy. We respond point-by-point below and indicate where revisions will be made to improve clarity and robustness.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (known-IV identification): the no-interaction condition is load-bearing for validity under unmeasured confounding; the manuscript should provide a concrete sensitivity analysis or bounding strategy for violations, as this directly affects whether the natural effects remain identified when the condition fails even mildly.

    Authors: We agree that the no-interaction condition is central to identification when unmeasured confounding is present. The current results are derived under this assumption (Assumption 3), which is standard in the literature on mediation with unmeasured confounding. To address potential violations, we will add a new subsection in Section 3.1 that discusses the implications of mild departures and provides a simple sensitivity analysis based on a parameterized deviation from the no-interaction condition, along with corresponding bounds on the natural effects. This will be supported by a small simulation study in the supplement. revision: yes

  2. Referee: [§4] §4 (estimators): while multiple robustness is claimed, the explicit form of the influence functions and the precise conditions under which they achieve the efficiency bound are not fully cross-referenced to the identification results; this makes it difficult to verify that the estimators remain consistent when either the outcome or mediator model is misspecified.

    Authors: The explicit influence functions are derived in Supplementary Material Section S.3, with the multiple-robustness property stated in Theorem 4 and the efficiency bound attainment in Corollary 1. We acknowledge that the main-text cross-references in Section 4 could be strengthened. We will revise Section 4 to include explicit pointers to the relevant identification results (Theorems 1–2), the IF expressions in S.3, and the precise conditions (correct specification of at least one of the outcome or mediator nuisance functions, plus the latent alignment condition) under which consistency and efficiency hold when one model is misspecified. revision: yes

Circularity Check

0 steps flagged

No significant circularity in identification or estimation strategy

full rationale

The paper's central claims rest on external domain conditions (no-interaction for unmeasured confounding, latent alignment for data-source shifts) and standard IV validity assumptions to enable identification when (M,Y) are never jointly observed. Identification strategies and influence-function estimators are developed from these premises without any quoted reduction of a 'prediction' or result to a fitted parameter by construction, and without load-bearing self-citations that substitute for independent justification. The framework therefore remains self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions required for identification when (M,Y) are never jointly observed; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption No-interaction condition between exposure and mediator
    Invoked to maintain validity under unmeasured confounding when mediator and outcome are separately observed.
  • domain assumption Latent alignment condition for covariate and exposure shifts across data sources
    Required to accommodate differences between the two incomplete datasets.

pith-pipeline@v0.9.1-grok · 5722 in / 1318 out tokens · 30486 ms · 2026-06-27T02:16:01.923974+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

161 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1]

    Computational Statistics & Data Analysis , volume=

    Calibrated regression estimation using empirical likelihood under data fusion , author=. Computational Statistics & Data Analysis , volume=. 2024 , publisher=

  2. [2]

    2013 , publisher=

    Linear Integral Equations , author=. 2013 , publisher=

  3. [3]

    Handbook of Econometrics , volume=

    The econometrics of data combination , author=. Handbook of Econometrics , volume=. 2007 , publisher=

  4. [4]

    arXiv preprint arXiv:2510.20404 , year=

    Identification and Debiased Learning of Causal Effects with General Instrumental Variables , author=. arXiv preprint arXiv:2510.20404 , year=

  5. [5]

    arXiv preprint arXiv:2510.14368 , year=

    Marginal causal effect estimation with continuous instrumental variables , author=. arXiv preprint arXiv:2510.14368 , year=

  6. [6]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2018 , publisher=

  7. [7]

    Journal of the Royal Statistical Society Series A: Statistics in Society , volume=

    Experimental designs for identifying causal mechanisms , author=. Journal of the Royal Statistical Society Series A: Statistics in Society , volume=. 2013 , publisher=

  8. [8]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Direct and indirect treatment effects--causal chains and mediation analysis with instrumental variables , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2017 , publisher=

  9. [9]

    Journal of the American Statistical Association , volume=

    Complier stochastic direct effects: identification and robust estimation , author=. Journal of the American Statistical Association , volume=. 2021 , publisher=

  10. [10]

    Journal of the American Statistical Association , volume=

    Aberrant effects of treatment , author=. Journal of the American Statistical Association , volume=. 2008 , publisher=

  11. [11]

    The Annals of Statistics , volume=

    Extremal quantile treatment effects , author=. The Annals of Statistics , volume=. 2018 , publisher=

  12. [12]

    irregular

    Identification and estimation of average partial effects in “irregular” correlated random coefficient panel data models , author=. Econometrica , volume=. 2012 , publisher=

  13. [13]

    Journal of Econometrics , volume=

    Unconditional effects of general policy interventions , author=. Journal of Econometrics , volume=. 2024 , publisher=

  14. [14]

    Biometrika , volume=

    Sensitivity analysis for certain permutation inferences in matched observational studies , author=. Biometrika , volume=. 1987 , publisher=

  15. [15]

    The Annals of Statistics , volume=

    Bounds on the conditional and average treatment effect with unobserved confounding factors , author=. The Annals of Statistics , volume=. 2022 , publisher=

  16. [16]

    Journal of the American Statistical Association , volume=

    Selecting and ranking individualized treatment rules with unmeasured confounding , author=. Journal of the American Statistical Association , volume=. 2021 , publisher=

  17. [17]

    arXiv preprint arXiv:1612.06850 , year=

    Extremal quantile regression: An overview , author=. arXiv preprint arXiv:1612.06850 , year=

  18. [18]

    The Annals of Statistics , volume=

    Some asymptotic theory for the bootstrap , author=. The Annals of Statistics , volume=. 1981 , publisher=

  19. [19]

    Statistical Science , volume=

    An introduction to proximal causal inference , author=. Statistical Science , volume=. 2024 , publisher=

  20. [20]

    Journal of the American Statistical Association , pages=

    Semiparametric proximal causal inference , author=. Journal of the American Statistical Association , pages=. 2023 , publisher=

  21. [21]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2020 , publisher=

  22. [22]

    Journal of Statistical Computation and Simulation , volume=

    Nonparametric estimation of extreme conditional quantiles , author=. Journal of Statistical Computation and Simulation , volume=. 2004 , publisher=

  23. [23]

    Econometric Theory , pages=

    Subsampling inference for nonparametric extremal conditional quantiles , author=. Econometric Theory , pages=. 2023 , publisher=

  24. [24]

    Nonparametric smoothing for extremal quantile regression with heavy tailed distributions

    Nonparametric smoothing for extremal quantile regression with heavy tailed distributions , author=. arXiv preprint arXiv:1903.03242 , year=

  25. [25]

    1996 , publisher=

    Weak Convergence and Empirical Processes With Applications to Statistics , author=. 1996 , publisher=

  26. [26]

    Econometric Theory , volume=

    Asymptotics for least absolute deviation regression estimators , author=. Econometric Theory , volume=. 1991 , publisher=

  27. [27]

    The Quarterly Journal of Economics , volume=

    Measuring the sensitivity of parameter estimates to estimation moments , author=. The Quarterly Journal of Economics , volume=. 2017 , publisher=

  28. [28]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , year=

    GENIUS-MAWII: For robust Mendelian randomization with many weak invalid instruments , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , year=

  29. [29]

    2008 , publisher=

    Asymptotic theory of statistics and probability , author=. 2008 , publisher=

  30. [30]

    Group Lasso for high dimensional sparse quantile regression models

    Group Lasso for high dimensional sparse quantile regression models , author=. arXiv preprint arXiv:1103.1458 , year=

  31. [31]

    Journal of Econometrics , volume=

    Smoothed quantile regression with large-scale inference , author=. Journal of Econometrics , volume=. 2023 , publisher=

  32. [32]

    Statistics in Medicine , volume=

    Parametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution , author=. Statistics in Medicine , volume=. 2007 , publisher=

  33. [33]

    Journal of Econometrics , volume=

    Estimating distributions of potential outcomes using local instrumental variables with an application to changes in college enrollment and wage inequality , author=. Journal of Econometrics , volume=. 2009 , publisher=

  34. [34]

    The Annals of Statistics , volume=

    Causal inference in partially linear structural equation models , author=. The Annals of Statistics , volume=. 2018 , publisher=

  35. [35]

    Biometrika , volume=

    Ancestor regression in linear structural equation models , author=. Biometrika , volume=. 2023 , publisher=

  36. [36]

    The Annals of Mathematical Statistics , volume=

    Estimation of the parameters of a single equation in a complete system of stochastic equations , author=. The Annals of Mathematical Statistics , volume=. 1949 , publisher=

  37. [37]

    International Journal of Epidemiology , volume=

    Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression , author=. International Journal of Epidemiology , volume=. 2015 , publisher=

  38. [38]

    Journal of The Royal Statistical Society Series B: Statistical Methodology , year=

    On the instrumental variable estimation with many weak and invalid instruments , author=. Journal of The Royal Statistical Society Series B: Statistical Methodology , year=

  39. [39]

    The Annals of Mathematical Statistics , volume=

    Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument , author=. The Annals of Mathematical Statistics , volume=. 1941 , publisher=

  40. [40]

    2024 , journal=

    Robustness Against Weak or Invalid Instruments: Exploring Nonlinear Treatment Models with Machine Learning , author=. 2024 , journal=

  41. [41]

    Biometrika , volume=

    Semiparametric efficient G-estimation with invalid instrumental variables , author=. Biometrika , volume=. 2023 , publisher=

  42. [42]

    Statistical Science , volume=

    The GENIUS approach to robust Mendelian randomization inference , author=. Statistical Science , volume=. 2021 , publisher=

  43. [43]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    The confidence interval method for selecting valid instrumental variables , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2021 , publisher=

  44. [44]

    Journal of the American Statistical Association , volume=

    Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization , author=. Journal of the American Statistical Association , volume=. 2016 , publisher=

  45. [45]

    The Annals of Statistics , volume=

    Extremal quantile regression , author=. The Annals of Statistics , volume=. 2005 , publisher=

  46. [46]

    2019 , publisher=

    High-dimensional statistics: A non-asymptotic viewpoint , author=. 2019 , publisher=

  47. [47]

    The Annals of Statistics , volume=

    Causal discovery in heavy-tailed models , author=. The Annals of Statistics , volume=. 2021 , publisher=

  48. [48]

    Journal of Econometrics , volume=

    Extremal quantile regressions for selection models and the black--white wage gap , author=. Journal of Econometrics , volume=. 2018 , publisher=

  49. [49]

    Econometric Theory , volume=

    Another look at the identification at infinity of sample selection models , author=. Econometric Theory , volume=. 2013 , publisher=

  50. [50]

    Proceedings of the National Academy of Sciences , volume=

    Quantitative trait analysis in sequencing studies under trait-dependent sampling , author=. Proceedings of the National Academy of Sciences , volume=. 2013 , publisher=

  51. [51]

    Journal of Econometrics , volume=

    Endogenous selection or treatment model estimation , author=. Journal of Econometrics , volume=. 2007 , publisher=

  52. [52]

    Econometrica , pages=

    Semiparametric latent variable model estimation with endogenous or mismeasured regressors , author=. Econometrica , pages=. 1998 , publisher=

  53. [53]

    Econometrica , volume=

    Robust confidence intervals for average treatment effects under limited overlap , author=. Econometrica , volume=. 2017 , publisher=

  54. [54]

    2011 , publisher=

    Econometrics , author=. 2011 , publisher=

  55. [55]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2018 , publisher=

  56. [56]

    1986 , publisher=

    Approximation of Functions , author=. 1986 , publisher=

  57. [57]

    Handbook of Econometrics , volume=

    Large sample sieve estimation of semi-nonparametric models , author=. Handbook of Econometrics , volume=. 2007 , publisher=

  58. [58]

    Journal of Machine Learning Research , volume=

    High-Dimensional Inference for Generalized Linear Models with Hidden Confounding , author=. Journal of Machine Learning Research , volume=

  59. [59]

    The Annals of Statistics , volume=

    Doubly debiased lasso: High-dimensional inference under hidden confounding , author=. The Annals of Statistics , volume=. 2022 , publisher=

  60. [60]

    arXiv preprint arXiv:2304.01098 , year=

    The synthetic instrument: From sparse association to sparse causation , author=. arXiv preprint arXiv:2304.01098 , year=

  61. [61]

    The Annals of Statistics , pages=

    Confounder Adjustment in Multiple Hypothesis Testing , author=. The Annals of Statistics , pages=. 2017 , publisher=

  62. [62]

    Statistical Science , volume=

    Instrumental Variable Estimation with a Stochastic Monotonicity Assumption , author=. Statistical Science , volume=

  63. [63]

    Biometrika , volume=

    Identifying causal effects with proxy variables of an unmeasured confounder , author=. Biometrika , volume=. 2018 , publisher=

  64. [64]

    Journal of Econometrics , volume=

    Using invalid instruments on purpose: Focused moment selection and averaging for GMM , author=. Journal of Econometrics , volume=. 2016 , publisher=

  65. [65]

    Econometrica , volume=

    Automobile Prices in Market Equilibrium , author=. Econometrica , volume=

  66. [66]

    The Annals of Statistics , volume=

    Limiting distributions for L1 regression estimators under general conditions , author=. The Annals of Statistics , volume=

  67. [67]

    The Annals of Statistics , volume=

    Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions , author=. The Annals of Statistics , volume=

  68. [68]

    Journal of the American Statistical Association , pages=

    Identifying effects of multiple treatments in the presence of unmeasured confounding , author=. Journal of the American Statistical Association , pages=. 2022 , publisher=

  69. [69]

    Econometrica , volume=

    Instrumental variable estimation of nonparametric models , author=. Econometrica , volume=. 2003 , publisher=

  70. [70]

    Econometrica , pages=

    The estimation of economic relationships using instrumental variables , author=. Econometrica , pages=. 1958 , publisher=

  71. [71]

    Statistical Science , volume=

    Instrumental Variables: An Econometrician’s Perspective , author=. Statistical Science , volume=

  72. [72]

    Econometrica , pages=

    Efficient instrumental variables estimation of nonlinear models , author=. Econometrica , pages=. 1990 , publisher=

  73. [73]

    Journal of Economic Perspectives , volume=

    Instrumental variables and the search for identification: From supply and demand to natural experiments , author=. Journal of Economic Perspectives , volume=

  74. [74]

    1928 , publisher=

    Tariff on animal and vegetable oils , author=. 1928 , publisher=

  75. [75]

    Biometrika , volume=

    Efficient estimation under data fusion , author=. Biometrika , volume=. 2023 , publisher=

  76. [76]

    1, smoking, and lung cancer: an assessment of mediation and interaction , author=

    Genetic variants on 15q25. 1, smoking, and lung cancer: an assessment of mediation and interaction , author=. American journal of epidemiology , volume=. 2012 , publisher=

  77. [77]

    American journal of epidemiology , volume=

    Causal mediation analysis with observational data: considerations and illustration examining mechanisms linking neighborhood poverty to adolescent substance use , author=. American journal of epidemiology , volume=. 2019 , publisher=

  78. [78]

    Epidemiology , volume=

    Identifiability and exchangeability for direct and indirect effects , author=. Epidemiology , volume=. 1992 , publisher=

  79. [79]

    Probabilistic and causal inference: the works of Judea Pearl , pages=

    Direct and indirect effects , author=. Probabilistic and causal inference: the works of Judea Pearl , pages=

  80. [80]

    Analysis of mediating variables in prevention and intervention research In C

    MacKinnon, DP , journal=. Analysis of mediating variables in prevention and intervention research In C

Showing first 80 references.