Semiparametric Causal Mediation Analysis for Linear Models with Non-Gaussian Errors: Applications to Drug Treatment and Social Program Evaluation

Mijeong Kim

arxiv: 2604.08346 · v1 · submitted 2026-04-09 · 📊 stat.ME


Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 📊 stat.ME

keywords causal mediation analysis · semiparametric estimation · non-Gaussian errors · linear models · direct and indirect effects · treatment evaluation · social program evaluation


The pith

Semiparametric methods deliver more precise estimates of direct and indirect effects in linear mediation models with non-Gaussian errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a semiparametric framework for causal mediation analysis in linear models that accommodates errors that are not normally distributed. It uses efficient estimation techniques and stacked equations to build confidence intervals without relying on Gaussian assumptions. Simulations across different error distributions show lower root mean squared error and shorter intervals than ordinary least squares, with bigger improvements when errors are skewed or mixed. Real-world applications to drug treatment and job training data reveal sharper estimates and the ability to detect mediation effects that standard methods miss.

Core claim

We developed a semiparametric causal mediation framework for linear models allowing possibly non-Gaussian errors, covering both standard models and models with treatment-mediator interaction. The method combines semiparametric efficient regression estimation, a reproducible multi-start fitting algorithm, and stacked estimating equations for confidence-interval construction. Across simulations and in the uis and jobs datasets, this yields reduced root mean squared error, shorter confidence intervals, and detection of significant effects where OLS does not.

What carries the argument

The semiparametric efficient regression estimator integrated with stacked estimating equations and a multi-start algorithm for stable fitting.

If this is right

The framework allows reliable decomposition of treatment effects into direct and indirect parts even when outcome errors deviate from normality.
Under treatment-mediator interaction, it provides treatment-specific effect estimates with improved precision.
Applied researchers can obtain shorter confidence intervals and higher power to detect mediation in non-Gaussian settings.
Policy and clinical decisions based on indirect effects may change when using this method instead of OLS in relevant data.
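The decomposition into direct and indirect parts can be made concrete with a minimal sketch. It assumes the textbook two-equation linear mediation model with a binary treatment, no covariates, and the usual identification assumptions; the function name `decompose_effects` and the coefficient names `a0`, `a1`, `b1`, `b2`, `b3` are hypothetical notation chosen here, not the paper's.

```python
def decompose_effects(a0, a1, b1, b2, b3=0.0):
    """Model-implied effects for the assumed linear mediation model
         M = a0 + a1*T + e_M
         Y = b0 + b1*T + b2*M + b3*T*M + e_Y
    with binary treatment T and mean-zero errors.
    """
    effects = {}
    for t in (0, 1):
        effects[t] = {
            # Effect of moving M from M(0) to M(1) while holding T = t.
            "indirect": a1 * (b2 + b3 * t),
            # Effect of moving T from 0 to 1 while holding M at M(t).
            "direct": b1 + b3 * (a0 + a1 * t),
        }
    # Total effect: indirect(1) + direct(0), equivalently indirect(0) + direct(1).
    total = effects[1]["indirect"] + effects[0]["direct"]
    return effects, total


effects, total = decompose_effects(a0=1.0, a1=0.5, b1=0.3, b2=0.4, b3=0.2)
print(effects, total)
```

With `b3 = 0` the two treatment levels give identical effects, which is the standard model without interaction; a nonzero `b3` is what makes the estimates treatment-specific.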

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If residual diagnostics indicate non-normality, switching to this semiparametric approach could alter conclusions about mediation.
The method's numerical stability via multi-start fitting makes it feasible for routine use in applied studies.
Extensions could incorporate additional covariates or nonlinear terms while retaining the semiparametric advantages.

Load-bearing premise

The underlying relationships must follow a linear structural model, and the estimating equations must be correctly specified for the estimator to perform as claimed.

What would settle it

Running the same simulations but finding that the semiparametric estimator has larger root mean squared error or wider intervals than OLS under non-Gaussian errors would contradict the performance results.
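A harness for that kind of check might look like the following minimal sketch. It only implements the OLS baseline for the indirect effect under skewed (centered exponential) errors; the semiparametric estimator itself is not reproduced here, and any competing estimator would be passed in through the hypothetical `estimator` argument. The sample size, coefficient values, and function names are illustrative assumptions, not the paper's simulation design.

```python
import math
import random


def ols_indirect(T, M, Y):
    """OLS product-of-coefficients estimate a1*b2 from M ~ T and Y ~ T + M (binary T)."""
    def group_mean(values, g):
        vals = [v for v, t in zip(values, T) if t == g]
        return sum(vals) / len(vals)

    m_bar = {g: group_mean(M, g) for g in (0, 1)}
    y_bar = {g: group_mean(Y, g) for g in (0, 1)}
    a1 = m_bar[1] - m_bar[0]  # slope of M on binary T
    # Frisch-Waugh: partial T out of M and Y (group-centering), then regress.
    m_res = [m - m_bar[t] for m, t in zip(M, T)]
    y_res = [y - y_bar[t] for y, t in zip(Y, T)]
    b2 = sum(a * b for a, b in zip(m_res, y_res)) / sum(a * a for a in m_res)
    return a1 * b2


def simulate_rmse(estimator, n=200, reps=200, a1=0.5, b1=0.3, b2=0.5, seed=0):
    """Monte Carlo RMSE of an indirect-effect estimator under skewed errors."""
    rng = random.Random(seed)
    skewed = lambda: rng.expovariate(1.0) - 1.0  # mean 0, variance 1, right-skewed
    sq_err = 0.0
    for _ in range(reps):
        T = [i % 2 for i in range(n)]  # balanced treatment assignment
        M = [a1 * t + skewed() for t in T]
        Y = [b1 * t + b2 * m + skewed() for t, m in zip(T, M)]
        sq_err += (estimator(T, M, Y) - a1 * b2) ** 2
    return math.sqrt(sq_err / reps)


print(simulate_rmse(ols_indirect))
```

Running `simulate_rmse` with a second estimator on the same seed gives a like-for-like comparison; a larger value for the semiparametric estimator than for `ols_indirect` would be the contradicting evidence described above.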

Figures

Figures reproduced from arXiv: 2604.08346 by Mijeong Kim.

**Figure 2.** Estimated treatment-specific mediation effects, direct effects, and total effects in the ...

Original abstract

Background: Mediation analysis is widely used to investigate how treatments and programs exert their effects, but standard ordinary least squares (OLS) inference can be unreliable when regression errors are non-Gaussian. In medical and public-health studies, this can affect whether indirect and direct effects are judged clinically or scientifically meaningful. Methods: We developed a semiparametric causal mediation framework for linear models allowing possibly non-Gaussian errors, covering both standard models and models with treatment-mediator interaction. The method combines semiparametric efficient regression estimation, a reproducible multi-start fitting algorithm for numerical stability, and stacked estimating equations for confidence-interval construction without requiring Gaussian error assumptions. Results: Across Gaussian, skewed, and mixture-error simulations, the semiparametric estimator reduced root mean squared error and confidence-interval length relative to OLS, with the largest gains under non-Gaussian errors. In a near-boundary power design, the OLS confidence interval achieved 18.3% empirical power, whereas the semiparametric confidence interval identified significant effects in all replications. In the uis drug-treatment data, it yielded sharper treatment-specific effect estimates under clear treatment-mediator interaction. In the jobs social-program data, the semiparametric analysis produced shorter confidence intervals for mediated effects and detected nonzero mediation where OLS did not. Conclusions: Semiparametric mediation analysis can improve the precision and reliability of effect decomposition in studies with non-Gaussian outcomes, offering a practical alternative to OLS when indirect and direct effects may inform clinical or policy decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a semiparametric estimator for linear mediation models that drops the Gaussian error assumption and reports clear gains over OLS in simulations and two applications, though the gains rest on the multi-start algorithm behaving reliably.


The main thing to know is that this work builds a semiparametric efficient estimator for causal mediation in linear models, with or without treatment-mediator interaction, and pairs it with stacked estimating equations so that inference does not require Gaussian errors. The author also supplies a reproducible multi-start fitting routine to handle the numerical side.

In simulations the estimator cuts root mean squared error and shortens confidence intervals relative to OLS, with the largest improvements under skewed or mixture errors, and in one near-boundary power setup empirical power rises from 18.3 percent under OLS to 100 percent. The uis drug-treatment example shows sharper treatment-specific effects when interaction is present, and the jobs social-program data yields shorter intervals plus detection of mediation that OLS misses. Those are the concrete results on offer. The approach is new in its specific combination of efficient regression and stacked equations for this mediation setting, and the simulation design covers the error distributions that matter in practice.

The soft spots are real but not fatal. Everything depends on the multi-start algorithm consistently locating the global solution to the efficient score equations; if local solutions appear once the nonparametric error-score component is estimated, the reported efficiency and coverage gains will not materialize in finite samples. The framework also stays inside the linear structural model, so any misspecification there carries through. The derivations look like standard semiparametric efficiency theory, but the stacked variance estimator needs the full algebra to confirm it is correctly centered without Gaussianity.

This is for applied statisticians and public-health researchers who already use linear mediation and want something more robust when residuals are clearly non-normal. A reader who runs those kinds of analyses would find the comparisons and the real-data illustrations useful. I would send it to referees because the practical problem is common, the proposed fix is concrete, and the simulation evidence is on point even if the numerical stability claim needs checking.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a semiparametric causal mediation framework for linear models with possibly non-Gaussian errors, covering both standard mediation and models with treatment-mediator interaction. It combines semiparametric efficient regression estimation, a reproducible multi-start fitting algorithm for numerical stability, and stacked estimating equations for confidence-interval construction that avoid Gaussian error assumptions. Simulations under Gaussian, skewed, and mixture-error distributions report lower root mean squared error and shorter confidence intervals relative to OLS, with the largest gains under non-Gaussian errors; a near-boundary power design shows OLS power of 18.3% versus 100% for the semiparametric estimator. Applications to the uis drug-treatment data yield sharper treatment-specific effects under interaction, while the jobs social-program data detect nonzero mediation where OLS does not.

Significance. If the numerical stability of the multi-start algorithm and the attainment of the semiparametric efficiency bound are confirmed, the approach could provide a practical and more reliable alternative to OLS for decomposing direct and indirect effects in medical and social-science studies where error distributions are frequently non-normal. The explicit provision of a multi-start procedure and stacked-equation inference without parametric error assumptions addresses real implementation barriers and could improve precision in effect estimation.

major comments (2)

[Methods (algorithm description)] The central performance claims rest on the multi-start algorithm consistently locating the semiparametric efficient estimator rather than local solutions. The manuscript provides insufficient detail on starting-value generation, number of starts, convergence criteria, and post-fit diagnostics (e.g., proportion of replications in which multiple starts converged to the same point or checks that the efficient score equations are satisfied). Without such verification, the reported RMSE reductions and the 18.3% versus 100% power difference cannot be confidently attributed to efficiency gains.
[Simulation study (near-boundary design)] The stacked variance estimator is evaluated at the fitted point; if the multi-start procedure fails to reach the global solution in some replications, both the point estimates and the reported confidence-interval lengths become unreliable. The manuscript should report diagnostics confirming that the efficient estimator was attained across all simulation replications before claiming superiority over OLS.

minor comments (2)

[Abstract] The phrase 'reproducible multi-start fitting algorithm' is used without specifying the number of starts or selection rule; adding these parameters would improve immediate reproducibility for readers.
[Notation] The definition of the efficient score and the stacked estimating equations would benefit from an explicit statement of the nonparametric component (e.g., estimated error score) to clarify how the method departs from standard OLS.
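For orientation, one classical form of the nonparametric component mentioned in the notation comment is the efficient score for the slope of a linear regression whose error is independent of the covariates and has an unknown density. The block below states that textbook result (see Bickel et al., 1993); the symbols $f$, $\varepsilon$, $X$, $\beta_0$ and $\beta$ are notation assumed here, and the text above does not confirm that the manuscript uses exactly this form.

```latex
% Assumed model: Y = \beta_0 + X^\top \beta + \varepsilon,
% with \varepsilon independent of X and error density f unknown.
\[
  \ell^{*}_{\beta}(X, Y)
  = -\frac{f'(\varepsilon)}{f(\varepsilon)}\,\bigl(X - \mathbb{E}[X]\bigr),
  \qquad
  \varepsilon = Y - \beta_0 - X^\top \beta .
\]
% Gaussian special case: -f'(\varepsilon)/f(\varepsilon) = \varepsilon/\sigma^2,
% so the score is proportional to \varepsilon\,(X - \mathbb{E}[X]),
% which gives the OLS normal equations for the slope.
```

The departure from OLS is therefore confined to the factor $-f'/f$: it is linear in the residual under Gaussian errors and has to be estimated from the data otherwise.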

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. The comments highlight important aspects of reproducibility and verification that strengthen the presentation of our semiparametric approach. We address each major comment below and have revised the manuscript to incorporate the requested details and diagnostics.


Referee: The central performance claims rest on the multi-start algorithm consistently locating the semiparametric efficient estimator rather than local solutions. The manuscript provides insufficient detail on starting-value generation, number of starts, convergence criteria, and post-fit diagnostics (e.g., proportion of replications in which multiple starts converged to the same point or checks that the efficient score equations are satisfied). Without such verification, the reported RMSE reductions and the 18.3% versus 100% power difference cannot be confidently attributed to efficiency gains.

Authors: We agree that additional documentation of the multi-start algorithm is required to support the performance claims. In the revised manuscript we have added a new subsection in the Methods section that specifies the reproducible procedure for generating starting values (OLS estimates together with multiple random perturbations drawn from a normal distribution centered at the OLS solution), the number of starts used, the convergence criterion based on the norm of the efficient score, and post-fit diagnostics. These diagnostics now include the proportion of starts that converge to the same point and explicit verification that the efficient score equations are satisfied at the selected solution. The same diagnostics are reported for all simulation settings, confirming that the reported reductions in RMSE and the power difference arise from the semiparametric estimator rather than from optimization failures.
revision: yes
Referee: The stacked variance estimator is evaluated at the fitted point; if the multi-start procedure fails to reach the global solution in some replications, both the point estimates and the reported confidence-interval lengths become unreliable. The manuscript should report diagnostics confirming that the efficient estimator was attained across all simulation replications before claiming superiority over OLS.

Authors: We acknowledge the referee's concern that the near-boundary power results could be affected by incomplete convergence. In the revised Simulation Study section we have inserted a dedicated paragraph presenting the requested diagnostics for the near-boundary design. These diagnostics confirm that the multi-start procedure reached the same solution satisfying the stacked estimating equations in every replication. With this verification now included, the comparison of empirical power (18.3% for OLS versus 100% for the semiparametric estimator) can be attributed to the efficiency properties of the estimator rather than to numerical artifacts.
revision: yes
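The starting-value scheme and diagnostics described in the first response can be sketched as follows. This is a generic illustration, not the author's routine: it minimizes a hypothetical toy objective by plain gradient descent instead of solving the efficient score equations, the center `[0.0]` stands in for the OLS fit, and the number of starts, perturbation scale, and tolerances are arbitrary assumptions.

```python
import math
import random


def local_solve(grad, start, step=0.02, tol=1e-8, max_iter=20000):
    """Gradient descent; stops once the gradient (score) norm is below tol."""
    x = list(start)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def multi_start(objective, grad, center, n_starts=20, scale=0.5, seed=1):
    rng = random.Random(seed)  # fixed seed keeps the starts reproducible
    # First start is the center itself (the OLS fit in the response above);
    # the remaining starts are normal perturbations around it.
    starts = [list(center)] + [
        [c + rng.gauss(0.0, scale) for c in center] for _ in range(n_starts - 1)
    ]
    fits = [local_solve(grad, s) for s in starts]
    best = min(fits, key=objective)
    # Diagnostic 1: share of starts that ended at the selected solution.
    same = sum(1 for f in fits if max(abs(a - b) for a, b in zip(f, best)) < 1e-4)
    return {
        "solution": best,
        "agreement": same / len(fits),
        # Diagnostic 2: the estimating equation is (numerically) satisfied.
        "score_norm": math.sqrt(sum(g * g for g in grad(best))),
    }


# Hypothetical toy objective with one global and one local minimum.
def toy_objective(x):
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]


def toy_grad(x):
    return [4.0 * x[0] * (x[0] ** 2 - 1.0) + 0.3]


result = multi_start(toy_objective, toy_grad, center=[0.0])
print(result)
```

An `agreement` value well below 1 signals that several distinct stationary points exist, which is the situation the referee asks the manuscript to rule out or document per replication.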

Circularity Check

0 steps flagged

No circularity: derivation relies on independent semiparametric efficiency theory and simulation benchmarks


The paper constructs a new semiparametric estimator for linear mediation models by combining efficient score equations (from semiparametric theory) with stacked estimating equations and a multi-start numerical solver. Performance claims (lower RMSE, shorter CIs) are evaluated via Monte Carlo simulations under controlled error distributions and two real datasets; these comparisons are external to the estimator definition itself. No step equates a fitted quantity to a 'prediction' of itself, renames a known result, or reduces the central efficiency claim to a self-citation chain. The multi-start algorithm is presented as a practical implementation detail rather than a load-bearing uniqueness theorem. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard linear structural assumptions for mediation and on semiparametric efficiency results from the broader statistical literature; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: The data-generating process follows a linear model with or without treatment-mediator interaction. Explicitly stated as the setting for the semiparametric framework.
standard math: Semiparametric efficient estimators and stacked estimating equations yield valid inference without Gaussian error assumptions. Invoked for confidence-interval construction.
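The second axiom refers to standard M-estimation theory. As a hedged reminder of what that machinery generally looks like, the block below gives the generic sandwich variance for stacked estimating equations and the delta-method interval for a smooth function of the parameters, such as an indirect effect. The symbols $\psi$, $\theta$, $O_i$, $A$, $B$ and $g$ are notation assumed here; the paper's specific estimating functions are not reproduced.

```latex
% Stacked estimating equations for the mediator and outcome models:
\[
  \sum_{i=1}^{n} \psi(O_i; \hat\theta) = 0,
  \qquad
  \psi = \begin{pmatrix} \psi_M \\ \psi_Y \end{pmatrix}.
\]
% Sandwich variance estimator:
\[
  \hat A = -\frac{1}{n} \sum_{i=1}^{n}
           \frac{\partial \psi(O_i; \hat\theta)}{\partial \theta^\top},
  \qquad
  \hat B = \frac{1}{n} \sum_{i=1}^{n}
           \psi(O_i; \hat\theta)\, \psi(O_i; \hat\theta)^\top,
  \qquad
  \hat V = \frac{1}{n}\, \hat A^{-1} \hat B\, \hat A^{-\top}.
\]
% Delta-method confidence interval for an effect g(\theta):
\[
  g(\hat\theta) \;\pm\; z_{1-\alpha/2}
  \sqrt{\nabla g(\hat\theta)^\top \hat V\, \nabla g(\hat\theta)} .
\]
```

Nothing in these expressions requires Gaussian errors; validity rests on the estimating equations having mean zero at the true parameter, which is the correct-specification premise noted earlier.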


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] Baron, R. M. and Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research. Journal of Personality and Social Psychology 51, 1173–1182.

[2] Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.

[3] Imai, K., Keele, L. and Yamamoto, T. (2010a). Identification, inference and sensitivity analysis for causal mediation effects. Statistical Science 25, 51–71.

[4] Imai, K., Keele, L. and Tingley, D. (2010b). A general approach to causal mediation analysis. Psychological Methods 15, 309–334.

[5] Hosmer, D. W. and Lemeshow, S. (1998). Applied Survival Analysis: Regression Modeling of Time to Event Data. New York: Wiley.

[6] Kim, M. (2023). Semiparametric efficient estimation for regression models with independent errors. Statistical Analysis and Data Mining 16, 623–636.

[7] Koenker, R. (2025). quantreg: Quantile Regression. R package version 6.1.

[8] MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G. and Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods 7, 83–104.

[9] Pearl, J. (2001). Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 411–420.

[10] Robins, J. M. and Greenland, S. (2003). Identifiability and exchangeability for direct and indirect effects. Epidemiology 3, 143–155.

[11] Valeri, L. and VanderWeele, T. J. (2013). Mediation analysis allowing for exposure–mediator interactions and causal interpretation: theoretical assumptions and implementation with SAS and SPSS macros. Psychological Methods 18, 137–150.

[12] Tingley, D., Yamamoto, T., Hirose, K., Keele, L. and Imai, K. (2014). Mediation: R package for causal mediation analysis. Journal of Statistical Software 59, 1–38.

[13] Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. New York: Springer.

[14] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.

[15] VanderWeele, T. J. (2009). Mediation and mechanism. European Journal of Epidemiology 24, 217–224.

[16] VanderWeele, T. J. (2010). Bias formulas for sensitivity analysis for direct and indirect effects. Epidemiology 21, 540–551.

[17] VanderWeele, T. J. (2016). Mediation analysis: a practitioner’s guide. Annual Review of Public Health 37, 17–32.

[18] VanderWeele, T. J. and Vansteelandt, S. (2009). Conceptual issues concerning mediation, interventions and composition. Statistics and Its Interface 2, 457–468.

[19] VanderWeele, T. J. and Vansteelandt, S. (2010). Odds ratios for mediation analysis for a dichotomous outcome. American Journal of Epidemiology 172, 1339–1348.

[20] VanderWeele, T. J. (2015). Explanation in Causal Inference. Oxford: Oxford University Press.
