
arxiv: 2604.08507 · v1 · submitted 2026-04-09 · 📊 stat.ME · q-bio.QM · stat.AP

A Quasi-Regression Method for the Mediation Analysis of Zero-Inflated Single-Cell Data

Pith reviewed 2026-05-10 16:57 UTC · model grok-4.3

classification 📊 stat.ME · q-bio.QM · stat.AP
keywords mediation analysis, single-cell data, zero-inflated, quasi-regression, causal inference, indirect effects, gene regulation

The pith

QuasiMed performs mediation analysis on zero-inflated single-cell data by specifying only mean functions in a quasi-regression framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QuasiMed, a mediation framework for single-cell data, which differ from bulk data in carrying two signals: expression levels and the proportion of expressing cells. The framework has three steps: screening mediator candidates with penalized regression, estimating indirect effects through average expression and the proportion of expressing cells via quasi-regression, and testing hypotheses under multiplicity control. Because it models only the mean functions rather than full probability distributions, the method relaxes the strict distributional assumptions that limit other approaches. Simulations modeled on real data indicate high power, false discovery rate control, and computational speed. An application to ROSMAP single-cell data illustrates its use for uncovering mediating causal pathways in gene regulation.
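To make the two signals concrete, here is a minimal sketch on a made-up cells-by-genes count matrix. The variable names and the choice to average over expressing cells only are illustrative assumptions; the paper's exact definition of "average expression" is not given in the text above.

```python
import numpy as np

# Made-up counts for one subject: rows are cells, columns are genes.
counts = np.array([[0, 3, 0],
                   [0, 5, 1],
                   [2, 0, 0],
                   [0, 4, 0]])

expressing = counts > 0                      # which cells express each gene
prop_expressing = expressing.mean(axis=0)    # proportion of expressing cells per gene

# Average expression among expressing cells (assumes every gene has at least one).
mean_when_expressed = counts.sum(axis=0) / expressing.sum(axis=0)

# A bulk-style average over all cells mixes the two signals together.
mean_all_cells = counts.mean(axis=0)
```

In this toy example the all-cell average equals the product of the two single-cell summaries, which is one way to see why a bulk-style measurement cannot separate a change in expression level from a change in the fraction of expressing cells.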

Core claim

QuasiMed is a three-step mediation framework for single-cell data consisting of penalized regression screening of mediator candidates, quasi-regression estimation of indirect effects through the average expression and the proportion of expressing cells, and hypothesis testing with multiplicity control. The key benefit is that it specifies only the mean functions of the mediation models through a quasi-regression framework, thereby relaxing strict distributional assumptions.
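The testing step can be illustrated with a generic sketch. The text above does not say which test or multiplicity correction QuasiMed uses, so the joint-significance (max-P) rule and the Benjamini-Hochberg procedure below are stand-in assumptions, and all p-values are invented.

```python
def max_p(p_alpha, p_beta):
    """Joint-significance p-value per mediator: the larger of the two path p-values."""
    return [max(a, b) for a, b in zip(p_alpha, p_beta)]


def benjamini_hochberg(pvals, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose sorted p-value is below its threshold
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])


# Invented p-values for four hypothetical mediators.
p_alpha = [0.001, 0.20, 0.004, 0.03]   # exposure -> mediator path
p_beta = [0.002, 0.01, 0.020, 0.60]    # mediator -> outcome path

p_joint = max_p(p_alpha, p_beta)
selected = benjamini_hochberg(p_joint, q=0.05)
```

A mediator is flagged only when both of its paths look non-null, which is why the second and fourth hypothetical mediators drop out despite one small p-value each.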

What carries the argument

The quasi-regression framework that specifies only the mean functions of the mediation models for indirect effects on expression levels and proportions of expressing cells.
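As a rough illustration of what "specifying only the mean function" can look like, the sketch below solves log-link quasi-score equations on simulated zero-inflated counts without ever modeling the zero-inflation. This is a generic quasi-likelihood example, not the QuasiMed package's implementation; the function name, the simulated data, and the constant 30% zero-inflation rate are all assumptions chosen so that the mean stays log-linear.

```python
import numpy as np


def fit_log_mean(X, y, n_iter=50, tol=1e-10):
    """Solve X^T (y - exp(X b)) = 0 by Newton steps; only the mean is specified."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # assumes the first column of X is an intercept
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        hess = X.T @ (X * mu[:, None])
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
lam = np.exp(0.5 + 0.5 * x)        # mean among expressing cells
expressed = rng.random(n) > 0.3    # constant 30% structural zeros (assumption)
y = expressed * rng.poisson(lam)   # zero-inflated counts

X = np.column_stack([np.ones(n), x])
beta_hat = fit_log_mean(X, y)
```

Here E[y | x] = 0.7 * exp(0.5 + 0.5x), so the slope is still 0.5 and only the intercept shifts by log(0.7). The estimating equations recover that mean even though the data are not Poisson. If the zero-inflation rate depended on x in a way that broke the log-linear mean, the mean model itself would be misspecified and this argument would no longer apply.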

If this is right

  • High power and false discovery rate control in real-data-inspired simulations of zero-inflated counts.
  • Computational efficiency for high-dimensional single-cell datasets.
  • Ability to identify mediating causal pathways in applications to datasets such as ROSMAP single-cell data.
  • Estimation of indirect effects without specifying full distributions for both average expression and cell expression proportions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quasi-regression idea could apply to mediation analysis in other zero-inflated biological data types such as bulk RNA-seq with many non-expressed genes.
  • Extensions might add robustness checks for the screening step when the number of candidate mediators grows very large.
  • The method could support multi-omics integration by treating different data layers as sequential mediators.
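For context on the screening step mentioned in the list above, marginal screening in the spirit of sure independence screening can be sketched as follows. This is a simplified stand-in using Gaussian toy data and absolute marginal correlations; the function name, the cutoff of n / log(n) candidates, and the simulated mediators are assumptions, and the paper's screening also involves penalized regression that is not shown here.

```python
import numpy as np


def marginal_screen(M, y, d=None):
    """Keep the d columns of M with the largest absolute marginal correlation with y."""
    n, p = M.shape
    if d is None:
        d = int(n / np.log(n))   # a common default cutoff (assumption)
    d = min(d, p)
    sd = M.std(axis=0)
    sd[sd == 0] = 1.0            # all-constant columns get zero correlation
    Mc = (M - M.mean(axis=0)) / sd
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Mc.T @ yc) / n
    top = np.argsort(corr)[::-1][:d]
    return sorted(int(j) for j in top)


rng = np.random.default_rng(1)
n, p = 200, 1000
M = rng.normal(size=(n, p))                          # toy candidate mediators
y = 2.0 * M[:, 3] - 2.0 * M[:, 7] + rng.normal(size=n)

kept = marginal_screen(M, y)
```

With 1000 candidates and 200 samples, the screen retains 37 columns, including the two that truly drive the outcome in this toy setup. A robustness check of the kind suggested above could repeat the screen over resampled data and track how often each candidate survives.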

Load-bearing premise

The quasi-regression mean functions combined with the screening and estimation steps correctly capture indirect effects in zero-inflated single-cell data without hidden biases from the relaxed distributional assumptions.

What would settle it

A simulation study with known true indirect effects where QuasiMed produces biased estimates or loses power when the mean models match the data structure but the zero-inflation mechanism deviates from the quasi-regression specification.

Original abstract

Recent advances in single-cell technologies have advanced our understanding of gene regulation and cellular heterogeneity at single-cell resolution. Single-cell data contain both gene expression levels and the proportion of expressing cells, which makes them structurally different from bulk data. Currently, methodological work on causal mediation analysis for single-cell data remains limited and often requires specific distributional assumptions. To address this challenge, we present QuasiMed, a mediation framework specialized for single-cell data. Our proposed method comprises three steps, including (i) screening mediator candidates through penalized regression and marginal models (similar to sure independence screening), (ii) estimation of indirect effects through the average expression and the proportion of expressing cells, (iii) and hypothesis testing with multiplicity control. The key benefit of QuasiMed is that it specifies only the mean functions of the mediation models through a quasi-regression framework, thereby relaxing strict distributional assumptions. The method performance was evaluated through the real-data-inspired simulations, and demonstrated high power, false discovery rate control, and computational efficiency. Lastly, we applied QuasiMed to ROSMAP single-cell data to illustrate its potential to identify mediating causal pathways. R package is freely available on GitHub repository at https://github.com/sjahnn/QuasiMed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces QuasiMed, a three-step quasi-regression framework for causal mediation analysis of zero-inflated single-cell data. Step (i) screens mediator candidates via penalized regression and marginal models; step (ii) estimates natural indirect effects using mean functions for average gene expression and the proportion of expressing cells; step (iii) performs hypothesis testing with multiplicity control. The central claim is that specifying only mean functions relaxes strict distributional assumptions while still correctly identifying mediating pathways. Performance is assessed via real-data-inspired simulations (high power, FDR control, efficiency) and illustrated on ROSMAP single-cell data, with an accompanying R package.

Significance. If the mean-only quasi-regression estimators remain consistent for natural indirect effects, the work would provide a useful flexible alternative to fully parametric mediation methods for single-cell data, where zero-inflation and cellular heterogeneity make distributional assumptions difficult. Credit is due for the reproducible R package, simulation design inspired by real data, and the applied analysis on ROSMAP data demonstrating potential biological utility.

major comments (1)
  1. [Estimation step (ii)] Step (ii), estimation of indirect effects: the central claim that quasi-regression mean functions suffice to capture indirect effects is load-bearing. For natural indirect effects, identification requires E[Y(a, M(a'))], which equals the outcome mean function evaluated at E[M(a')] only under linearity or specific distributional conditions. With the nonlinear links typical for zero-inflated counts or proportions, Jensen's inequality implies that plugging in the mediator mean generally produces bias. The manuscript should state the precise identification formulas used, whether integration over the mediator distribution is performed, and under what conditions the mean-based procedure is consistent.
minor comments (2)
  1. [Abstract] The abstract asserts 'high power, false discovery rate control, and computational efficiency' from simulations but provides no quantitative metrics, baseline comparisons, or simulation design details (e.g., zero-inflation rates, sample sizes, or effect sizes). These should be summarized with tables or figures in the main text.
  2. Notation for the quasi-regression models (mean functions for expression level and expressing proportion) is not fully specified in the provided description; explicit link functions, variance functions, and how the two are combined for the indirect-effect estimator would improve clarity.
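The Jensen's inequality concern in the major comment can be checked with a small exact calculation. The two-point mediator distribution and the coefficients below are invented for illustration and are not taken from the paper.

```python
import math


def outcome_mean(m, b0=0.0, b1=0.3):
    """Log-link outcome mean as a function of the mediator value."""
    return math.exp(b0 + b1 * m)


def linear_mean(m, b0=0.0, b1=0.3):
    """Identity-link outcome mean, for comparison."""
    return b0 + b1 * m


# Invented zero-inflated mediator under a': 0 with probability 0.6, 5 with probability 0.4.
mediator_dist = {0: 0.6, 5: 0.4}
mediator_mean = sum(m * p for m, p in mediator_dist.items())

# Averaging the outcome mean over the mediator distribution (about 2.39).
integrated = sum(outcome_mean(m) * p for m, p in mediator_dist.items())
# Plugging the mediator mean into the outcome mean (about 1.82).
plug_in = outcome_mean(mediator_mean)

# With a linear outcome mean the two quantities coincide.
integrated_lin = sum(linear_mean(m) * p for m, p in mediator_dist.items())
plug_in_lin = linear_mean(mediator_mean)
```

Because the exponential is convex, the integrated value exceeds the plug-in value here, and the gap grows with the spread of the mediator. Under the identity link the gap vanishes, matching the referee's remark that the plug-in is exact under linearity.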

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The single major comment raises an important point about identification in the estimation step, which we address below. We will revise the manuscript to provide the requested clarifications.

Point-by-point responses
  1. Referee: [Estimation step (ii)] Step (ii), estimation of indirect effects: the central claim that quasi-regression mean functions suffice to capture indirect effects is load-bearing. For natural indirect effects, identification requires E[Y(a, M(a'))], which equals the outcome mean function evaluated at E[M(a')] only under linearity or specific distributional conditions. With the nonlinear links typical for zero-inflated counts or proportions, Jensen's inequality implies that plugging in the mediator mean generally produces bias. The manuscript should state the precise identification formulas used, whether integration over the mediator distribution is performed, and under what conditions the mean-based procedure is consistent.

    Authors: We appreciate the referee highlighting this key identification issue. In the current manuscript, step (ii) estimates natural indirect effects by modeling the conditional mean functions of the outcome (average expression and proportion of expressing cells) via quasi-regression and substituting the estimated mean of the counterfactual mediator into these mean functions. This mean-plugging approach is used to avoid specifying full conditional distributions. We acknowledge that, for nonlinear mean functions, this does not in general equal the required E[Y(a, M(a'))] and may introduce bias due to Jensen's inequality. In the revision, we will add explicit statements of the identification formulas in the Methods section, clarify that no integration over the mediator distribution is performed, and specify the conditions (e.g., linearity of the outcome mean in the mediator) under which the estimators remain consistent. We will also include a brief discussion of this limitation and its implications for zero-inflated single-cell data.

    revision: yes
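The contrast at issue in this exchange can be written out explicitly. The display below is a sketch under the standard sequential-ignorability assumptions with covariates suppressed; the exposure symbol A, the mediator mean notation, and the coefficients θ are introduced here for illustration and are not the manuscript's notation.

```latex
\begin{align*}
\text{target:}\quad
  & E\{Y(a, M(a'))\} = \int E[Y \mid A = a, M = m]\, dF_{M \mid A = a'}(m), \\
\text{mean plug-in:}\quad
  & E[Y \mid A = a, M = \bar{m}(a')], \qquad \bar{m}(a') = E[M \mid A = a']. \\[6pt]
\text{linear mean:}\quad
  & E[Y \mid A = a, M = m] = \theta_0 + \theta_1 a + \theta_2 m \\
  & \Longrightarrow\; E\{Y(a, M(a'))\} = \theta_0 + \theta_1 a + \theta_2\, \bar{m}(a'), \\
  & \phantom{\Longrightarrow\;} E\{Y(a, M(a))\} - E\{Y(a, M(a'))\}
      = \theta_2 \{\bar{m}(a) - \bar{m}(a')\}. \\[6pt]
\text{log link:}\quad
  & E[Y \mid A = a, M = m] = \exp(\theta_0 + \theta_1 a + \theta_2 m) \\
  & \Longrightarrow\; \int \exp(\theta_0 + \theta_1 a + \theta_2 m)\, dF_{M \mid A = a'}(m)
      \;\ge\; \exp\{\theta_0 + \theta_1 a + \theta_2\, \bar{m}(a')\}.
\end{align*}
```

In the linear case the target and the plug-in coincide, so the natural indirect effect reduces to a product of coefficients. In the log-link case equality holds only if θ₂ = 0 or the mediator is degenerate given A = a', which is the gap the authors agree to document.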

Circularity Check

0 steps flagged

No circularity: quasi-regression mean specification is independent of the target indirect-effect functional

full rationale

The paper's central claim is that specifying only mean functions via quasi-regression relaxes distributional assumptions for mediation in zero-inflated single-cell data. No equation or step in the provided description reduces a derived quantity (e.g., indirect-effect estimator) to a fitted parameter by algebraic identity or by renaming the input. The three-step procedure (screening, mean-based estimation, testing) is presented as a new construction whose validity rests on external identification arguments for mediation, not on self-definition or self-citation chains. The skeptic's concern about Jensen bias for natural indirect effects is a potential correctness or identification gap, not a circularity; it does not make any claimed prediction equivalent to its inputs by construction. The method is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the method is described only at the level of steps and the quasi-regression relaxation of distributional assumptions.


