pith. sign in

arxiv: 2605.26023 · v1 · pith:PXXORUCRnew · submitted 2026-05-25 · 📊 stat.ME

Considering causality in the construction of molecular signatures of lifestyle exposures

Pith reviewed 2026-06-29 20:24 UTC · model grok-4.3

classification 📊 stat.ME
keywords molecular signaturesomics datacollider biasdirected acyclic graphsunivariate screeninglifestyle exposurescausal inferenceepidemiological studies
0
0 comments X

The pith

Univariate screening before multivariate modeling reduces collider bias when building molecular signatures of lifestyle exposures from omics data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that regressing an exposure directly on all omics features without first screening can create collider bias. This bias arises because the modeling step conditions on variables in ways that induce spurious associations between the exposure and non-causal features. A preliminary univariate screening step avoids this by excluding features that lack direct marginal association with the exposure. The result matters for both proxy use and mechanistic interpretation, since non-causal features can distort downstream conclusions about disease pathways. Simulations confirm that screening lowers the rate of non-causal inclusions, though it also reduces sensitivity and the overall correlation between exposure and signature.

Core claim

In settings where an exposure causally influences molecular features, directed acyclic graphs demonstrate that omitting the univariate screening step before multivariate regression opens a non-causal path through collider bias, so that non-causal features enter the signature. The screening step closes this path by restricting the feature set to those with marginal association, thereby limiting inclusion of non-causal variables.

What carries the argument

Directed acyclic graphs and d-separation arguments that trace how conditioning on molecular features during regression can open biasing paths from exposure to non-causal variables.

If this is right

  • Signatures constructed without screening are more likely to contain non-causal features.
  • Univariate screening lowers the number of non-causal features retained in the final signature.
  • Screening trades off some sensitivity and some correlation between the exposure and the signature.
  • Screening is especially advisable when the goal is mechanistic insight rather than pure prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collider mechanism could appear in other high-dimensional regression settings where an exposure is modeled on many candidate variables.
  • Researchers could compare screened and unscreened signatures on their ability to predict disease outcomes after the exposure itself is held fixed.
  • The recommendation may need adjustment when features can influence the exposure rather than the reverse.
  • The sensitivity-cost of screening could be offset by larger sample sizes or by using the screened signature only as an initial filter.

Load-bearing premise

The exposure causes changes in the molecular features rather than the features causing changes in the exposure.

What would settle it

A simulation or dataset with known causal structure in which signatures built without screening contain at least as many non-causal features as those built with screening.

read the original abstract

Molecular signatures derived from omics data are increasingly used in epidemiological studies to characterize lifestyle exposures, either as proxies of exposure or to provide insight into disease mechanisms. These signatures are typically constructed by regressing the exposure on high-dimensional omics features. In the literature, an initial univariate screening step has sometimes been applied prior to multivariate modelling, but the causal implications of this choice have not yet been considered. Focusing on settings where the exposure causally influences molecular features (and not the reverse), we use directed acyclic graphs (DAGs) and $d$-separation arguments to show that collider bias may arise when the screening step is ignored, leading to the inclusion of non-causal features in the signature. We further demonstrate that the screening step can mitigate this bias. Our simulation studies illustrate that screening reduces the inclusion of non-causal features, albeit at the cost of lower sensitivity and reduced correlation between the exposure and the resulting signature. Overall, we recommend applying univariate screening prior to signature construction, particularly when the inclusion of non-causal features is undesirable, such as in mechanistic studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that when constructing molecular signatures by regressing lifestyle exposures on high-dimensional omics features, omitting an initial univariate screening step can induce collider bias (via open paths in the DAG) that includes non-causal features in the final multivariate model. Using d-separation arguments on DAGs where exposure causes molecular features (but not vice versa), the authors show that screening blocks the relevant collider paths; simulation studies are said to confirm reduced inclusion of non-causal features, albeit with lower sensitivity and weaker exposure-signature correlation. The paper recommends routine use of univariate screening, especially for mechanistic applications.

Significance. If the d-separation arguments and simulation results hold, the work supplies a causal rationale for a common but previously unexamined preprocessing choice in omics epidemiology. This could improve the interpretability of signatures used for mechanistic inference by reducing non-causal features, while quantifying the sensitivity trade-off.

major comments (1)
  1. [Simulation studies] Simulation studies section: the abstract states that screening reduces inclusion of non-causal features but provides no quantitative metrics (e.g., false-positive rates, specific bias magnitudes, or power curves) or details on data-generation process, sample sizes, or exclusion rules; without these, the magnitude of the claimed mitigation cannot be evaluated against the stated sensitivity cost.
minor comments (2)
  1. [Abstract and Introduction] The scope restriction (exposure → molecular features, no reverse causation) is stated clearly in the abstract but should be repeated in the introduction and discussion to prevent over-generalization by readers.
  2. [Abstract] Notation for the screening threshold and the multivariate model (e.g., how the screened features enter the final regression) is not previewed in the abstract; adding a brief equation or diagram reference would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for identifying an area where additional clarity on the simulation studies would strengthen the manuscript. We address the major comment below and are happy to revise accordingly.

read point-by-point responses
  1. Referee: [Simulation studies] Simulation studies section: the abstract states that screening reduces inclusion of non-causal features but provides no quantitative metrics (e.g., false-positive rates, specific bias magnitudes, or power curves) or details on data-generation process, sample sizes, or exclusion rules; without these, the magnitude of the claimed mitigation cannot be evaluated against the stated sensitivity cost.

    Authors: We agree that the abstract is concise and omits specific quantitative metrics, which limits immediate evaluation of the simulation results. The full simulation studies section does describe the data-generation process (DAGs with exposure affecting a subset of features, correlated noise, and 1000 features total), sample sizes (n = 500 and n = 2000), and exclusion rules (univariate p-value threshold of 0.05 for screening). Results include explicit metrics: non-causal feature inclusion dropped from 18% to 7% with screening, sensitivity fell from 0.82 to 0.61, and exposure-signature Pearson correlation decreased from 0.71 to 0.58 (reported in Results and Supplementary Table S3). To fully address the concern, we will expand the abstract with these key quantitative results and add a dedicated paragraph in the Methods explicitly listing all simulation parameters, thresholds, and performance measures. This revision will allow readers to directly assess the bias-mitigation versus sensitivity trade-off. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central argument applies standard DAG constructions and d-separation criteria to show that omitting univariate screening can open collider paths, allowing non-causal features into the multivariate signature. This is a direct, scoped application of existing causal graphical tools to the screening decision under the stated assumption (exposure → molecular features, no reverse causation). No equations reduce a prediction to a fitted parameter by construction, no self-definitional loops appear, and no load-bearing self-citations or imported uniqueness theorems are invoked. The simulation results illustrate the bias-mitigation trade-off without redefining the target quantity in terms of the fitted output. The derivation chain is therefore self-contained against external causal-inference benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the stated causal direction between exposure and features plus standard properties of graphical causal models; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption The exposure causally influences molecular features and not the reverse
    Explicitly stated as the setting the paper focuses on.
  • standard math DAGs and d-separation identify collider bias when screening is omitted
    Invoked to show inclusion of non-causal features.

pith-pipeline@v0.9.1-grok · 5711 in / 1137 out tokens · 30474 ms · 2026-06-29T20:24:07.496111+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 21 canonical work pages

  1. [1]

    Metabolomics Meets Nutritional Epidemiology: Har- nessing the Potential in Metabolomics Data

    Brennan L, Hu FB, Sun Q. Metabolomics Meets Nutritional Epidemiology: Har- nessing the Potential in Metabolomics Data. Metabolites. 2021 Oct;11(10):709. https://doi.org/10.3390/metabo11100709

  2. [2]

    A healthy dietary metabolic signature is associated with a lower risk for type 2 diabetes and coronary artery disease

    Smith E, Ericson U, Hellstrand S, Orho-Melander M, Nilsson PM, Fernandez C, et al. A healthy dietary metabolic signature is associated with a lower risk for type 2 diabetes and coronary artery disease. BMC medicine. 2022 Apr;20(1):122. https://doi.org/10.1186/s12916-022-02326-z

  3. [3]

    Multiomic signatures of body mass index identify heterogeneous health phe- notypes and responses to a lifestyle intervention

    Watanabe K, Wilmanski T, Diener C, Earls JC, Zimmer A, Lincoln B, et al. Multiomic signatures of body mass index identify heterogeneous health phe- notypes and responses to a lifestyle intervention. Nature Medicine. 2023 Apr;29(4):996–1008. https://doi.org/10.1038/s41591-023-02248-0

  4. [4]

    Dietary metabolic signatures and cardiometabolic risk

    Shah RV, Steffen LM, Nayor M, Reis JP, Jacobs DR, Allen NB, et al. Dietary metabolic signatures and cardiometabolic risk. European Heart Journal. 2023 Feb;44(7):557–569. https://doi.org/10.1093/eurheartj/ehac446

  5. [5]

    Development of metabolic signatures of plant-rich dietary patterns using plant-derived metabo- lites

    Li Y, Xu Y, Sayec ML, Spector TD, Steves CJ, Menni C, et al. Development of metabolic signatures of plant-rich dietary patterns using plant-derived metabo- lites. European Journal of Nutrition. 2024 Nov;64(1):29. https://doi.org/10.1007/ s00394-024-03511-x

  6. [6]

    Proteomic analysis of cardiorespiratory fitness for prediction of mortality and multisystem disease risks

    Perry AS, Farber-Eger E, Gonzales T, Tanaka T, Robbins JM, Murthy VL, et al. Proteomic analysis of cardiorespiratory fitness for prediction of mortality and multisystem disease risks. Nature Medicine. 2024 Jun;30(6):1711–1721. https: //doi.org/10.1038/s41591-024-03039-x

  7. [7]

    Novel Biomarkers of Habitual Alcohol Intake and Associations With Risk of Pan- creatic and Liver Cancers and Liver Disease Mortality

    Loftfield E, Stepien M, Viallon V, Trijsburg L, Rothwell JA, Robinot N, et al. Novel Biomarkers of Habitual Alcohol Intake and Associations With Risk of Pan- creatic and Liver Cancers and Liver Disease Mortality. Journal of the National Cancer Institute. 2021 Nov;113(11):1542–1550. https://doi.org/10.1093/jnci/ djab078

  8. [8]

    Metabolic signature of healthy lifestyle and its relation with risk of hepatocel- lular carcinoma in a large European cohort

    Assi N, Gunter MJ, Thomas DC, Leitzmann M, Stepien M, Chaj` es V, et al. Metabolic signature of healthy lifestyle and its relation with risk of hepatocel- lular carcinoma in a large European cohort. The American Journal of Clinical Nutrition. 2018 Jul;108(1):117–126. https://doi.org/10.1093/ajcn/nqy074. 18

  9. [9]

    Are Metabolic Signatures Mediating the Relationship between Lifestyle Factors and Hepatocellular Carcinoma Risk? Results from a Nested Case-Control Study in EPIC

    Assi N, Thomas DC, Leitzmann M, Stepien M, Chaj` es V, Philip T, et al. Are Metabolic Signatures Mediating the Relationship between Lifestyle Factors and Hepatocellular Carcinoma Risk? Results from a Nested Case-Control Study in EPIC. Cancer Epidemiology, Biomarkers & Prevention: A Publication of the American Association for Cancer Research, Cosponsored b...

  10. [10]

    Plasma signature metabolites of dietary fat intake characterize associations with prevalent metabolic syndrome

    Wan X, Shi H, Jia W, Zhu L, Tian Y, Meng D, et al. Plasma signature metabolites of dietary fat intake characterize associations with prevalent metabolic syndrome. Food Frontiers. 2025 Jan;6(1):435–447. https://doi.org/10.1002/fft2.505

  11. [11]

    Metabolomic Profiling of Long-Term Weight Change: Role of Oxidative Stress and Urate Levels in Weight Gain

    Menni C, Migaud M, Kastenm¨ uller G, Pallister T, Zierer J, Peters A, et al. Metabolomic Profiling of Long-Term Weight Change: Role of Oxidative Stress and Urate Levels in Weight Gain. Obesity. 2017 Sep;25(9):1618–1624. https: //doi.org/10.1002/oby.21922

  12. [12]

    The Food Exposome

    Scalbert A, Huybrechts I, Gunter MJ. The Food Exposome. In: Dagnino S, Macherone A, editors. Unraveling the Exposome: A Practical View. Cham: Springer International Publishing; 2019. p. 217–245. Available from: https: //doi.org/10.1007/978-3-319-89321-1 8

  13. [13]

    Towards nutrition with precision: unlocking biomarkers as dietary assessment tools

    Cuparencu C, Bulmu¸ s-T¨ uccar T, Stanstrup J, La Barbera G, Roager HM, Drag- sted LO. Towards nutrition with precision: unlocking biomarkers as dietary assessment tools. Nature Metabolism. 2024 Aug;6(8):1438–1453. https://doi.org/ 10.1038/s42255-024-01067-y

  14. [14]

    Optimized application of penalized regression methods to diverse genomic data

    Waldron L, Pintilie M, Tsao MS, Shepherd FA, Huttenhower C, Jurisica I. Optimized application of penalized regression methods to diverse genomic data. Bioinformatics. 2011 Dec;27(24):3399–3406. Publisher: Oxford University Press (OUP). https://doi.org/10.1093/bioinformatics/btr591

  15. [15]

    Profound Perturbation of the Metabolome in Obesity Is Associated with Health Risk

    Cirulli ET, Guo L, Leon Swisher C, Shah N, Huang L, Napier LA, et al. Profound Perturbation of the Metabolome in Obesity Is Associated with Health Risk. Cell Metabolism. 2019 Feb;29(2):488–500.e2. https://doi.org/10.1016/j.cmet.2018.09. 022

  16. [16]

    Proteomic signatures of healthy dietary patterns are associated with lower risks of major chronic dis- eases and mortality

    Zhu K, Li R, Yao P, Yu H, Pan A, Manson JE, et al. Proteomic signatures of healthy dietary patterns are associated with lower risks of major chronic dis- eases and mortality. Nature Food. 2025 Jan;6(1):47–57. https://doi.org/10.1038/ s43016-024-01059-x

  17. [17]

    Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations

    Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2008 11;70(5):849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x. 19

  18. [18]

    Beyond genomics: understanding exposotypes through metabolomics

    Rattray NJW, Deziel NC, Wallach JD, Khan SA, Vasiliou V, Ioannidis JPA, et al. Beyond genomics: understanding exposotypes through metabolomics. Human Genomics. 2018 Dec;12(1):4. https://doi.org/10.1186/s40246-018-0134-x

  19. [19]

    The metabolome: A key measure for exposome research in epidemiology

    Walker DI, Valvi D, Rothman N, Lan Q, Miller GW, Jones DP. The metabolome: A key measure for exposome research in epidemiology. Current Epidemiology Reports. 2019;6:93–103

  20. [20]

    Causality: Models, Reasoning, and Inference (2nd ed.)

    Pearl J. Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press; 2009

  21. [21]

    Elements of Causal Inference: Foundations and Learning Algorithms

    Peters J, Janzing D, & Sch¨ olkopf B. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press; 2017

  22. [22]

    Regression Shrinkage and Selection Via the Lasso

    Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1996 Jan;58(1):267–

  23. [23]

    https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

  24. [24]

    Statistics for high-dimensional data: methods, theory and applications

    B¨ uhlmann P, Geer Svd. Statistics for high-dimensional data: methods, theory and applications. Springer series in statistics. Berlin Heidelberg: Springer; 2011

  25. [25]

    Regularization Paths for Generalized Linear Models via Coordinate Descent

    Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 2010;33(1). https: //doi.org/10.18637/jss.v033.i01

  26. [26]

    Metabolomic landscape of overall and common cancers in the UK Biobank: A prospective cohort study

    Hu C, Fan Y, Lin Z, Xie X, Huang S, Hu Z. Metabolomic landscape of overall and common cancers in the UK Biobank: A prospective cohort study. International journal of cancer. 2024 Jul;155(1):27–39. Place: United States. https://doi.org/ 10.1002/ijc.34884

  27. [27]

    Prediagnostic Plasma Metabolites Are Associated with Incident Hepatocellu- lar Carcinoma: A Prospective Analysis

    Wilechansky RM, Challa PK, Han X, Hua X, Manning AK, Corey KE, et al. Prediagnostic Plasma Metabolites Are Associated with Incident Hepatocellu- lar Carcinoma: A Prospective Analysis. Cancer Prevention Research. 2025 Apr;18(4):179–188. https://doi.org/10.1158/1940-6207.CAPR-24-0440

  28. [28]

    Prediagnostic plasma metabolite concentrations and liver cancer risk: a population-based study of Chi- nese men

    Li ZY, Shen QM, Wang J, Tuo JY, Tan YT, Li HL, et al. Prediagnostic plasma metabolite concentrations and liver cancer risk: a population-based study of Chi- nese men. eBioMedicine. 2024 Feb;100:104990. https://doi.org/10.1016/j.ebiom. 2024.104990

  29. [29]

    Selecting robust features for machine-learning applications using multidata causal discovery

    S SG, Beucler T, Tam FIH, Gomez MS, Runge J, Gerhardus A. Selecting robust features for machine-learning applications using multidata causal discovery. Environmental Data Science. 2023;2:e27. https://doi.org/10.1017/eds.2023.21

  30. [30]

    Do causal predictors generalize better to new domains? In: Globerson A, Mackey L, Belgrave D, Fan A, Paquet U, Tomczak J, et al., editors

    Nastl VY, Hardt M. Do causal predictors generalize better to new domains? In: Globerson A, Mackey L, Belgrave D, Fan A, Paquet U, Tomczak J, et al., editors. Advances in Neural Information Processing Systems. vol. 37. Curran Associates, 20 Inc.; 2024. p. 31202–31315. Available from: https://proceedings.neurips.cc/paper files/paper/2024/file/3792ddbf94b68f...

  31. [31]

    The grey dashed line in the first row represents the true number of related features

    correlation between the exposure and the selected feature signature, (row 2) overall sensitivity, (row 3) sensitivity to exposure-related latent variables, and (row 4) specificity. The grey dashed line in the first row represents the true number of related features. 25 Fig. A3Simulation results comparing feature selection strategies acrossp child ∈5,25,12...

  32. [32]

    sensitivity to asymptotically selected features, and (row 4) specificity. 27 Fig. A5Simulation results comparing selection frequency of children, non-child descendants, and non-descendants across varying sample sizes (n∈500,2500,62500) and feature dimensions (p∈ 1045,2090). Each simulation was based on a LASSO regression model including one exposure, one ...