
arxiv: 2509.02648 · v1 · submitted 2025-09-02 · 🧬 q-bio.GN · cs.LG · q-bio.QM · stat.AP

Optimizing Prognostic Biomarker Discovery in Pancreatic Cancer Through Hybrid Ensemble Feature Selection and Multi-Omics Data

Pith reviewed 2026-05-18 20:00 UTC · model grok-4.3

classification 🧬 q-bio.GN · cs.LG · q-bio.QM · stat.AP
keywords feature selection · multi-omics · pancreatic cancer · survival prediction · ensemble methods · biomarker discovery · prognostic modeling · hybrid selection

The pith

A hybrid ensemble feature selection method finds fewer and more stable prognostic biomarkers in pancreatic cancer multi-omics data than late-fusion CoxLasso while keeping similar performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a hybrid ensemble feature selection approach that combines data subsampling, multiple survival models, and both embedded and wrapper strategies to rank and select features for predicting patient survival from high-dimensional multi-omics data. Features are aggregated through a voting-theory-inspired mechanism across models and subsamples, and the optimal number is chosen automatically via a Pareto front that balances accuracy against sparsity. When tested on multi-omics datasets from three pancreatic cancer cohorts, the method produces significantly fewer and more stable biomarkers than conventional late-fusion CoxLasso models without loss of discrimination power. This matters for turning noisy omics measurements into reliable, clinically usable prognostic signatures that avoid overfitting and support validation in new patients.

Core claim

The hybrid ensemble feature selection (hEFS) approach integrates data subsampling with multiple prognostic models using embedded and wrapper-based strategies for survival prediction. Omics features are ranked by a voting-theory-inspired aggregation across models and subsamples, and the optimal feature count is selected via a Pareto front that balances predictive accuracy and model sparsity without user-defined thresholds. Applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers than conventional late-fusion CoxLasso models while maintaining comparable discrimination performance.
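The aggregation step can be pictured with a minimal sketch. The text above does not state which voting rule hEFS uses, so plain approval voting is assumed here purely for illustration: every (model, subsample) pair acts as one voter that approves the features it selected, and features are ranked by their share of approvals. The function name `approval_rank` and the feature names are hypothetical.

```python
from collections import Counter


def approval_rank(selections):
    """Rank features by the share of voters that selected them.

    `selections` holds one collection of feature names per voter, where a
    voter is one (model, subsample) combination. Approval voting is an
    assumption of this sketch, not necessarily the rule used by hEFS.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    votes = Counter()
    for chosen in selections:
        votes.update(set(chosen))  # each voter approves a feature at most once
    n_voters = len(selections)
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(feature, count / n_voters) for feature, count in ranked]


# Made-up feature sets from 2 models x 2 subsamples.
selections = [
    {"geneA", "geneB", "mirX"},
    {"geneA", "geneB"},
    {"geneA", "mirX", "cpgY"},
    {"geneA", "geneB", "cpgY"},
]
ranking = approval_rank(selections)
for feature, share in ranking:
    print(f"{feature}: {share:.2f}")
```

A feature approved by every voter ranks first; features picked up by only a few model and subsample combinations fall toward the bottom, which is the intuition behind using aggregation to favor stable features.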

What carries the argument

The hEFS method, which ranks features via voting-theory-inspired aggregation across multiple models and data subsamples, then selects the feature count through Pareto-front optimization that balances accuracy against sparsity.
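A rough sketch of the Pareto-front step, under stated assumptions: each candidate is a pair (number of features, discrimination score), fewer features and higher score are both better, and the final pick is the front point closest to the ideal corner after min-max scaling. That knee rule and the scaling are assumptions of this sketch, the source does not say how hEFS picks a point on the front, and the numbers are made up.

```python
def pareto_front(points):
    """Return the non-dominated (n_features, score) pairs.

    A point is dominated when another point has no more features and no
    lower score, and is strictly better on at least one of the two.
    """
    front = []
    best_score = float("-inf")
    # Fewest features first; for equal counts, the highest score first.
    for n, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            front.append((n, score))
            best_score = score
    return front


def knee_point(front):
    """Pick the front point closest to the ideal corner (assumed rule).

    Both objectives are min-max scaled to [0, 1] first; this scaling is
    itself a modelling choice and can change which point is picked.
    """
    ns = [n for n, _ in front]
    ss = [s for _, s in front]
    n_span = (max(ns) - min(ns)) or 1
    s_span = (max(ss) - min(ss)) or 1

    def distance(p):
        n_norm = (p[0] - min(ns)) / n_span  # 0 means sparsest
        s_norm = (max(ss) - p[1]) / s_span  # 0 means most accurate
        return (n_norm ** 2 + s_norm ** 2) ** 0.5

    return min(front, key=distance)


# Hypothetical (n_features, C-index) candidates.
candidates = [(2, 0.60), (5, 0.68), (10, 0.70), (10, 0.66), (20, 0.69), (50, 0.71)]
front = pareto_front(candidates)
knee = knee_point(front)
print(front)
print(knee)
```

The set of non-dominated points does not change under monotone rescaling of either axis, but the knee chosen from it can, so "no user-defined thresholds" still leaves the objectives and their scaling as choices.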

If this is right

  • Prognostic models using hEFS-selected features achieve comparable survival discrimination to models using more features from CoxLasso.
  • The resulting biomarkers exhibit greater consistency across different data subsamples, raising reliability for downstream clinical use.
  • Automatic Pareto-front selection removes the need for arbitrary user thresholds when choosing how many features to retain.
  • The method is implemented in the open-source mlr3fselect R package and can be applied to other high-dimensional survival settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hybrid aggregation and Pareto selection steps could be tested on multi-omics data from additional cancer types to check whether stable biomarker reduction generalizes beyond pancreatic cases.
  • Stability gains from subsampling and voting might translate to improved reproducibility when the selected biomarkers are validated in independent external cohorts.
  • The Pareto-front idea for trading off accuracy and sparsity could be adapted to feature selection tasks outside survival analysis, such as classification or regression in other high-dimensional biological datasets.

Load-bearing premise

The voting-theory-inspired aggregation across subsamples and models combined with Pareto-front selection will reliably produce fewer and more stable feature sets than late-fusion CoxLasso without hidden dependence on the specific pancreatic cancer cohorts or preprocessing choices.

What would settle it

The central claim would be falsified if applying hEFS and late-fusion CoxLasso to the same three cohorts, or to new independent pancreatic cancer multi-omics datasets, showed no significant reduction in feature count or no improvement in stability metrics at comparable discrimination performance.

Original abstract

Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers compared to the conventional, late-fusion CoxLasso models, while maintaining comparable discrimination performance. Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a hybrid ensemble feature selection (hEFS) pipeline for survival prediction from multi-omics data in pancreatic cancer. The method integrates data subsampling, multiple prognostic models, a voting-theory-inspired aggregation for feature ranking, and Pareto-front optimization to select the number of features without user-defined thresholds. When applied to three pancreatic cancer cohorts, the authors report that hEFS yields significantly fewer and more stable biomarkers than late-fusion CoxLasso while preserving comparable discrimination performance. The approach is implemented as an open-source extension in the mlr3fselect R package.

Significance. If the reported gains in sparsity and stability are robustly attributable to the voting aggregation and Pareto selection rather than the embedded subsampling, the work would offer a practical, interpretable tool for high-dimensional prognostic modeling in oncology. The open-source implementation and focus on clinically relevant endpoints (survival discrimination plus biomarker reliability) strengthen its potential utility.

major comments (2)
  1. [Results] Results section: the abstract and main text claim 'significantly fewer and more stable biomarkers' with 'comparable discrimination performance,' yet no numerical values are supplied for biomarker counts, stability metrics (e.g., Jaccard index or overlap across folds/cohorts), discrimination metrics (C-index or AUC), or the statistical tests used to support significance. This absence prevents evaluation of whether the improvements are load-bearing for the central claim.
  2. [Methods] Methods section: hEFS explicitly incorporates data subsampling across models, while the late-fusion CoxLasso baseline is described without mention of equivalent bootstrap or subsample aggregation. Because stability is typically increased by any internal resampling, the comparison does not isolate the contribution of the voting-theory aggregation or Pareto-front step; a re-run of CoxLasso with matched subsampling is required to substantiate that the novel components drive the reported stability and sparsity advantages.
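The matched-subsampling control requested in major comment 2 can be sketched generically: run any selector on repeated random subsamples and record how often each feature is chosen. Everything here is hypothetical scaffolding, including the function name, the 80% subsample fraction, and the toy selector; a real comparison would pass in a routine that fits CoxLasso on the given rows.

```python
import random


def selection_frequency(n_samples, select_fn, n_repeats=100, fraction=0.8, seed=0):
    """Share of subsamples in which each feature is selected.

    `select_fn` is a hypothetical callback that takes a sorted list of row
    indices and returns the feature names selected on those rows.
    """
    rng = random.Random(seed)
    size = max(1, int(round(fraction * n_samples)))
    counts = {}
    for _ in range(n_repeats):
        rows = sorted(rng.sample(range(n_samples), size))  # without replacement
        for feature in set(select_fn(rows)):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: c / n_repeats for feature, c in counts.items()}


def toy_selector(rows):
    """Stand-in selector: 'geneA' is always picked, 'noise' only if row 0 is present."""
    chosen = {"geneA"}
    if 0 in rows:
        chosen.add("noise")
    return chosen


freq = selection_frequency(n_samples=10, select_fn=toy_selector, n_repeats=50)
print(freq)
```

Giving the baseline and hEFS the same subsamples (same seed, same fraction) is what would let any remaining stability gap be attributed to the aggregation and Pareto steps rather than to resampling alone.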
minor comments (2)
  1. [Abstract] Abstract: the phrase 'without any user-defined thresholds' for Pareto selection should be clarified, as the front itself may still depend on the choice of objective functions or normalization.
  2. The manuscript would benefit from explicit cross-validation details (number of folds, outer/inner loops) and cohort-specific preprocessing steps to allow full reproducibility.
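The stability metric named in major comment 1 is simple to compute. Below is a minimal sketch of the mean pairwise Jaccard index over feature sets from different folds or subsamples; the helper names and the example sets are illustrative, and the manuscript may report a different stability measure.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard index over all unordered pairs of feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up selections from three folds each.
stable = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}]
unstable = [{"a", "b", "c"}, {"c", "d", "e"}, {"e", "f", "g"}]
print(round(mean_pairwise_jaccard(stable), 3))
print(round(mean_pairwise_jaccard(unstable), 3))
```

Values near 1 mean that nearly the same features are selected every time; values near 0 mean the selections barely overlap.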

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [Results] Results section: the abstract and main text claim 'significantly fewer and more stable biomarkers' with 'comparable discrimination performance,' yet no numerical values are supplied for biomarker counts, stability metrics (e.g., Jaccard index or overlap across folds/cohorts), discrimination metrics (C-index or AUC), or the statistical tests used to support significance. This absence prevents evaluation of whether the improvements are load-bearing for the central claim.

    Authors: We acknowledge that the manuscript as submitted does not provide the specific numerical values or statistical details necessary to fully evaluate the claims. In the revised version, we will add a dedicated results subsection or table that reports: (i) the exact number of biomarkers selected by hEFS versus late-fusion CoxLasso in each of the three cohorts; (ii) stability metrics including Jaccard indices for feature overlap across cross-validation folds and across cohorts; (iii) discrimination performance via C-index (or time-dependent AUC) with 95% confidence intervals; and (iv) the statistical tests (e.g., Wilcoxon signed-rank tests for paired comparisons) and associated p-values used to assess significance. These additions will make the central claims quantitatively verifiable. revision: yes

  2. Referee: [Methods] Methods section: hEFS explicitly incorporates data subsampling across models, while the late-fusion CoxLasso baseline is described without mention of equivalent bootstrap or subsample aggregation. Because stability is typically increased by any internal resampling, the comparison does not isolate the contribution of the voting-theory aggregation or Pareto-front step; a re-run of CoxLasso with matched subsampling is required to substantiate that the novel components drive the reported stability and sparsity advantages.

    Authors: This is a fair criticism of the experimental design. To isolate the effects of the voting-theory aggregation and Pareto-front optimization, we will conduct additional experiments in which the late-fusion CoxLasso baseline is also subjected to the same subsampling procedure used in hEFS. We will then directly compare the resulting biomarker counts, stability (e.g., Jaccard overlap), and discrimination performance between the subsampled CoxLasso and the complete hEFS pipeline. The revised manuscript will include these results and a discussion of how much of the observed advantage is attributable to the novel components versus subsampling alone. revision: yes
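For readers unfamiliar with the discrimination metric promised in the first response, here is a minimal pure-Python sketch of Harrell's concordance index. It is a simplified textbook version (pairs with tied survival times are skipped) and not the implementation used in the paper; the data are made up.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    A pair (i, j) is comparable when subject i has the shorter time and an
    observed event (events[i] is truthy). Higher risk should go with shorter
    survival. Tied risk scores count as half-concordant. Pairs with tied
    times are skipped, which is a simplification of this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Four hypothetical patients: survival time, event indicator (0 = censored), risk score.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

In this toy example five pairs are comparable and four are ordered correctly, giving 0.8; a value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.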

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper describes an algorithmic pipeline (hEFS) that combines explicit subsampling, multiple survival models, voting aggregation, and Pareto-front selection of feature count. The central empirical claim—that hEFS yields fewer and more stable biomarkers than late-fusion CoxLasso while preserving discrimination—is presented as an observed outcome on three external cohorts rather than a quantity derived by construction from fitted parameters inside the same equations. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the core method; the approach is implemented in the external mlr3fselect package. The derivation chain therefore remains self-contained and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard survival-analysis assumptions and the effectiveness of the described ensemble procedure; no new free parameters, invented entities, or ad-hoc axioms beyond conventional Cox-model and machine-learning practice are introduced in the abstract.

axioms (1)
  • Domain assumption: the Cox proportional hazards model assumptions hold for the prognostic models used inside the ensemble.
    The abstract refers to prognostic models and late-fusion CoxLasso, which presuppose the standard Cox model.

pith-pipeline@v0.9.0 · 5726 in / 1403 out tokens · 48105 ms · 2026-05-18T20:00:14.164573+00:00 · methodology

