StackFeat: a convergent algorithm for optimal predictor selection in genomic data
Pith reviewed 2026-05-08 08:48 UTC · model grok-4.3
The pith
StackFeat converges on stable, high-performing biomarker sets in genomic data by retaining only features that rank high on both signed effect size and selection frequency across repeated cross-validations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that iteratively accumulating signed coefficients and selection frequencies from repeated cross-validation folds, then retaining only the features that rank highly according to both statistics, produces a convergent algorithm that extracts a compact, stable, and high-accuracy predictor set from high-dimensional genomic data.
What carries the argument
A dual-criterion accumulator tracks both signed coefficients (effect strength and direction) and selection frequencies (a stability estimate) across iterations, which is meant to enforce convergence on features that are jointly optimal under both criteria.
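The accumulation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `accumulate`, the input format (one dict of signed coefficients per cross-validation fit), and the rule that a zero or missing coefficient means "not selected" are all assumptions made for the example.

```python
from collections import defaultdict

def accumulate(fold_coefs):
    """Summarise repeated cross-validation fits (hypothetical input format).

    fold_coefs: list of dicts, one per fit, mapping feature name -> signed
    coefficient. A feature that is missing or has coefficient 0.0 in a fit
    is treated as not selected in that fit (an assumption of this sketch).
    Returns {feature: (mean_signed_coefficient, selection_frequency)}.
    """
    coef_sum = defaultdict(float)
    times_selected = defaultdict(int)
    for coefs in fold_coefs:
        for feature, value in coefs.items():
            if value != 0.0:
                coef_sum[feature] += value       # keeps sign, so opposing effects cancel
                times_selected[feature] += 1     # counts fits in which the feature was chosen
    n_fits = len(fold_coefs)
    features = set().union(*fold_coefs)
    return {
        f: (coef_sum[f] / n_fits, times_selected[f] / n_fits)
        for f in features
    }

# Illustrative coefficients only; feature names are placeholders.
stats = accumulate([
    {"feat_a": 0.8, "feat_b": -0.2},
    {"feat_a": 0.6},
    {"feat_a": 0.7, "feat_c": 0.1},
    {"feat_a": 0.9, "feat_b": 0.1},
])
for name in sorted(stats):
    print(name, stats[name])
```

Averaging over all fits, rather than only the fits in which a feature was selected, penalises features that are chosen rarely or whose sign flips between fits; whether the paper normalises this way is not stated in the material summarised here.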
If this is right
- Reduces 332 features to a 5-miRNA signature (a 98.5 percent dimensionality reduction) while achieving an AUC of 0.922.
- Outperforms the benchmark 9-gene set (AUC 0.907), with the difference reported as statistically significant (p = 0.0016).
- Recovers known markers such as hsa-miR-150-5p and surfaces novel candidates.
- Supplies convergence guarantees that single-criterion selection lacks.
- Supports discovery of both established and previously unknown biomarker relationships.
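The reduction figure quoted in the list above follows directly from the feature counts. A short check, with `reduction_percent` as a helper name chosen for this example:

```python
def reduction_percent(n_original, n_selected):
    """Percentage of features removed when going from n_original to n_selected."""
    return 100.0 * (n_original - n_selected) / n_original

# 332 candidate features reduced to a 5-miRNA signature.
print(round(reduction_percent(332, 5), 1))  # 98.5
```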
Where Pith is reading between the lines
- The same dual tracking of magnitude and frequency could be tested on proteomics or single-cell RNA data where sample sizes are similarly constrained.
- If the convergence property generalizes, combining effect-size and selection-probability statistics may offer a principle for robust selection in other high-dimensional settings.
- The compact signature size could simplify translation into clinical diagnostic panels that require fewer measurements.
Load-bearing premise
The assumption that the features highest in both signed coefficients and selection frequency across cross-validations represent the true optimal and stable set of predictors.
What would settle it
Observing on new genomic datasets that the dual-criterion selected features fail to show better stability or predictive performance than single-criterion methods such as LASSO or random-forest importance rankings.
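One way such a stability comparison could be run is to repeat each selection method on resampled data and score how much the selected sets overlap. The sketch below uses mean pairwise Jaccard similarity; this metric is an assumption of the example and is not named in the paper material summarised here, and the function names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of selected feature sets.

    selections: list of feature collections, one per resampled run.
    Values near 1.0 mean the method picks nearly the same features every run.
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selected sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Placeholder feature names; compare the score of one method against another.
runs_method_1 = [{"f1", "f2"}, {"f1", "f2"}, {"f1", "f2"}]
runs_method_2 = [{"f1", "f2"}, {"f2", "f3"}, {"f4", "f5"}]
print(mean_pairwise_jaccard(runs_method_1), mean_pairwise_jaccard(runs_method_2))
```

A dual-criterion method that does not score higher than LASSO or random-forest rankings on a measure of this kind, at comparable predictive performance, would count against the stability claim.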
Original abstract
In high-dimensional genomic data, the curse of dimensionality (d >> n) and limited sampling make feature selection inherently unstable - a critical barrier to biomarker discovery. We introduce StackFeat, an iterative algorithm that accumulates two statistics across repeated cross-validation: signed coefficients (measuring effect strength and direction) and selection frequencies (estimating selection probability). Only features ranking highly by both criteria are retained. On a COVID-19 miRNA dataset (GSE240888), StackFeat identified a stable 5-miRNA signature from 332 features (98.5% reduction), achieving AUC 0.922, significantly outperforming the benchmark 9-gene set (AUC 0.907, p = 0.0016). The signature includes hsa-miR-150-5p, a marker implicated in both COVID-19 survival and Dengue infection. This dual-criterion approach provides convergence guarantees absent in single-criterion methods, enabling discovery of known biomarkers, novel candidates, and previously unknown relationships.
Keywords: marker selection, feature selection, bioinformatics, dimensionality reduction, robust algorithm, stacking, miRNA, COVID-19
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents StackFeat, a new iterative algorithm for optimal predictor selection in high-dimensional genomic data. It accumulates signed coefficients and selection frequencies from repeated cross-validations, retaining only features that rank highly on both metrics. The authors claim this provides convergence guarantees not found in single-criterion methods. Empirically, on the GSE240888 COVID-19 miRNA dataset with 332 features, it selects a stable 5-miRNA signature achieving an AUC of 0.922, which significantly outperforms a benchmark 9-gene set (AUC 0.907, p=0.0016), and recovers known biomarkers such as hsa-miR-150-5p.
Significance. If the dual-criterion iterative procedure can be shown to deliver stable predictors with the claimed convergence properties, StackFeat would represent a useful advance for biomarker discovery in d >> n genomic settings by mitigating instability in feature selection. The reported 98.5% feature reduction to a 5-miRNA signature with improved AUC and recovery of a biologically plausible marker (hsa-miR-150-5p) on an independent dataset indicates practical promise for miRNA signature development in infectious disease contexts.
major comments (3)
- [Abstract and Methods] The assertion that the dual-criterion approach 'provides convergence guarantees absent in single-criterion methods' is not supported by any derivation, theorem, or formal analysis. The algorithm is presented as a procedural iterative accumulation without equations or proofs establishing convergence of the retained feature set.
- [Results] The AUC comparison (0.922 vs 0.907) is reported on only a single external dataset (GSE240888) with no error bars, standard errors, or confidence intervals on the performance estimates and no additional independent validation cohorts, which weakens support for the general superiority and stability claims.
- [Methods] The thresholds or ranking criteria used to decide when features 'rank highly' by signed coefficients and by selection frequencies are not specified or justified. These choices are load-bearing for the feature retention step and affect both reproducibility and the optimality assertion.
minor comments (2)
- [Abstract] The keywords list 'stacking' but the algorithm description does not explicitly relate the iterative accumulation to ensemble stacking techniques; a brief clarification of the terminology would aid reader understanding.
- [Results] Inclusion of a table listing the final 5-miRNA signature together with their coefficient magnitudes and selection frequencies would improve the presentation of the empirical results.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We have addressed each major comment point-by-point below, making revisions to strengthen the manuscript where the concerns are valid. Our responses focus on clarifying the methods, qualifying unsupported claims, and improving the presentation of results.
- Referee: [Abstract and Methods] The assertion that the dual-criterion approach 'provides convergence guarantees absent in single-criterion methods' is not supported by any derivation, theorem, or formal analysis. The algorithm is presented as a procedural iterative accumulation without equations or proofs establishing convergence of the retained feature set.
  Authors: We agree that the original manuscript did not contain a formal derivation, theorem, or proof of convergence. The phrasing in the abstract was intended to describe the empirical behavior of the iterative dual-criterion accumulation, which we observed to produce stable feature sets across repeated CV runs, in contrast to single-pass methods. In the revised manuscript we have removed the claim of 'convergence guarantees' from the abstract and Methods, replacing it with a description of the observed stability. We have added pseudocode for the iterative procedure and a brief discussion of its heuristic properties in a new Methods subsection, while explicitly noting that a rigorous theoretical analysis remains future work.
  Revision: yes
- Referee: [Results] The AUC comparison (0.922 vs 0.907) is reported on only a single external dataset (GSE240888) with no error bars, standard errors, or confidence intervals on the performance estimates and no additional independent validation cohorts, which weakens support for the general superiority and stability claims.
  Authors: We accept that the performance comparison would be strengthened by uncertainty estimates and additional cohorts. In the revised Results we now report bootstrap-derived 95% confidence intervals and standard errors for both AUC values, and we have clarified that the reported p = 0.0016 comes from a DeLong test. We do not have access to further independent cohorts for this study; we have therefore expanded the Discussion to acknowledge this limitation and to recommend multi-cohort validation in future applications of StackFeat.
  Revision: partial
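A percentile bootstrap interval for an AUC, of the kind the response refers to, can be sketched as below. This is an illustration under stated assumptions: the function names, the number of resamples, and the choice of the percentile method are not taken from the manuscript, and the DeLong test mentioned in the response is not implemented here.

```python
import random

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) using a percentile bootstrap."""
    point = auc(labels, scores)  # also rejects single-class input up front
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # resample cases with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue                                  # resample lost a class; draw again
        estimates.append(auc(ys, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper

# Toy labels and scores for illustration only.
print(bootstrap_auc_ci([0, 0, 0, 1, 1, 1], [0.1, 0.3, 0.6, 0.4, 0.7, 0.9], n_boot=500))
```

To compare two signatures on the same samples, the same resampled indices would be applied to both score vectors so that the interval on the AUC difference respects the pairing.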
- Referee: [Methods] The thresholds or ranking criteria used to decide when features 'rank highly' by signed coefficients and by selection frequencies are not specified or justified. These choices are load-bearing for the feature retention step and affect both reproducibility and the optimality assertion.
  Authors: We apologize for the lack of explicit detail. The revised Methods section now specifies that a feature is retained only if it ranks in the top 5% by the absolute value of its signed coefficient (capturing effect strength) and exceeds a selection frequency of 0.7 across the repeated cross-validation iterations (capturing stability). These cut-offs were selected after preliminary sensitivity runs to achieve substantial dimensionality reduction while preserving predictive performance; we have added a short justification and a supplementary sensitivity analysis showing how AUC and feature count vary with modest changes to the thresholds.
  Revision: yes
Circularity Check
No significant circularity detected
The StackFeat algorithm is defined as a procedural iterative accumulation of signed coefficients and selection frequencies across repeated cross-validations on genomic data, with stability and performance demonstrated empirically on the independent GSE240888 dataset (AUC 0.922, 98.5% feature reduction). No load-bearing equations, self-referential definitions, or self-citation chains reduce the claimed convergence guarantees or optimal signature to fitted inputs by construction; the dual-criterion retention rule is an explicit procedural choice evaluated externally rather than derived tautologically from its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- ranking thresholds for coefficients and frequencies
axioms (1)
- Domain assumption: repeated cross-validation yields stable estimates of selection frequency and signed effect strength.
Reference graph
Works this paper leans on
- [1] R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
- [2] H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005)
- [3] D.H. Wolpert, Stacked generalization. Neural Netw. 5, 241–259 (1992)
- [4] J. Gao, E. Kyubwa et al., Circulating miRNA profiles in COVID-19 patients and meta-analysis: implications for disease progression and prognosis. Sci. Rep. 13, 21656 (2023). https://doi.org/10.1038/s41598-023-48227-w
- [5] N. Meinshausen, P. Bühlmann, Stability selection. J. R. Stat. Soc. B 72, 417–473 (2010)
- [6] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees. Mach. Learn. 63, 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1
- [7] L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001)
- [8] Y. Ding, S. Tang et al., Plasma miR-150-5p as a biomarker for chronic obstructive pulmonary disease. Int. J. Chron. Obstruct. Pulmon. Dis. 18, 399–406 (2023). https://doi.org/10.2147/COPD.S400985
- [9] H. Hapugaswatta, P. Amarasena et al., Differential expression of selected microRNA and putative target genes in peripheral blood cells as early markers of severe forms of dengue. medRxiv (2019). https://doi.org/10.1101/19002725
- [10] A. Fernandez-Pato et al., Plasma miRNA profile at COVID-19 onset predicts severity status and mortality. Emerg. Microbes Infect. 11, 676–688 (2022). https://doi.org/10.1080/22221751.2022.2038021
- [11] Y. Yang, D. Fang et al., Circulating microRNAs as emerging regulators of COVID-19. Theranostics 13, 125–147 (2023). https://doi.org/10.7150/thno.78164
- [12] A. Fernandez-Pato, Host miRNA differences by COVID-19 severity: identification of age and sex bias, Master thesis, Universidad Autónoma de Madrid (2021)
- [13] J.T. Chow, L. Salmena, Prediction and analysis of SARS-CoV-2-targeting microRNA in human lung epithelium. Genes 11, 1002 (2020). https://doi.org/10.3390/genes11091002
- [14] K. Pollet, N. Garnier et al., Host miRNAs as biomarkers of SARS-CoV-2 infection: a critical review. Sens. Diagn. 2, 12–35 (2023). https://doi.org/10.1039/D2SD00140C
- [15] S.R. Trampuz, D. Vogrinc et al., Shared miRNA landscapes of COVID-19 and neurodegeneration confirm neuroinflammation as an important overlapping feature. Front. Mol. Neurosci. 16, 1123955 (2023). https://doi.org/10.3389/fnmol.2023.1123955
- [16] A. Andalib, S. Rashed, The upregulation of hsa-mir-181b-1 and downregulation of its target CYLD in the late-stage of tumor progression of breast cancer. Indian J. Clin. Biochem. 35, 312–321 (2019). https://doi.org/10.1007/s12291-019-00826-z