pith. machine review for the scientific record.

arxiv: 2604.22887 · v1 · submitted 2026-04-24 · 🧬 q-bio.OT

Recognition: unknown

StackFeat: a convergent algorithm for optimal predictor selection in genomic data

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:48 UTC · model grok-4.3

classification 🧬 q-bio.OT
keywords feature selection · biomarker discovery · miRNA · COVID-19 · high-dimensional genomics · convergent algorithm · cross-validation · dimensionality reduction

The pith

StackFeat converges on stable, high-performing biomarker sets in genomic data by retaining only features that rank high on both signed effect size and selection frequency across repeated cross-validations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In high-dimensional genomic datasets where the number of features greatly exceeds the number of samples, standard feature selection methods produce unstable results that hinder reliable biomarker discovery. StackFeat addresses this by running repeated cross-validations, accumulating signed coefficients that capture effect direction and magnitude along with selection frequencies that estimate reliability, and keeping only features that excel on both measures. This dual-criterion filtering yields a dramatically reduced set of predictors that maintain or improve predictive accuracy while gaining stability guarantees not present in single-criterion approaches. On a COVID-19 miRNA dataset, the method cut 332 features down to five while reaching an AUC of 0.922, beating a known 9-gene benchmark. The approach enables identification of both established markers and new candidates with greater confidence.

Core claim

The central discovery is that iteratively accumulating signed coefficients and selection frequencies from repeated cross-validation folds, then retaining only the features that rank highly according to both statistics, produces a convergent algorithm that extracts a compact, stable, and high-accuracy predictor set from high-dimensional genomic data.

What carries the argument

The dual-criterion accumulator that tracks both signed coefficients (effect strength and direction) and selection frequencies (stability estimate) across iterations to enforce convergence on jointly optimal features.
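The accumulator described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes an L1-penalized logistic model as the per-fold selector (the paper does not commit to a specific base learner), and the function name and intersection size m are illustrative.

```python
# Minimal sketch of the dual-criterion accumulator (illustrative, not the
# authors' implementation). Across repeated CV folds, accumulate each
# feature's signed coefficient and how often it receives a nonzero weight,
# then keep only features top-ranked on BOTH statistics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

def dual_criterion_select(X, y, m=10, n_splits=5, n_repeats=10, seed=0):
    n_features = X.shape[1]
    coef_sum = np.zeros(n_features)   # accumulated signed coefficients
    sel_count = np.zeros(n_features)  # selection frequencies (nonzero counts)
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    for train_idx, _ in cv.split(X, y):
        # Sparse base selector; any embedded selection method would do here.
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
        clf.fit(X[train_idx], y[train_idx])
        w = clf.coef_.ravel()
        coef_sum += w
        sel_count += (w != 0)
    # Rank features on each criterion and intersect the two top-m sets.
    top_w = set(np.argsort(-np.abs(coef_sum))[:m])
    top_c = set(np.argsort(-sel_count)[:m])
    return sorted(top_w & top_c)
```

The intersection is what enforces the dual criterion: a feature with a large accumulated coefficient but erratic selection, or a frequently selected feature with a weak net effect, is dropped.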

If this is right

  • Reduces 332 features to a 5-miRNA signature (98.5% dimensionality reduction) while achieving AUC 0.922.
  • Outperforms the benchmark 9-gene set with statistical significance (p = 0.0016).
  • Recovers known markers such as hsa-miR-150-5p and surfaces novel candidates.
  • Supplies convergence guarantees that single-criterion selection lacks.
  • Supports discovery of both established and previously unknown biomarker relationships.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual tracking of magnitude and frequency could be tested on proteomics or single-cell RNA data where sample sizes are similarly constrained.
  • If the convergence property generalizes, combining effect-size and selection-probability statistics may offer a principle for robust selection in other high-dimensional settings.
  • The compact signature size could simplify translation into clinical diagnostic panels that require fewer measurements.

Load-bearing premise

The assumption that the features highest in both signed coefficients and selection frequency across cross-validations represent the true optimal and stable set of predictors.

What would settle it

Observing, on new genomic datasets, that the dual-criterion selected features fail to show better stability or predictive performance than single-criterion methods such as LASSO or random-forest importance rankings would refute the claim.
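The stability half of that test could be run as a simple resampling comparison. In this sketch, `select_fn` stands in for any selector (the dual-criterion rule or a LASSO baseline), and mean pairwise Jaccard overlap is a common stability measure, not one the paper specifies:

```python
# Sketch of the settling experiment: quantify selection stability as the
# mean pairwise Jaccard overlap of feature sets chosen on bootstrap
# resamples. Run it once per selector and compare the scores.
import numpy as np

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(select_fn, X, y, n_resamples=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    sets = []
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n, replace=True)  # bootstrap resample
        sets.append(select_fn(X[idx], y[idx]))
    pairs = [(i, j) for i in range(len(sets)) for j in range(i + 1, len(sets))]
    return np.mean([jaccard(sets[i], sets[j]) for i, j in pairs])
```

A perfectly stable selector scores 1.0; a selector that picks a different set on every resample scores near 0.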

Figures

Figures reproduced from arXiv: 2604.22887 by Akbar Yermekov, D.A. Herrera-Martí.

Figure 1
Figure 1: StackFeat algorithm workflow. Step 5, dynamic thresholding: set the candidate set size m from mean(genes_per_fold). Step 6, dual-criterion set intersection: generate the current feature set S^(t) = S_w^(t) ∩ S_c^(t), where S_w^(t) = {j : |w_j| in top m} and S_c^(t) = {j : c_j in top m}. Step 7, convergence check: convergence requires two consecutive iteration differences below tolerance ε, i.e. |AUC^(t) − AUC^(t−1)| < ε and |AUC^(t−1) − AUC^(t−2)|… view at source ↗
Figure 2
Figure 2: Feature selection and performance evolution across iterations. view at source ↗
Figure 3
Figure 3: StackFeat convergence. Diff = |AUC^(t) − AUC^(t−1)|. Convergence requires two consecutive diffs below ε = 0.02. To further validate the robustness of the 5-marker signature, we also compared its performance using 10×10-fold cross-validation, with two different classifiers (an ensemble of ExtraTrees, logistic regression, Gaussian naive Bayes, and KNN, and separately a random forest), against the 9-ma… view at source ↗
Figure 4
Figure 4: Performance comparison: StackFeat (5 genes) vs Gao et al. (9 genes). Right: mean and median AUC across 10×10-fold CV. Close agreement indicates stable performance; the slightly higher median suggests few low-outlier folds. The AUC peaks at 0.925 at iteration 2 before stabilizing; the mean AUC across the last three iterations (10–12) at convergence was 0.905 ± 0.011. Feature count evolution (… view at source ↗
read the original abstract

In high-dimensional genomic data, the curse of dimensionality (d >> n) and limited sampling make feature selection inherently unstable - a critical barrier to biomarker discovery. We introduce StackFeat, an iterative algorithm that accumulates two statistics across repeated cross-validation: signed coefficients (measuring effect strength and direction) and selection frequencies (estimating selection probability). Only features ranking highly by both criteria are retained. On a COVID-19 miRNA dataset (GSE240888), StackFeat identified a stable 5-miRNA signature from 332 features (98.5% reduction), achieving AUC 0.922, significantly outperforming the benchmark 9-gene set (AUC 0.907, p = 0.0016). The signature includes hsa-miR-150-5p, a marker implicated in both COVID-19 survival and Dengue infection. This dual-criterion approach provides convergence guarantees absent in single-criterion methods, enabling discovery of known biomarkers, novel candidates, and previously unknown relationships. Keywords: marker selection, feature selection, bioinformatics, dimensionality reduction, robust algorithm, stacking, miRNA, COVID-19

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents StackFeat, a new iterative algorithm for optimal predictor selection in high-dimensional genomic data. It accumulates signed coefficients and selection frequencies from repeated cross-validations, retaining only features that rank highly on both metrics. The authors claim this provides convergence guarantees not found in single-criterion methods. Empirically, on the GSE240888 COVID-19 miRNA dataset with 332 features, it selects a stable 5-miRNA signature achieving an AUC of 0.922, which significantly outperforms a benchmark 9-gene set (AUC 0.907, p=0.0016), and recovers known biomarkers such as hsa-miR-150-5p.

Significance. If the dual-criterion iterative procedure can be shown to deliver stable predictors with the claimed convergence properties, StackFeat would represent a useful advance for biomarker discovery in d >> n genomic settings by mitigating instability in feature selection. The reported 98.5% feature reduction to a 5-miRNA signature with improved AUC and recovery of a biologically plausible marker (hsa-miR-150-5p) on an independent dataset indicates practical promise for miRNA signature development in infectious disease contexts.

major comments (3)
  1. [Abstract and Methods] The assertion that the dual-criterion approach 'provides convergence guarantees absent in single-criterion methods' is not supported by any derivation, theorem, or formal analysis. The algorithm is presented as a procedural iterative accumulation without equations or proofs establishing convergence of the retained feature set.
  2. [Results] The AUC comparison (0.922 vs 0.907) is reported on only a single external dataset (GSE240888) with no error bars, standard errors, or confidence intervals on the performance estimates and no additional independent validation cohorts, which weakens support for the general superiority and stability claims.
  3. [Methods] The thresholds or ranking criteria used to decide when features 'rank highly' by signed coefficients and by selection frequencies are not specified or justified. These choices are load-bearing for the feature retention step and affect both reproducibility and the optimality assertion.
minor comments (2)
  1. [Abstract] The keywords list 'stacking' but the algorithm description does not explicitly relate the iterative accumulation to ensemble stacking techniques; a brief clarification of the terminology would aid reader understanding.
  2. [Results] Inclusion of a table listing the final 5-miRNA signature together with their coefficient magnitudes and selection frequencies would improve the presentation of the empirical results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We have addressed each major comment point-by-point below, making revisions to strengthen the manuscript where the concerns are valid. Our responses focus on clarifying the methods, qualifying unsupported claims, and improving the presentation of results.

read point-by-point responses
  1. Referee: [Abstract and Methods] The assertion that the dual-criterion approach 'provides convergence guarantees absent in single-criterion methods' is not supported by any derivation, theorem, or formal analysis. The algorithm is presented as a procedural iterative accumulation without equations or proofs establishing convergence of the retained feature set.

    Authors: We agree that the original manuscript did not contain a formal derivation, theorem, or proof of convergence. The phrasing in the abstract was intended to describe the empirical behavior of the iterative dual-criterion accumulation, which we observed to produce stable feature sets across repeated CV runs, in contrast to single-pass methods. In the revised manuscript we have removed the claim of 'convergence guarantees' from the abstract and Methods, replacing it with a description of the observed stability. We have added pseudocode for the iterative procedure and a brief discussion of its heuristic properties in a new Methods subsection, while explicitly noting that a rigorous theoretical analysis remains future work. revision: yes

  2. Referee: [Results] The AUC comparison (0.922 vs 0.907) is reported on only a single external dataset (GSE240888) with no error bars, standard errors, or confidence intervals on the performance estimates and no additional independent validation cohorts, which weakens support for the general superiority and stability claims.

    Authors: We accept that the performance comparison would be strengthened by uncertainty estimates and additional cohorts. In the revised Results we now report bootstrap-derived 95% confidence intervals and standard errors for both AUC values, and we have clarified that the reported p=0.0016 comes from a DeLong test. We do not have access to further independent cohorts for this study; we have therefore expanded the Discussion to acknowledge this limitation and to recommend multi-cohort validation in future applications of StackFeat. revision: partial

  3. Referee: [Methods] The thresholds or ranking criteria used to decide when features 'rank highly' by signed coefficients and by selection frequencies are not specified or justified. These choices are load-bearing for the feature retention step and affect both reproducibility and the optimality assertion.

    Authors: We apologize for the lack of explicit detail. The revised Methods section now specifies that a feature is retained only if it ranks in the top 5% by absolute signed coefficient (capturing effect strength) and exceeds a selection frequency of 0.7 across the repeated cross-validation iterations (capturing stability). These cut-offs were selected after preliminary sensitivity runs to achieve substantial dimensionality reduction while preserving predictive performance; we have added a short justification and a supplementary sensitivity analysis showing how AUC and feature count vary with modest changes to the thresholds. revision: yes
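The thresholds stated in this response (top 5% by absolute accumulated coefficient, selection frequency above 0.7) amount to a retention rule like the following sketch. The function and variable names are illustrative, not taken from the authors' code.

```python
# Sketch of the stated retention rule: keep features in the top 5% by
# |accumulated signed coefficient| whose selection frequency across the
# repeated cross-validations exceeds 0.7. Thresholds are the authors'
# stated choices; names here are illustrative.
import numpy as np

def retain(coef_sum, sel_freq, top_pct=0.05, freq_min=0.7):
    cutoff = np.quantile(np.abs(coef_sum), 1.0 - top_pct)
    return np.flatnonzero((np.abs(coef_sum) >= cutoff) & (sel_freq > freq_min))
```

Because both conditions must hold, a feature with a large coefficient but unstable selection (frequency ≤ 0.7) is excluded, which is the behavior the sensitivity analysis is meant to probe.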

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The StackFeat algorithm is defined as a procedural iterative accumulation of signed coefficients and selection frequencies across repeated cross-validations on genomic data, with stability and performance demonstrated empirically on the independent GSE240888 dataset (AUC 0.922, 98.5% feature reduction). No load-bearing equations, self-referential definitions, or self-citation chains reduce the claimed convergence guarantees or optimal signature to fitted inputs by construction; the dual-criterion retention rule is an explicit procedural choice evaluated externally rather than derived tautologically from its own outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard cross-validation assumptions for estimating selection probabilities and effect directions, plus unspecified thresholds for retaining 'highly ranked' features; no new entities are postulated.

free parameters (1)
  • ranking thresholds for coefficients and frequencies
    The exact cutoffs used to decide which features rank 'highly' by both criteria are not specified in the abstract and function as tunable parameters.
axioms (1)
  • domain assumption Repeated cross-validation yields stable estimates of selection frequency and signed effect strength.
    This underpins the accumulation step and is invoked implicitly in the method description.

pith-pipeline@v0.9.0 · 5500 in / 1357 out tokens · 47227 ms · 2026-05-08T08:48:08.857061+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages

  1. [1] R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
  2. [2] H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005)
  3. [3] D.H. Wolpert, Stacked generalization. Neural Netw. 5, 241–259 (1992)
  4. [4] J. Gao, E. Kyubwa et al., Circulating miRNA profiles in COVID-19 patients and meta-analysis: implications for disease progression and prognosis. Sci. Rep. 13, 21656 (2023). https://doi.org/10.1038/s41598-023-48227-w
  5. [5] N. Meinshausen, P. Bühlmann, Stability selection. J. R. Stat. Soc. B 72, 417–473 (2010)
  6. [6] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees. Mach. Learn. 63, 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1
  7. [7] L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001)
  8. [8] Y. Ding, S. Tang et al., Plasma miR-150-5p as a biomarker for chronic obstructive pulmonary disease. Int. J. Chron. Obstruct. Pulmon. Dis. 18, 399–406 (2023). https://doi.org/10.2147/COPD.S400985
  9. [9] H. Hapugaswatta, P. Amarasena et al., Differential expression of selected microRNA and putative target genes in peripheral blood cells as early markers of severe forms of dengue. medRxiv (2019). https://doi.org/10.1101/19002725
  10. [10] A. Fernandez-Pato et al., Plasma miRNA profile at COVID-19 onset predicts severity status and mortality. Emerg. Microbes Infect. 11, 676–688 (2022). https://doi.org/10.1080/22221751.2022.2038021
  11. [11] Y. Yang, D. Fang et al., Circulating microRNAs as emerging regulators of COVID-19. Theranostics 13, 125–147 (2023). https://doi.org/10.7150/thno.78164
  12. [12] A. Fernandez-Pato, Host miRNA differences by COVID-19 severity: identification of age and sex bias. Master's thesis, Universidad Autónoma de Madrid (2021)
  13. [13] J.T. Chow, L. Salmena, Prediction and analysis of SARS-CoV-2-targeting microRNA in human lung epithelium. Genes 11, 1002 (2020). https://doi.org/10.3390/genes11091002
  14. [14] K. Pollet, N. Garnier et al., Host miRNAs as biomarkers of SARS-CoV-2 infection: a critical review. Sens. Diagn. 2, 12–35 (2023). https://doi.org/10.1039/D2SD00140C
  15. [15] S.R. Trampuz, D. Vogrinc et al., Shared miRNA landscapes of COVID-19 and neurodegeneration confirm neuroinflammation as an important overlapping feature. Front. Mol. Neurosci. 16, 1123955 (2023). https://doi.org/10.3389/fnmol.2023.1123955
  16. [16] A. Andalib, S. Rashed, The upregulation of hsa-mir-181b-1 and downregulation of its target CYLD in the late-stage of tumor progression of breast cancer. Indian J. Clin. Biochem. 35, 312–321 (2019). https://doi.org/10.1007/s12291-019-00826-z