Recognition: unknown
Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery
Pith reviewed 2026-05-10 11:23 UTC · model grok-4.3
The pith
LLM chain-of-thought reasoning filters saliency genes from a Mamba model to improve breast cancer classifier performance while using far fewer features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient saliency from a Mamba SSM on TCGA-BRCA RNA-seq yields candidate biomarkers contaminated by confounders. Using DeepSeek-R1 with structured chain-of-thought reasoning filters this to 17 genes. This filtered set achieves higher AUC (0.927) on held-out test than a 5000-gene variance baseline (0.903) or the raw 50-gene set (0.832). A faithfulness audit (COSMIC CGC, OncoKB, PAM50) reveals 6 of 17 selected genes are validated BRCA biomarkers while 10 of 16 known BRCA genes were missed.
What carries the argument
Structured chain-of-thought evaluation by the LLM to remove tissue-composition confounders from Mamba-derived gradient saliency gene lists.
Load-bearing premise
The LLM's chain-of-thought reasoning can reliably separate true BRCA biomarkers from tissue-composition confounders without introducing its own biases or missing important genes.
What would settle it
Retrain the downstream classifier on an independent BRCA RNA-seq cohort using the identical 17-gene set and check whether its AUC advantage over the 5000-gene variance baseline persists.
Figures
read the original abstract
Gradient saliency from deep sequence models surfaces candidate biomarkers efficiently, but the resulting gene lists can be contaminated by tissue-composition confounders that degrade downstream classifiers. We study whether LLM chain-of-thought (CoT) reasoning can filter these confounders, and whether reasoning quality is associated with downstream performance. We train a Mamba SSM on TCGA-BRCA RNA-seq and extract the top-50 genes by gradient saliency; DeepSeek-R1 evaluates every candidate with structured CoT to produce a final 17-gene set. On the held-out test split, the raw 50-gene saliency set (no LLM) performs worse than a 5,000-gene variance baseline (AUC 0.832 vs. 0.903), while the LLM-filtered set surpasses it (AUC 0.927), using 294x fewer features. A faithfulness audit (COSMIC CGC, OncoKB, PAM50) shows that 6 of 17 selected genes (35.3%) are validated BRCA biomarkers, while 10 of 16 known BRCA genes present in the input were missed - including FOXA1. This divergence between downstream performance and reasoning faithfulness suggests selective faithfulness in this setting: targeted confounder removal can improve predictive performance without comprehensive recall.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that gradient saliency from a Mamba SSM trained on TCGA-BRCA RNA-seq data yields a top-50 gene set whose downstream classifier underperforms a 5,000-gene variance baseline (AUC 0.832 vs 0.903), but that filtering the 50 candidates via DeepSeek-R1 chain-of-thought reasoning to a 17-gene set removes tissue-composition confounders and raises held-out AUC to 0.927 while using 294x fewer features. A faithfulness audit against COSMIC CGC, OncoKB and PAM50 shows 6/17 selected genes are validated BRCA biomarkers and 10/16 known genes are missed, which the authors interpret as evidence of selective faithfulness.
Significance. If substantiated, the hybrid Mamba-plus-LLM pipeline offers a concrete route to improve both predictive performance and feature parsimony in high-dimensional transcriptomic classification, and the selective-faithfulness observation supplies a testable hypothesis about when LLM reasoning can usefully prune confounders without requiring exhaustive recall.
major comments (3)
- [Abstract / Results] Abstract and Results: the claim that the AUC lift from 0.832 to 0.927 is caused by LLM removal of tissue-composition confounders lacks direct supporting evidence; no comparison is reported between the 33 excluded genes and the 17 retained genes on any confounder proxy (e.g., correlation with ESTIMATE stromal/immune scores, batch PCs, or infiltration signatures).
- [Methods] Methods: the manuscript supplies neither the exact LLM prompt template, the structured CoT filtering criteria, the precise train/validation/test split ratios and random seed, nor any statistical test for the reported AUC differences, rendering the central performance and faithfulness claims non-reproducible from the given description.
- [Results] Results / Faithfulness audit: the audit reports 6/17 overlap with known BRCA genes and 10/16 misses, yet provides no enrichment test or differential analysis showing that the excluded genes are enriched for confounders relative to the kept set; without this, the selective-faithfulness interpretation remains an untested post-hoc explanation.
minor comments (1)
- [Results] Table or figure captions should explicitly state the number of features in the variance baseline and whether AUC confidence intervals or DeLong tests were computed.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve reproducibility and strengthen the supporting evidence.
read point-by-point responses
-
Referee: [Abstract / Results] Abstract and Results: the claim that the AUC lift from 0.832 to 0.927 is caused by LLM removal of tissue-composition confounders lacks direct supporting evidence; no comparison is reported between the 33 excluded genes and the 17 retained genes on any confounder proxy (e.g., correlation with ESTIMATE stromal/immune scores, batch PCs, or infiltration signatures).
Authors: We agree that the manuscript currently lacks direct empirical comparisons between the excluded and retained gene sets on confounder proxies. In the revision we will add these analyses, computing correlations of both gene sets with ESTIMATE stromal/immune scores, batch principal components, and infiltration signatures to provide concrete support for the confounder-removal interpretation. revision: yes
-
Referee: [Methods] Methods: the manuscript supplies neither the exact LLM prompt template, the structured CoT filtering criteria, the precise train/validation/test split ratios and random seed, nor any statistical test for the reported AUC differences, rendering the central performance and faithfulness claims non-reproducible from the given description.
Authors: We acknowledge that the current description omits these details. The revised manuscript will include the full DeepSeek-R1 prompt template and structured CoT criteria, the exact train/validation/test split ratios together with the random seed used, and statistical comparisons of AUC values (e.g., DeLong test with p-values and confidence intervals). revision: yes
-
Referee: [Results] Results / Faithfulness audit: the audit reports 6/17 overlap with known BRCA genes and 10/16 misses, yet provides no enrichment test or differential analysis showing that the excluded genes are enriched for confounders relative to the kept set; without this, the selective-faithfulness interpretation remains an untested post-hoc explanation.
Authors: The referee correctly identifies the absence of an enrichment or differential analysis. We will add such tests in the revision (e.g., differential correlation with confounder proxies or gene-set enrichment comparing the two gene sets) to evaluate whether excluded genes are enriched for confounders and thereby ground the selective-faithfulness claim in quantitative evidence. revision: yes
Circularity Check
No circularity: empirical pipeline with held-out evaluation and external audits
full rationale
The paper describes an empirical workflow: train Mamba-SSM on TCGA-BRCA RNA-seq, rank genes by gradient saliency, apply LLM CoT filtering to produce a 17-gene subset, then evaluate AUC on a held-out test split against a variance baseline and audit selected genes against COSMIC CGC, OncoKB, and PAM50. No equations, derivations, or first-principles claims appear in the provided text. Performance numbers (AUC 0.832 vs. 0.903 vs. 0.927) are computed on independent test data; the faithfulness audit uses external databases. No self-citations, fitted parameters renamed as predictions, or self-definitional reductions are present. The central claim therefore rests on observable downstream metrics rather than any input-to-output equivalence by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- top-50 genes cutoff
- final 17-gene set size
axioms (2)
- domain assumption Gradient saliency from a trained Mamba-SSM on RNA-seq data surfaces candidate biomarkers
- ad hoc to paper LLM chain-of-thought reasoning can accurately identify and remove tissue-composition confounders
Reference graph
Works this paper leans on
-
[1]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752
work page Pith review arXiv 2024
-
[2]
Daniel P. Jeong, Zachary C. Lipton, and Pradeep Ravikumar. Llm-select: Feature selection with large language models, 2025. URL https://arxiv.org/abs/2407.02694
-
[3]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwel...
work page Pith review arXiv 2023
-
[4]
Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, and Li Shen. Knowledge-driven feature selection and engineering for genotype data with large language models, 2025. URL https://arxiv.org/abs/2410.01795
-
[5]
Parker, Michael Mullins, Maggie C.U
Joel S. Parker, Michael Mullins, Maggie C.U. Cheang, Leung S., Voduc D., Vickery T., Davies S., Fauron C., He X., Hu Z., Quackenbush J.F., Stijleman I.J., Palazzo J., Marron J.S., Nobel A.B., Mardis E., Nielsen T.O., Ellis M.J., Perou C.M., and Bernard P.S. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncolog...
-
[6]
Nicholas Pudjihartono, Tayaza Fadason, Andreas W. Kempa-Liehr, and Justin M. O'Sullivan. A review of feature selection methods for machine learning-based disease risk prediction. Frontiers in Bioinformatics, 2: 0 927312, 2022. doi:10.3389/fbinf.2022.927312. URL https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2022.927312/full
-
[7]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. URL https://arxiv.org/abs/1312.6034
work page Pith review arXiv 2014
- [8]
-
[9]
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal
Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, and Robert Tibshirani. Llm-lasso: A robust framework for domain-informed feature selection and regularization, 2025. URL https://arxiv.org/abs/2502.10648
-
[10]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[11]
@esa (Ref
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[12]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[13]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.