pith · machine review for the scientific record

arxiv: 2604.14334 · v2 · submitted 2026-04-15 · 🧬 q-bio.QM · cs.AI

Recognition: unknown

Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:23 UTC · model grok-4.3

classification: 🧬 q-bio.QM · cs.AI
keywords: biomarker discovery · feature selection · LLM reasoning · Mamba SSM · RNA-seq · breast cancer · gradient saliency · faithfulness audit

The pith

LLM chain-of-thought reasoning filters saliency genes from a Mamba model to improve breast cancer classifier performance while using far fewer features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language model chain-of-thought reasoning can remove tissue-composition confounders from gene candidate lists produced by gradient saliency in a Mamba state space model trained on TCGA-BRCA RNA-seq data. The raw top-50 saliency genes yield an AUC of 0.832 on held-out test data, below a 5,000-gene variance baseline at 0.903, but LLM filtering to 17 genes raises AUC to 0.927. Only 6 of the 17 filtered genes are validated BRCA biomarkers, and the filter misses 10 of the 16 known BRCA genes present in the input, including FOXA1. This pattern indicates that targeted confounder removal can lift downstream accuracy even when the reasoning does not achieve comprehensive recall of established markers.

Core claim

Gradient saliency from a Mamba SSM trained on TCGA-BRCA RNA-seq yields candidate biomarkers contaminated by confounders. Filtering the candidates with DeepSeek-R1's structured chain-of-thought reasoning reduces the list to 17 genes. This filtered set achieves a higher AUC (0.927) on the held-out test split than either a 5,000-gene variance baseline (0.903) or the raw 50-gene set (0.832). A faithfulness audit (COSMIC CGC, OncoKB, PAM50) reveals that 6 of the 17 selected genes are validated BRCA biomarkers, while 10 of the 16 known BRCA genes present in the input were missed.

What carries the argument

Structured chain-of-thought evaluation by the LLM to remove tissue-composition confounders from Mamba-derived gradient saliency gene lists.
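The saliency step feeding that filter can be sketched as follows. This is a minimal stand-in, not the paper's implementation: a logistic-regression model replaces the trained Mamba SSM so the input gradient has a closed form, and the expression matrix is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for TCGA-BRCA expression: a logistic-regression
# "model" replaces the Mamba SSM so that the gradient of the predicted
# probability with respect to each input gene has a closed form.
n_samples, n_genes = 200, 1000
X = rng.standard_normal((n_samples, n_genes))
w = rng.standard_normal(n_genes) * (rng.random(n_genes) < 0.05)  # sparse weights
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# d sigmoid(x @ w + b) / d x_j = sigmoid'(z) * w_j, per sample
z = X @ w + b
grad = (sigmoid(z) * (1.0 - sigmoid(z)))[:, None] * w[None, :]

# Saliency per gene = mean |gradient| across samples; keep the top 50,
# mirroring the paper's candidate list before LLM filtering.
saliency = np.abs(grad).mean(axis=0)
top50 = np.argsort(saliency)[::-1][:50]
print("top candidate gene indices:", top50[:5])
```

For the actual Mamba SSM the gradient would come from autograd rather than a closed form, but the ranking step (mean absolute input gradient, then a top-k cutoff) is the same.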

Load-bearing premise

The LLM's chain-of-thought reasoning can reliably separate true BRCA biomarkers from tissue-composition confounders without introducing its own biases or missing important genes.

What would settle it

Retrain the downstream classifier on an independent BRCA RNA-seq cohort using the identical 17-gene set and check whether its AUC advantage over the 5000-gene variance baseline persists.
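That check needs only labels and classifier scores from the new cohort. A minimal, dependency-light AUC computation (the Mann-Whitney form; the cohort data itself is assumed to exist) could look like:

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the normalized Mann-Whitney U statistic: the probability
    that a randomly chosen positive outscores a randomly chosen negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    # Midranks handle tied scores correctly.
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):
        tie = scores == v
        ranks[tie] = ranks[tie].mean()
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# The replication question is then whether
# auc_from_scores(y, scores_17gene) - auc_from_scores(y, scores_5000gene)
# stays positive on the independent cohort.
```

`sklearn.metrics.roc_auc_score` computes the same number; the hand-rolled version is shown only to make the statistic being compared explicit.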

Figures

Figures reproduced from arXiv: 2604.14334 by Aijing Feng, Pushpa Kumar Balan.

Figure 1. End-to-end neuro-symbolic pipeline. High-dimensional TCGA-BRCA RNA-seq is pro…
Figure 2. Performance Comparison: Accuracy, F1, and AUC metrics across experimental conditions.
Figure 3. Raw Gradient Saliency Heatmap for the top-50 genes in TCGA-BRCA samples, highlight…
Figure 4. Visualization of the Agentic Chain-of-Thought (CoT) process, mapping Mamba saliency to…
Original abstract

Gradient saliency from deep sequence models surfaces candidate biomarkers efficiently, but the resulting gene lists can be contaminated by tissue-composition confounders that degrade downstream classifiers. We study whether LLM chain-of-thought (CoT) reasoning can filter these confounders, and whether reasoning quality is associated with downstream performance. We train a Mamba SSM on TCGA-BRCA RNA-seq and extract the top-50 genes by gradient saliency; DeepSeek-R1 evaluates every candidate with structured CoT to produce a final 17-gene set. On the held-out test split, the raw 50-gene saliency set (no LLM) performs worse than a 5,000-gene variance baseline (AUC 0.832 vs. 0.903), while the LLM-filtered set surpasses it (AUC 0.927), using 294x fewer features. A faithfulness audit (COSMIC CGC, OncoKB, PAM50) shows that 6 of 17 selected genes (35.3%) are validated BRCA biomarkers, while 10 of 16 known BRCA genes present in the input were missed - including FOXA1. This divergence between downstream performance and reasoning faithfulness suggests selective faithfulness in this setting: targeted confounder removal can improve predictive performance without comprehensive recall.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that gradient saliency from a Mamba SSM trained on TCGA-BRCA RNA-seq data yields a top-50 gene set whose downstream classifier underperforms a 5,000-gene variance baseline (AUC 0.832 vs 0.903), but that filtering the 50 candidates via DeepSeek-R1 chain-of-thought reasoning to a 17-gene set removes tissue-composition confounders and raises held-out AUC to 0.927 while using 294x fewer features. A faithfulness audit against COSMIC CGC, OncoKB and PAM50 shows 6/17 selected genes are validated BRCA biomarkers and 10/16 known genes are missed, which the authors interpret as evidence of selective faithfulness.

Significance. If substantiated, the hybrid Mamba-plus-LLM pipeline offers a concrete route to improve both predictive performance and feature parsimony in high-dimensional transcriptomic classification, and the selective-faithfulness observation supplies a testable hypothesis about when LLM reasoning can usefully prune confounders without requiring exhaustive recall.

major comments (3)
  1. [Abstract / Results] Abstract and Results: the claim that the AUC lift from 0.832 to 0.927 is caused by LLM removal of tissue-composition confounders lacks direct supporting evidence; no comparison is reported between the 33 excluded genes and the 17 retained genes on any confounder proxy (e.g., correlation with ESTIMATE stromal/immune scores, batch PCs, or infiltration signatures).
  2. [Methods] Methods: the manuscript supplies neither the exact LLM prompt template, the structured CoT filtering criteria, the precise train/validation/test split ratios and random seed, nor any statistical test for the reported AUC differences, rendering the central performance and faithfulness claims non-reproducible from the given description.
  3. [Results] Results / Faithfulness audit: the audit reports 6/17 overlap with known BRCA genes and 10/16 misses, yet provides no enrichment test or differential analysis showing that the excluded genes are enriched for confounders relative to the kept set; without this, the selective-faithfulness interpretation remains an untested post-hoc explanation.
minor comments (1)
  1. [Results] Table or figure captions should explicitly state the number of features in the variance baseline and whether AUC confidence intervals or DeLong tests were computed.
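The comparison requested in major comment 1 is straightforward to specify. A sketch with simulated data (the real inputs would be TCGA-BRCA expression plus ESTIMATE stromal/immune scores; the 17/33 split matches the paper's kept/excluded counts):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 300

# Simulated confounder proxy (in practice: ESTIMATE stromal/immune score,
# batch principal components, or an infiltration signature per sample).
stromal = rng.standard_normal(n_samples)

# Simulated expression: 17 "kept" genes independent of the proxy,
# 33 "excluded" genes deliberately contaminated by it.
kept = rng.standard_normal((n_samples, 17))
excluded = 0.6 * stromal[:, None] + rng.standard_normal((n_samples, 33))

def abs_corr(score, expr):
    """|Pearson r| between a per-sample score and each gene column."""
    s = score - score.mean()
    e = expr - expr.mean(axis=0)
    r = (s @ e) / (np.linalg.norm(s) * np.linalg.norm(e, axis=0))
    return np.abs(r)

kept_r, excl_r = abs_corr(stromal, kept), abs_corr(stromal, excluded)
print(f"median |r|: kept={np.median(kept_r):.2f}, excluded={np.median(excl_r):.2f}")
```

If the confounder-removal story holds, the genes the LLM actually excluded should show systematically higher |r| against the proxy than the kept genes, which a rank test (e.g. Mann-Whitney) can formalize.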

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve reproducibility and strengthen the supporting evidence.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: the claim that the AUC lift from 0.832 to 0.927 is caused by LLM removal of tissue-composition confounders lacks direct supporting evidence; no comparison is reported between the 33 excluded genes and the 17 retained genes on any confounder proxy (e.g., correlation with ESTIMATE stromal/immune scores, batch PCs, or infiltration signatures).

    Authors: We agree that the manuscript currently lacks direct empirical comparisons between the excluded and retained gene sets on confounder proxies. In the revision we will add these analyses, computing correlations of both gene sets with ESTIMATE stromal/immune scores, batch principal components, and infiltration signatures to provide concrete support for the confounder-removal interpretation. revision: yes

  2. Referee: [Methods] Methods: the manuscript supplies neither the exact LLM prompt template, the structured CoT filtering criteria, the precise train/validation/test split ratios and random seed, nor any statistical test for the reported AUC differences, rendering the central performance and faithfulness claims non-reproducible from the given description.

    Authors: We acknowledge that the current description omits these details. The revised manuscript will include the full DeepSeek-R1 prompt template and structured CoT criteria, the exact train/validation/test split ratios together with the random seed used, and statistical comparisons of AUC values (e.g., DeLong test with p-values and confidence intervals). revision: yes
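A paired bootstrap over the shared held-out samples is one way to quantify the AUC gap. This is a stand-in for the DeLong test named in the response, and everything below is simulated and illustrative:

```python
import numpy as np

def auc(y, s):
    """Probability a positive outscores a negative (ties count half)."""
    pos, neg = s[y == 1], s[y == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def bootstrap_auc_diff(y, s_a, s_b, n_boot=2000, seed=0):
    """95% paired-bootstrap CI for AUC(A) - AUC(B) on the same test set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.full(n_boot, np.nan)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        yb = y[idx]
        if 0 < yb.sum() < n:  # resample must contain both classes
            diffs[i] = auc(yb, s_a[idx]) - auc(yb, s_b[idx])
    return tuple(np.nanpercentile(diffs, [2.5, 97.5]))

# Illustration on simulated scores where model A is genuinely better:
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 100)
s_a = y + 0.3 * rng.standard_normal(200)   # informative scores
s_b = rng.standard_normal(200)             # uninformative scores
lo, hi = bootstrap_auc_diff(y, s_a, s_b)
print(f"AUC difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```

A CI excluding zero would support the claimed 0.927 vs 0.903 advantage; DeLong's test gives an analytic equivalent without resampling.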

  3. Referee: [Results] Results / Faithfulness audit: the audit reports 6/17 overlap with known BRCA genes and 10/16 misses, yet provides no enrichment test or differential analysis showing that the excluded genes are enriched for confounders relative to the kept set; without this, the selective-faithfulness interpretation remains an untested post-hoc explanation.

    Authors: The referee correctly identifies the absence of an enrichment or differential analysis. We will add such tests in the revision (e.g., differential correlation with confounder proxies or gene-set enrichment comparing the two gene sets) to evaluate whether excluded genes are enriched for confounders and thereby ground the selective-faithfulness claim in quantitative evidence. revision: yes
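The enrichment question in this exchange can already be framed from the audit's own counts. A one-sided hypergeometric (Fisher's exact) tail, standard library only: of 50 candidates, 16 are known BRCA genes; the LLM kept 17 candidates, 6 of them known.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): upper tail of the hypergeometric distribution, i.e. a
    one-sided Fisher's exact test for over-representation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Counts from the paper's faithfulness audit.
p = hypergeom_sf(k=6, N=50, K=16, n=17)
print(f"P(>=6 known BRCA genes in a random 17-of-50 draw) = {p:.3f}")
```

The chance expectation is 17 · 16 / 50 ≈ 5.4 known genes, so an overlap of 6 sits close to what random selection from the candidate pool would give; this is precisely the untested-enrichment gap the referee flags.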

Circularity Check

0 steps flagged

No circularity: empirical pipeline with held-out evaluation and external audits

full rationale

The paper describes an empirical workflow: train Mamba-SSM on TCGA-BRCA RNA-seq, rank genes by gradient saliency, apply LLM CoT filtering to produce a 17-gene subset, then evaluate AUC on a held-out test split against a variance baseline and audit selected genes against COSMIC CGC, OncoKB, and PAM50. No equations, derivations, or first-principles claims appear in the provided text. Performance numbers (AUC 0.832 vs. 0.903 vs. 0.927) are computed on independent test data; the faithfulness audit uses external databases. No self-citations, fitted parameters renamed as predictions, or self-definitional reductions are present. The central claim therefore rests on observable downstream metrics rather than any input-to-output equivalence by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on two empirical thresholds and two assumptions, one about gradient saliency and one about LLM reasoning quality; no new physical entities are introduced.

free parameters (2)
  • top-50 genes cutoff
    Arbitrary initial selection threshold for saliency candidates before LLM filtering
  • final 17-gene set size
    Outcome of the LLM CoT filtering step rather than a pre-specified number
axioms (2)
  • domain assumption Gradient saliency from a trained Mamba-SSM on RNA-seq data surfaces candidate biomarkers
    Invoked when extracting the initial top-50 list
  • ad hoc to paper LLM chain-of-thought reasoning can accurately identify and remove tissue-composition confounders
    Core premise of the filtering step that produces the final 17-gene set

pith-pipeline@v0.9.0 · 5529 in / 1654 out tokens · 70855 ms · 2026-05-10T11:23:35.914169+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 9 canonical work pages

  1. [1]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752

  2. [2]

    LLM-Select: Feature Selection with Large Language Models

    Daniel P. Jeong, Zachary C. Lipton, and Pradeep Ravikumar. Llm-select: Feature selection with large language models, 2025. URL https://arxiv.org/abs/2407.02694

  3. [3]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwel...

  4. [4]

    Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

    Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, and Li Shen. Knowledge-driven feature selection and engineering for genotype data with large language models, 2025. URL https://arxiv.org/abs/2410.01795

  5. [5]

    Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes

    Joel S. Parker, Michael Mullins, Maggie C.U. Cheang, Leung S., Voduc D., Vickery T., Davies S., Fauron C., He X., Hu Z., Quackenbush J.F., Stijleman I.J., Palazzo J., Marron J.S., Nobel A.B., Mardis E., Nielsen T.O., Ellis M.J., Perou C.M., and Bernard P.S. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncolog...

  6. [6]

    A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction

    Nicholas Pudjihartono, Tayaza Fadason, Andreas W. Kempa-Liehr, and Justin M. O'Sullivan. A review of feature selection methods for machine learning-based disease risk prediction. Frontiers in Bioinformatics, 2:927312, 2022. doi:10.3389/fbinf.2022.927312. URL https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2022.927312/full

  7. [7]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. URL https://arxiv.org/abs/1312.6034

  8. [8]

    Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023. URL https://arxiv.org/abs/2305.04388

  9. [9]

    LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization

    Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, and Robert Tibshirani. Llm-lasso: A robust framework for domain-informed feature selection and regularization, 2025. URL https://arxiv.org/abs/2502.10648
