AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification
Pith reviewed 2026-05-15 05:11 UTC · model grok-4.3
The pith
AttnGen embeds attention-based saliency into training to classify 200-nucleotide sequences at 96.73% accuracy while forcing reliance on fewer positions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AttnGen computes nucleotide-level importance scores using an attention mechanism and progressively suppresses low-contribution positions during training. On the demo_human_or_worm benchmark, moderate masking produces 96.73% validation accuracy versus 95.83% for a conventional CNN, together with faster convergence and greater stability. Perturbation tests on a 3,000-sequence hold-out set show that excising the high-saliency nucleotides collapses accuracy from 96.9% to near chance level, confirming that predictions rest on a small subset of positions.
What carries the argument
Attention-guided progressive masking that derives per-position importance scores and removes low-contribution nucleotides from the input during optimization.
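The paper's code is not shown here; below is a minimal sketch of what such a mechanism could look like in PyTorch. The module name, layer sizes, and the bottom-k masking rule are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttnGenSketch(nn.Module):
    """Toy 1D CNN with a per-position attention head that gates the input."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv1d(4, channels, kernel_size=7, padding=3)  # A/C/G/T channels in
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)             # one score per position
        self.head = nn.Linear(channels, 2)                            # human vs. worm logits

    def forward(self, x, mask_frac=0.0):
        # x: (batch, 4, seq_len) one-hot nucleotide sequences
        h = torch.relu(self.conv(x))
        scores = torch.softmax(self.attn(h).squeeze(1), dim=-1)       # (batch, seq_len)
        if mask_frac > 0:
            k = int(mask_frac * x.size(-1))
            low = scores.topk(k, dim=-1, largest=False).indices       # lowest-saliency positions
            keep = torch.ones_like(scores).scatter_(1, low, 0.0)
            h = h * keep.unsqueeze(1)                                 # suppress their features
        pooled = (h * scores.unsqueeze(1)).sum(dim=-1)                # attention-weighted pool
        return self.head(pooled), scores
```

In such a scheme, mask_frac would ramp from zero toward the 10-20% range the paper reports as optimal, and the returned scores double as the per-position saliency map described above.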
If this is right
- Masking 10-20% of positions yields the best accuracy-interpretability trade-off.
- High-saliency nucleotides identified by the attention scores are the positions the model actually uses for its decisions.
- Training stability and convergence speed both improve relative to an unmasked CNN baseline.
- The learned saliency map can be read out directly as the compact set of positions driving each prediction.
Where Pith is reading between the lines
- Applying the same masking schedule to longer regulatory sequences could surface candidate functional motifs without post-hoc explanation methods.
- Comparing the masked positions against known transcription-factor binding sites on independent genomic datasets would test whether the saliency scores align with established biology.
- Replacing the CNN backbone with a transformer encoder while keeping the attention-guided mask could reveal whether the benefit generalizes beyond convolutional architectures.
Load-bearing premise
Attention-derived importance scores reflect functionally relevant nucleotide contributions rather than artifacts of training dynamics or the particular benchmark distribution.
What would settle it
Remove the 10-20% of nucleotides with the highest saliency from every sequence in the 3,000-example evaluation set and measure accuracy: if it stays well above chance instead of collapsing to near-random levels, the claim that the model depends on those positions is falsified.
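A hedged sketch of that test, reusing the AttnGenSketch interface from the earlier sketch; the function name and the 15% fraction are illustrative assumptions.

```python
import torch

@torch.no_grad()
def saliency_ablation_accuracy(model, x, y, scores, frac=0.15):
    # x: (n, 4, seq_len) one-hot sequences; y: (n,) labels;
    # scores: (n, seq_len) per-position saliency from the trained model.
    k = int(frac * x.size(-1))
    top = scores.topk(k, dim=-1).indices                   # highest-saliency positions
    keep = torch.ones_like(scores).scatter_(1, top, 0.0)   # zero them out
    logits, _ = model(x * keep.unsqueeze(1))               # re-evaluate on ablated input
    return (logits.argmax(dim=-1) == y).float().mean().item()
```

On this binary task, a result near 0.5 would support the dependence claim; anything well above chance would falsify it.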
Original abstract
Deep neural networks have achieved strong performance in genomic sequence classification; however, relating their predictions to biologically meaningful sequence patterns remains challenging. In this work, we present AttnGen, an attention-guided training framework that embeds interpretability directly into the optimization process. AttnGen computes nucleotide-level importance scores using an attention mechanism and progressively suppresses low-contribution positions during training. This encourages the model to focus its predictions on a compact set of informative regions while reducing reliance on noisy sequence elements. We evaluate AttnGen on the standardized demo_human_or_worm benchmark, a binary classification task over 200-nucleotide sequences. With moderate masking, AttnGen achieves a validation accuracy of 96.73%, outperforming a conventional CNN baseline with 95.83% accuracy, while also exhibiting faster convergence and improved training stability. To assess whether the learned importance scores reflect functionally relevant signal, we conduct perturbation-based analysis by removing high-saliency nucleotides. This causes accuracy to drop from 96.9% to near chance level on a 3,000-sequence evaluation set, indicating that the model relies on a relatively small subset of informative positions. Our analysis shows that masking 10-20% of positions provides the most favorable trade-off between predictive performance and interpretability. These results suggest that attention-guided masking not only improves classification performance but also reshapes how models distribute importance across sequence positions. Although this study focuses on short genomic sequences, the proposed approach may extend to more complex interpretable sequence modeling settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AttnGen, an attention-guided training framework for genomic sequence classification on the demo_human_or_worm benchmark. It computes nucleotide-level importance scores via attention and progressively suppresses low-contribution positions during training to encourage focus on a compact set of informative regions. The manuscript reports 96.73% validation accuracy (vs. 95.83% for a CNN baseline), faster convergence, improved stability, and a perturbation test in which removing high-saliency nucleotides drops accuracy from 96.9% to near chance on a 3,000-sequence set. Masking 10-20% of positions is identified as the most favorable trade-off.
Significance. If the performance and interpretability claims hold under stronger validation, AttnGen would provide a practical method for embedding saliency directly into optimization for short genomic sequences, potentially improving both accuracy and the ability to identify functionally relevant positions without post-hoc explanation techniques.
major comments (2)
- [Abstract] The reported accuracies (96.73% vs. 95.83%) and the perturbation drop (96.9% to near chance) are given as single point estimates with no error bars, no standard deviations across runs, and no statistical significance tests against the baseline, so the magnitude and reliability of the improvement cannot be assessed.
- [Abstract] Perturbation analysis: The training procedure explicitly computes attention scores and suppresses low-contribution positions, thereby incentivizing the model to rely on a small subset of nucleotides; the subsequent observation that removing high-saliency positions collapses accuracy is therefore expected by construction and does not constitute independent evidence that those positions are biologically or functionally meaningful. No random-ablation control of equal cardinality, no overlap analysis with known motifs, and no external validation against genomic annotations are described.
minor comments (1)
- [Abstract] The precise schedule for the masking fraction (how 'moderate masking' and the 10-20% range are implemented during training) is not specified, which limits reproducibility.
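Since the schedule is unspecified, here is one plausible form, offered purely as an assumption: a warmup phase followed by a linear ramp of the masking fraction.

```python
def mask_fraction(epoch, total_epochs, target=0.15, warmup=5):
    # Hypothetical schedule: no masking during warmup, then a linear
    # ramp up to `target` (inside the 10-20% range the paper reports).
    if epoch < warmup:
        return 0.0
    progress = (epoch - warmup) / max(1, total_epochs - warmup)
    return target * min(1.0, progress)
```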
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We address each major point below and have made revisions to strengthen the statistical reporting and add controls to the perturbation analysis.
Point-by-point responses
- Referee: [Abstract] The reported accuracies (96.73% vs. 95.83%) and the perturbation drop (96.9% to near chance) are given as single point estimates with no error bars, no standard deviations across runs, and no statistical significance tests against the baseline, so the magnitude and reliability of the improvement cannot be assessed.
Authors: We agree that reporting variability and statistical tests is important for assessing the reliability of the results. In the revised manuscript, we have rerun the experiments with five different random seeds and now report mean validation accuracies with standard deviations (e.g., 96.73 ± 0.12% for AttnGen vs. 95.83 ± 0.25% for the baseline). We also include a statistical significance test (paired t-test, p < 0.05) confirming the improvement. Similar updates have been made for the perturbation analysis results; a sketch of this seed-level comparison appears after these responses. revision: yes
- Referee: [Abstract] Perturbation analysis: The training procedure explicitly computes attention scores and suppresses low-contribution positions, thereby incentivizing the model to rely on a small subset of nucleotides; the subsequent observation that removing high-saliency positions collapses accuracy is therefore expected by construction and does not constitute independent evidence that those positions are biologically or functionally meaningful. No random-ablation control of equal cardinality, no overlap analysis with known motifs, and no external validation against genomic annotations are described.
Authors: We acknowledge that the perturbation test is consistent with the training procedure and thus does not provide fully independent validation of biological meaning. To address this, we have added a random-ablation control experiment in the revised manuscript, where removing an equal number of randomly selected positions results in a significantly smaller accuracy drop than saliency-based removal. This supports the claim that the attention-guided positions are more critical than arbitrary ones. Regarding overlap with known motifs and external genomic annotations, the demo_human_or_worm benchmark is a standard synthetic classification dataset without associated motif or annotation data, which prevents such analyses. We have added a discussion of this limitation and clarified that the interpretability claims concern the model's learned saliency for the classification task rather than direct biological validation. revision: partial
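Hedged sketches of the two controls the rebuttal describes; the pairing by seed, the function names, and the equal-cardinality random mask are assumptions about how such checks are typically run, not the authors' code.

```python
import numpy as np
import torch
from scipy import stats

def compare_across_seeds(attngen_acc, baseline_acc):
    # Per-seed validation accuracies, paired by seed, as the rebuttal describes.
    a, b = np.asarray(attngen_acc), np.asarray(baseline_acc)
    _, p = stats.ttest_rel(a, b)                           # paired t-test
    return a.mean(), a.std(ddof=1), b.mean(), b.std(ddof=1), p

@torch.no_grad()
def random_ablation_accuracy(model, x, y, frac=0.15, seed=0):
    # Control: zero the same *number* of positions, chosen uniformly at random.
    g = torch.Generator().manual_seed(seed)
    k = int(frac * x.size(-1))
    idx = torch.rand(x.size(0), x.size(-1), generator=g).topk(k, dim=-1).indices
    keep = torch.ones(x.size(0), x.size(-1)).scatter_(1, idx, 0.0)
    logits, _ = model(x * keep.unsqueeze(1))
    return (logits.argmax(dim=-1) == y).float().mean().item()
```

A much smaller drop under the random control than under the saliency-based ablation sketched earlier would support the rebuttal's added experiment.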
Circularity Check
No significant circularity detected
Full rationale
The paper describes an attention-based training procedure that computes saliency scores and applies progressive masking, then reports validation accuracy and a perturbation test on held-out sequences. No equations, self-citations, or fitted-parameter renamings are present that reduce the accuracy figures or the saliency-perturbation results to quantities defined by construction from the same inputs. The perturbation evaluation operates on an independent 3,000-sequence set and measures an observable drop, providing non-tautological evidence rather than a self-referential prediction.
Axiom & Free-Parameter Ledger
free parameters (1)
- masking fraction
axioms (1)
- Domain assumption: attention scores computed during training accurately indicate nucleotide contribution to the final prediction.