Wisteria: A Unified Multi-Scale Feature Learning Framework for DNA Language Model
Pith reviewed 2026-05-08 10:59 UTC · model grok-4.3
The pith
Wisteria unifies multi-scale feature learning in a Mamba-based DNA language model via gated convolutions, MLPs, and Fourier attention, showing strong benchmark performance on genomic tasks with short and long-range dependencies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Wisteria effectively unifies local and global dependency modeling for multi-scale genomic sequence analysis and demonstrates strong performance on downstream benchmarks against competitive DNA language model baselines.
Load-bearing premise
That augmenting Mamba with gated dilated convolutions, gated MLPs, and Fourier attention actually captures the interplay between local motifs and global dependencies better than existing methods, as claimed in the abstract without supporting experimental details.
Original abstract
DNA language model aims to decipher the regulatory grammar and semantic of genomes by capturing long range dependencies in DNA sequences. Existing methods emphasize long range token interactions but often ignore the interplay between local motifs and global dependencies. In this paper, we propose Wisteria, a genomic language model that integrates multi scale feature learning within a unified framework for DNA sequence. Specifically, Wisteria augments the Mamba based architecture with gated dilated convolutions to capture local motifs and regulatory patterns, while gated multilayer perceptrons refine global dependencies. We further introduce a Fourier based attention mechanism to support frequency domain modeling, periodic extension and length generalization. Across four experimental settings with both short and long range dependencies, Wisteria demonstrates strong performance on downstream benchmarks against competitive DNA language model baselines. These results indicate that Wisteria effectively unifies local and global dependency modeling for multi scale genomic sequence analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Wisteria, a unified multi-scale feature learning framework for DNA language models. It augments the Mamba architecture with gated dilated convolutions to capture local motifs and regulatory patterns, gated multilayer perceptrons to refine global dependencies, and a Fourier-based attention mechanism for frequency domain modeling, periodic extension, and length generalization. The model is evaluated across four experimental settings involving short and long range dependencies and claims strong performance against competitive DNA language model baselines.
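Since the manuscript, as summarized above, supplies no equations, the following is only a minimal sketch of what a generic gated dilated convolution over one-hot DNA looks like. It is not the authors' implementation: the function names, the kernel size of 3, the 8 output channels, the sigmoid gate, and the random weights are all assumptions chosen for illustration.

```python
import math
import random

BASES = "ACGT"

def one_hot_dna(seq):
    """Encode an A/C/G/T string as a list of 4-element one-hot rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def dilated_conv1d(x, w, dilation):
    """Zero-padded 'same' 1D convolution.

    x: list of L rows with C_in channels each.
    w: kernel indexed as w[tap][c_in][c_out], with an odd number of taps.
    """
    taps, c_in, c_out = len(w), len(w[0]), len(w[0][0])
    assert taps % 2 == 1, "an odd kernel size keeps the output aligned"
    half = taps // 2
    out = []
    for t in range(len(x)):
        row = [0.0] * c_out
        for j in range(taps):
            # Tap j reads the input (j - half) * dilation positions away.
            src = t + (j - half) * dilation
            if 0 <= src < len(x):
                for ci in range(c_in):
                    for co in range(c_out):
                        row[co] += x[src][ci] * w[j][ci][co]
        out.append(row)
    return out

def gated_dilated_conv(x, w_value, w_gate, dilation):
    """Elementwise product of a value branch and a sigmoid gate branch."""
    value = dilated_conv1d(x, w_value, dilation)
    gate = dilated_conv1d(x, w_gate, dilation)
    return [[v / (1.0 + math.exp(-g)) for v, g in zip(vr, gr)]
            for vr, gr in zip(value, gate)]

def random_kernel(rng, taps, c_in, c_out):
    """Hypothetical stand-in for learned weights."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(c_out)] for _ in range(c_in)]
            for _ in range(taps)]

rng = random.Random(0)
x = one_hot_dna("ACGTACGTTTGACA")
w_value = random_kernel(rng, 3, 4, 8)
w_gate = random_kernel(rng, 3, 4, 8)
y = gated_dilated_conv(x, w_value, w_gate, dilation=2)
print(len(y), len(y[0]))  # 14 8
```

With a kernel of 3 taps and a dilation of 2, each output position sees only the inputs at offsets -2, 0, and +2, which is why such layers are usually associated with local motifs; how Wisteria combines them with the Mamba backbone is not stated in the material reviewed here.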
Significance. If the empirical results and architectural contributions hold after proper validation, the work would advance DNA language modeling by explicitly targeting the interplay between local motifs and global dependencies, an area where existing methods fall short. The Fourier attention component for length generalization could prove useful for variable-length genomic sequences, and the overall unification approach offers a scalable alternative to pure transformer or state-space models in regulatory genomics.
major comments (3)
- Abstract: The central claim of 'strong performance' and effective unification of local-global modeling is presented without any quantitative results, error bars, data splits, ablation tables, or baseline comparisons, rendering the performance attribution unverifiable and load-bearing for the paper's contribution.
- Experimental section: No ablation or component-isolation experiments are described to confirm that gains arise from the interplay captured by gated dilated convolutions, gated MLPs, and Fourier attention rather than from increased capacity, training schedule, or baseline differences; this directly undermines the unification claim.
- Methods: No equations, diagrams, or implementation details are supplied for how the three augmentations are integrated into the Mamba backbone or how they interact within the unified framework, preventing assessment of whether the multi-scale modeling is genuinely achieved.
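The component-isolation experiments requested in the second major comment amount to evaluating every on/off combination of the three augmentations. A minimal sketch of enumerating that grid follows; the component names and the `describe` helper are hypothetical labels, not identifiers from the paper, and the sketch does not address capacity matching, which would need separate parameter accounting.

```python
from itertools import product

# Hypothetical labels for the three augmentations named in the abstract.
COMPONENTS = ("gated_dilated_conv", "gated_mlp", "fourier_attention")

def ablation_grid(components=COMPONENTS):
    """Return every on/off combination of the given components."""
    grid = []
    for flags in product((False, True), repeat=len(components)):
        grid.append(dict(zip(components, flags)))
    return grid

def describe(config):
    """Human-readable name for one ablation configuration."""
    enabled = [name for name, on in config.items() if on]
    return "mamba" + "".join(" + " + name for name in enabled)

grid = ablation_grid()
for config in grid:
    print(describe(config))
print(len(grid))  # 8
```

The first entry is the plain Mamba backbone and the last is the full model, so the grid covers both the single-component and the interaction cases the referee asks about.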
minor comments (2)
- The abstract would be strengthened by including at least one key quantitative result or comparison metric to ground the performance claims.
- Consider adding an architecture diagram and pseudocode for the module integrations to improve clarity of the proposed framework.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, outlining the revisions we will make to address the concerns raised.
- Referee: Abstract: The central claim of 'strong performance' and effective unification of local-global modeling is presented without any quantitative results, error bars, data splits, ablation tables, or baseline comparisons, rendering the performance attribution unverifiable and load-bearing for the paper's contribution.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will add concise summaries of performance metrics (e.g., accuracy or F1 improvements on the four experimental settings), baseline comparisons, and experimental details such as data splits, while respecting abstract length constraints. This will make the central claims immediately verifiable.
  Revision: yes
- Referee: Experimental section: No ablation or component-isolation experiments are described to confirm that gains arise from the interplay captured by gated dilated convolutions, gated MLPs, and Fourier attention rather than from increased capacity, training schedule, or baseline differences; this directly undermines the unification claim.
  Authors: We acknowledge that ablation studies are essential to isolate the contributions of each component and validate the unification claim. We will add a new subsection with ablation experiments in the revised manuscript, including tables that systematically remove or modify the gated dilated convolutions, gated MLPs, and Fourier attention. These will compare against capacity-matched baselines and varied training schedules to demonstrate that performance gains arise from the multi-scale feature learning approach.
  Revision: yes
- Referee: Methods: No equations, diagrams, or implementation details are supplied for how the three augmentations are integrated into the Mamba backbone or how they interact within the unified framework, preventing assessment of whether the multi-scale modeling is genuinely achieved.
  Authors: We will expand the Methods section in the revised manuscript to include explicit mathematical equations for each augmentation (gated dilated convolutions, gated MLPs, and Fourier attention), a detailed architectural diagram showing their integration into the Mamba backbone, and implementation specifics such as interaction mechanisms, hyperparameter settings, and pseudocode. This will provide a clear description of how the components achieve unified multi-scale modeling.
  Revision: yes
Circularity Check
No circularity: empirical architecture proposal with no derivations or self-referential reductions
The paper presents Wisteria as an architectural augmentation of Mamba using gated dilated convolutions for local motifs, gated MLPs for global dependencies, and Fourier attention for frequency modeling. No equations, first-principles derivations, or predictions appear in the abstract or described content. Claims of unification and strong performance rest entirely on downstream benchmark results against baselines, which are independent empirical observations rather than reductions to fitted inputs, self-definitions, or self-citation chains. No load-bearing steps match any of the enumerated circularity patterns; the work is self-contained as a model proposal plus evaluation.