Recognition: unknown
Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Pith reviewed 2026-05-07 09:08 UTC · model grok-4.3
The pith
Sigmoid attention replaces softmax in single-cell foundation models to deliver better cell-type separation, faster training, and greater stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sigmoid attention, used as a direct replacement for softmax inside the attention layers, produces superior representations on single-cell data while accelerating training and removing sources of instability. On six diverse datasets it improves cell-type separation by 25 percent, raises cohesion metrics, and lowers validation loss. Models train up to 10 percent faster. In stress tests, 160-million-parameter bidirectional models trained on 8,000-token sequences without gradient clipping diverge under softmax attention, with gradients exploding by four orders of magnitude, whereas sigmoid attention stays stable. These outcomes are attributed to sigmoid attention having globally bounded derivatives (at most 0.25) and a diagonal Jacobian, in contrast with softmax's dense coupling between tokens.
What carries the argument
Sigmoid attention computes attention weights with the elementwise sigmoid function instead of softmax. This yields derivatives globally bounded by 0.25 and a diagonal Jacobian structure, removing the dense coupling between tokens that softmax's row-wise normalization creates.
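For concreteness, a minimal sketch of the substitution in plain PyTorch, assuming standard scaled dot-product attention; this is not the paper's fused TritonSigmoid kernel, and the optional bias term is a convention some sigmoid-attention formulations use, not necessarily this paper's.

```python
# Minimal sketch (PyTorch), not the paper's TritonSigmoid kernel:
# softmax attention vs. sigmoid attention as a drop-in replacement.
import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)   # rows sum to 1 -> dense coupling across tokens
    return weights @ v

def sigmoid_attention(q, k, v, bias=0.0):
    # Same scores, but each weight is squashed independently by the sigmoid,
    # so there is no row-wise normalization coupling the tokens.
    # Some sigmoid-attention formulations add a negative bias (e.g. -log(seq_len));
    # whether this paper does so is not stated in the review above.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.sigmoid(scores + bias)
    return weights @ v

if __name__ == "__main__":
    q = torch.randn(2, 4, 128, 64)
    k = torch.randn(2, 4, 128, 64)
    v = torch.randn(2, 4, 128, 64)
    print(softmax_attention(q, k, v).shape, sigmoid_attention(q, k, v).shape)
```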
If this is right
- Cell-type separation and cohesion improve on multiple single-cell datasets, leading to more accurate downstream biological analyses.
- Training time decreases by up to 10 percent for the same model size and data volume.
- Models remain stable during training on long sequences without needing gradient clipping or other stabilizers.
- The open-sourced TritonSigmoid kernel delivers 515 TFLOPS on H100 GPUs and handles native padding for biological sequences (see the sketch after this list).
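Because cells express different numbers of genes, batched sequences carry padding tokens; the sketch below illustrates what native padding support has to accomplish in an unfused formulation. The masking convention shown here is an assumption for illustration, not the TritonSigmoid API.

```python
# Sketch of padding-aware sigmoid attention for variable-length sequences
# (e.g. cells with different numbers of expressed genes). This illustrates
# what "native padding support" must achieve; it is not the TritonSigmoid API.
import math
import torch

def padded_sigmoid_attention(q, k, v, key_padding_mask):
    # key_padding_mask: (batch, seq) bool, True where the token is padding.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Zero out attention to padded keys. Because sigmoid weights need not sum
    # to one, masked positions can simply receive zero weight; no renormalization
    # over the remaining tokens is required, unlike softmax.
    mask = key_padding_mask[:, None, None, :]   # broadcast over heads and queries
    weights = torch.sigmoid(scores).masked_fill(mask, 0.0)
    return weights @ v

# Hypothetical batch: two cells with 100 and 60 expressed-gene tokens, padded to 128.
lengths = torch.tensor([100, 60])
pad_mask = torch.arange(128)[None, :] >= lengths[:, None]   # (2, 128) bool
q = k = v = torch.randn(2, 4, 128, 64)
print(padded_sigmoid_attention(q, k, v, pad_mask).shape)  # torch.Size([2, 4, 128, 64])
```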
Where Pith is reading between the lines
- The same bounded-gradient property could reduce the need for learning-rate schedules or clipping in other sequence-modeling tasks that use long or variable-length inputs.
- Because the kernel supports irregular padding, similar efficiency gains may appear when applying the approach to other domains with sparse or gapped sequences.
- If the diagonal Jacobian reduces coupling, larger models might be trained with fewer synchronization steps across GPUs.
Load-bearing premise
The observed gains in representation quality, speed, and stability will continue to hold for other single-cell foundation-model architectures, larger parameter counts, and different biological sequence types without further changes.
What would settle it
Train a 500-million-parameter single-cell model on an unseen dataset with 16,000-token sequences and no gradient clipping; if the sigmoid version still converges while the softmax version diverges with exploding gradients, the stability claim is supported.
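A sketch of how such a stress test could be instrumented, assuming a model that returns its training loss and a loader yielding (inputs, targets) pairs; the 10^4 divergence threshold mirrors the four-orders-of-magnitude figure quoted above, and everything else here is illustrative rather than the paper's protocol.

```python
# Illustrative stress-test harness: train with NO gradient clipping and watch
# the global gradient norm. Divergence is flagged when the norm exceeds 1e4x
# its first-step baseline; model, loader, and threshold are assumptions.
import torch

def global_grad_norm(model):
    norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms)).item()

def stress_test(model, loader, steps=10_000, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    baseline = None
    for step, (inputs, targets) in zip(range(steps), loader):
        opt.zero_grad()
        loss = model(inputs, targets)   # assumed: model returns its training loss
        loss.backward()                 # deliberately no clip_grad_norm_
        gnorm = global_grad_norm(model)
        if baseline is None:
            baseline = gnorm
        if gnorm > 1e4 * baseline or not torch.isfinite(loss):
            return {"diverged": True, "step": step, "grad_norm": gnorm}
        opt.step()
    return {"diverged": False, "final_grad_norm": gnorm}
```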
Original abstract
Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop in replacement for softmax attention a) produces better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss, b) faster training, models with sigmoid attention train up to 10% faster than their softmax counterparts, and c) more stable training by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives ($\leq 0.25$) as opposed to softmax, and a diagonal Jacobian structure in contrast with softmax's dense coupling, which together help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes sigmoid attention as a drop-in replacement for softmax attention in single-cell foundation models. It claims improved representations (25% higher cell-type separation, better cohesion metrics, and lower validation loss on six datasets), up to 10% faster training, and greater stability from globally bounded derivatives (≤0.25) and a diagonal Jacobian (versus softmax's dense row-wise coupling). These are backed by a stress test on 160M-parameter models trained without clipping on 8K sequences (where softmax gradients explode by four orders of magnitude but sigmoid remains stable) and an open-sourced TritonSigmoid kernel achieving 515 TFLOPS on H100 GPUs with native padding support.
Significance. If the empirical gains and stability properties hold under broader scrutiny, the work offers a practically useful alternative attention mechanism for biological foundation models, where long sequences and training stability are critical. The combination of theoretical analysis (bounded derivatives and Jacobian structure), controlled stress testing, consistent gains across multiple datasets, and a reproducible high-performance kernel implementation strengthens the contribution and enables direct adoption or extension.
minor comments (4)
- The abstract states that sigmoid attention yields '25% higher cell-type separation' and 'better cell-type cohesion metrics' across six datasets, but does not specify the exact metrics (e.g., silhouette score, ARI), baseline values, or statistical tests used to establish the improvement; adding these details in the results section would strengthen the empirical claims (an illustrative sketch of such metrics follows this list).
- The stress-test description reports gradient explosion 'by four orders of magnitude' for softmax but does not provide the precise gradient-norm values, sequence-length distribution, or number of training steps at which divergence occurred; including these in a dedicated subsection or table would improve reproducibility.
- The TritonSigmoid kernel is reported to reach 515 TFLOPS and outperform FlashAttention-2 and FlashSigmoid, yet no table compares throughput, memory usage, or padding overhead across different sequence lengths and batch sizes; such a comparison would clarify the practical advantage for biological data.
- The manuscript mentions 'six diverse single-cell datasets' but provides no table listing dataset names, sizes, or preprocessing steps; adding this information would help readers assess the generality of the reported gains.
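To make the first comment concrete, a sketch of the kind of metrics it asks for, using the referee's own examples (silhouette width on annotated cell types, and ARI between an unsupervised clustering and the labels). The paper's actual metric choices are not specified in this review, and KMeans stands in here for the graph-based clustering that benchmarking protocols such as scIB typically use.

```python
# Illustrative cell-type separation metrics on learned embeddings.
# Which exact metrics the paper reports is not specified above; silhouette
# and ARI are the referee's own examples. Assumes `emb` is an (n_cells, d)
# embedding matrix and `labels` the annotated cell types.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def separation_metrics(emb: np.ndarray, labels: np.ndarray, seed: int = 0) -> dict:
    n_types = len(np.unique(labels))
    # Silhouette: how tightly cells sit inside their annotated type vs. the nearest other type.
    sil = silhouette_score(emb, labels)
    # ARI: agreement between an unsupervised clustering of the embedding and the labels.
    clusters = KMeans(n_clusters=n_types, random_state=seed, n_init=10).fit_predict(emb)
    ari = adjusted_rand_score(labels, clusters)
    return {"silhouette": sil, "ari": ari}
```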
Simulated Author's Rebuttal
We thank the referee for their positive summary, significance assessment, and recommendation for minor revision. The report accurately captures the core contributions of sigmoid attention as a theoretically grounded and empirically effective replacement for softmax in single-cell foundation models. No specific major comments were raised in the report.
Circularity Check
No significant circularity detected
full rationale
The paper's claims rest on explicit mathematical properties of the sigmoid and softmax functions (globally bounded derivatives ≤0.25 and diagonal vs. dense Jacobian) plus independent empirical measurements across six datasets and controlled stress tests on 160M-parameter models. No step reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the theoretical distinctions are standard calculus results and the performance numbers are externally measured outcomes.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: the derivative of the sigmoid function is globally bounded by 0.25 and its Jacobian is diagonal (derivation below)
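A short derivation behind this ledger entry, using only standard calculus; the notation (score row $s$, softmax row $p$) is ours.

```latex
% Sigmoid acts elementwise on the score row s, so its Jacobian is diagonal
% and every entry is bounded, since x(1-x) <= 1/4 on [0, 1]:
\frac{\partial\, \sigma(s_i)}{\partial s_j}
  = \sigma(s_i)\bigl(1 - \sigma(s_i)\bigr)\,\delta_{ij}
  \;\le\; \tfrac{1}{4}\,\delta_{ij}.

% Softmax normalizes across the row, so every output depends on every score,
% giving the dense Jacobian referred to above:
p_i = \frac{e^{s_i}}{\sum_k e^{s_k}}, \qquad
\frac{\partial p_i}{\partial s_j} = p_i\,(\delta_{ij} - p_j)
\quad\Longleftrightarrow\quad
J_{\mathrm{softmax}} = \operatorname{diag}(p) - p\,p^{\top}.
```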