pith. machine review for the scientific record.

arxiv: 2604.27124 · v1 · submitted 2026-04-29 · 💻 cs.LG · q-bio.QM

Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

Pith reviewed 2026-05-07 09:08 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords: sigmoid attention · single-cell data · foundation models · attention mechanisms · training stability · biological sequences · transformer models

The pith

Sigmoid attention replaces softmax in single-cell foundation models to deliver better cell-type separation, faster training, and greater stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that swapping the standard softmax attention for a sigmoid version inside transformer models improves how well the models learn from single-cell data. Across six different datasets, the change produces 25 percent higher cell-type separation scores, stronger cohesion within cell types, and lower validation loss. Training runs up to 10 percent faster and remains stable even when models are pushed to 160 million parameters on 8,000-token sequences without gradient clipping. The authors trace the stability gain to two properties: sigmoid attention keeps all derivatives globally bounded by 0.25, and its Jacobian is diagonal rather than densely coupled like softmax's. They also release a fast GPU kernel that supports the irregular padding common in biological sequences.

Core claim

Sigmoid attention, used as a direct replacement for softmax inside the attention layers, produces superior representations on single-cell data while accelerating training and removing sources of instability. On six diverse datasets it improves cell-type separation by 25 percent, raises cohesion metrics, and lowers validation loss. Models train up to 10 percent faster. In stress tests of 160-million-parameter bidirectional models trained on 8,000-token sequences without gradient clipping, softmax attention diverges with gradients exploding by four orders of magnitude, whereas sigmoid attention stays stable. These outcomes follow from sigmoid attention having globally bounded derivatives (at most 0.25) and a diagonal Jacobian, in contrast with softmax's dense row-wise coupling.

What carries the argument

Sigmoid attention, which computes attention weights with the sigmoid function instead of softmax, yields globally bounded derivatives (at most 0.25) and a diagonal Jacobian structure that removes the dense coupling between tokens introduced by softmax's row normalization.
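
To make the swap concrete, here is a minimal reference sketch in PyTorch. This is not the paper's TritonSigmoid kernel, only the underlying math; the -log(n) score bias is a convention borrowed from earlier sigmoid-attention work and is our assumption, not a detail confirmed by this paper.

```python
import math
import torch

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    # Row normalization couples all seq_len weights in a row (dense Jacobian).
    return torch.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v):
    # Drop-in replacement: same shapes, elementwise weights instead of a
    # normalized distribution.
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    # -log(n) bias (assumed): keeps total row mass near 1 at initialization.
    weights = torch.sigmoid(scores - math.log(n))
    # Each weight depends only on its own score, so the weight Jacobian is
    # diagonal with every entry bounded by sigma'(x) <= 0.25.
    return weights @ v
```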

If this is right

  • Cell-type separation and cohesion improve on multiple single-cell datasets, leading to more accurate downstream biological analyses.
  • Training time decreases by up to 10 percent for the same model size and data volume.
  • Models remain stable during training on long sequences without needing gradient clipping or other stabilizers.
  • An open-sourced Triton kernel delivers 515 TFLOPS on H100 hardware and handles native padding for biological sequences (see the masking sketch after this list).
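
Why elementwise weights make padding cheap: under softmax, masking a key changes every other weight in the row through the normalization, whereas under sigmoid each weight is independent, so padded keys can simply be zeroed. A minimal sketch, with our own names and shapes rather than the released kernel's API:

```python
import math
import torch

def sigmoid_attention_padded(q, k, v, key_pad_mask):
    # q, k, v: (batch, heads, n, head_dim); key_pad_mask: (batch, n), True = pad
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    weights = torch.sigmoid(scores - math.log(n))  # -log(n) bias assumed
    # Zeroing padded keys is a pure elementwise multiply; no softmax-style
    # renormalization over the surviving keys is required.
    weights = weights * (~key_pad_mask)[:, None, None, :]
    return weights @ v
```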

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same bounded-gradient property could reduce the need for learning-rate schedules or clipping in other sequence-modeling tasks that use long or variable-length inputs.
  • Because the kernel supports irregular padding, similar efficiency gains may appear when applying the approach to other domains with sparse or gapped sequences.
  • If the diagonal Jacobian reduces coupling, larger models might be trained with fewer synchronization steps across GPUs.

Load-bearing premise

The observed gains in representation quality, speed, and stability will continue to hold for other single-cell foundation-model architectures, larger parameter counts, and different biological sequence types without further changes.

What would settle it

Train a 500-million-parameter single-cell model on an unseen dataset with 16,000-token sequences and no gradient clipping; if the sigmoid version still converges while the softmax version diverges with exploding gradients, the stability claim is supported.
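
A sketch of the measurement such a test hinges on: the global gradient norm, logged every step with clipping deliberately absent. The divergence threshold and names here are ours; the paper's optimizer, data, and schedule are not reproduced.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    # L2 norm over all parameter gradients; the quantity reported to explode
    # by roughly four orders of magnitude when softmax diverges.
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().pow(2).sum().item()
    return total ** 0.5

# In the training loop, note what is deliberately absent: no call to
# torch.nn.utils.clip_grad_norm_.
#   loss.backward()
#   norm = global_grad_norm(model)            # log alongside loss each step
#   diverged = norm > 1e4 * early_run_norm    # hypothetical 4-orders criterion
#   optimizer.step(); optimizer.zero_grad()
```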

Figures

Figures reproduced from arXiv: 2604.27124 by Georgios Dasoulas, Judith Mueller, Soumya Ghosh, Vijay Sadashivaiah.

Figure 1. Sequence length distribution in the CellxGene pretraining dataset. The histogram (left y-axis, gray bars) shows the distribution of sequence lengths across cells in the dataset, revealing the high variability characteristic of biological data. The cumulative distribution function (right y-axis, black line) indicates the fraction of sequences below each length. Vertical dashed lines mark common context-length thresholds…

Figure 2. TFLOPS comparison across attention implementations. Performance across head dimensions (64, 128), padding levels (0%, 25%), and forward/backward passes. Rows 1–2: no padding, all implementations compared. Rows 3–4: 25% padding, FlashSigmoid unavailable (no padding support). TritonSigmoid matches or exceeds FlashSigmoid without padding and outperforms FlashAttention-2 across all configurations. Error bars…

Figure 3. End-to-end training compute cost. Projected GPU hours to process 131.6M samples across four model sizes (160M, 400M, 600M, 1.4B) and three context lengths (2K, 4K, 8K), derived from measured throughput on 16 H100 GPUs (batch size 32; 2 samples/GPU). Sigmoid is consistently faster than softmax, with the gap increasing at longer contexts (e.g., 1.4B model: 2.1% at 2K, 4.0% at 4K, 7.5% at 8K), reflecting the…

Figure 4. Training convergence and validation performance across datasets. Left: training loss over the full CellxGene dataset for four models (2K/4K context × sigmoid/softmax). All models converge smoothly, with sigmoid 4K achieving the lowest final loss. Right: six subplots showing validation loss per dataset. Each subplot displays four measurements: softmax (red) and sigmoid (blue) at 2K (circles) and 4K (squares) con…

Figure 5. SCIB biological conservation metrics at 4K context. Scatter plots comparing sigmoid (y-axis) vs. softmax (x-axis) across six datasets. Each panel shows a different metric: Leiden NMI (clustering agreement with ground-truth labels), Leiden ARI (pairwise clustering accuracy), silhouette label (cell-type cohesion), and aggregate Bio Conservation (average of the three components). Points above the diagonal favo…

Figure 6. Softmax divergence vs. sigmoid stability under stress-test conditions. Training dynamics for identical 160M-parameter models (8K context, no gradient clipping). Left: training loss. Softmax initially learns (10→3 over 40K steps) but diverges catastrophically at step 55.6K (loss→10+); sigmoid maintains a stable, monotonic decrease throughout 80K steps. Middle: global gradient norm (log scale). Softmax…

Figure 7. UMAP visualization for the Heart OFT dataset at 4K context. Left: softmax attention. Right: sigmoid attention. Embeddings from the sigmoid model cluster better; for instance, endothelial cells (purple) are clustered closer together by sigmoid, while softmax yields two separate clusters.

Figure 8. Kernel execution time comparison. GPU kernel latency (milliseconds) across head dimensions (64, 128), padding levels (0%, 25%), and forward/backward passes. Rows 1–2: no padding. Rows 3–4: 25% padding, FlashSigmoid unavailable (no padding support). Standard attention excluded (60–85% slower than optimized kernels). TritonSigmoid achieves the lowest latency in most configurations, particularly with padding. Sha…
Original abstract

Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop in replacement for softmax attention a) produces better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss, b) faster training, models with sigmoid attention train up to 10% faster than their softmax counterparts, and c) more stable training by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives ($\leq 0.25$) as opposed to softmax, and a diagonal Jacobian structure in contrast with softmax's dense coupling, which together help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient clipping on 8K-token sequences, softmax diverges catastrophically, with gradients exploding by four orders of magnitude, while sigmoid remains stable. Finally, we implement and open-source TritonSigmoid, an efficient GPU kernel that achieves 515 TFLOPS on H100 GPUs, outperforming both FlashAttention-2 and FlashSigmoid, with native padding support, which is essential for biological sequences. Our results establish sigmoid attention as both theoretically grounded and empirically superior for biological foundation models. Code is available at https://github.com/MSDLLCpapers/triton-sigmoid

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript proposes sigmoid attention as a drop-in replacement for softmax attention in single-cell foundation models. It claims improved representations (25% higher cell-type separation, better cohesion metrics, and lower validation loss on six datasets), up to 10% faster training, and greater stability from globally bounded derivatives (≤0.25) and a diagonal Jacobian (versus softmax's dense row-wise coupling). These are backed by a stress test on 160M-parameter models trained without clipping on 8K sequences (where softmax gradients explode by four orders of magnitude but sigmoid remains stable) and an open-sourced TritonSigmoid kernel achieving 515 TFLOPS on H100 GPUs with native padding support.

Significance. If the empirical gains and stability properties hold under broader scrutiny, the work offers a practically useful alternative attention mechanism for biological foundation models, where long sequences and training stability are critical. The combination of theoretical analysis (bounded derivatives and Jacobian structure), controlled stress testing, consistent gains across multiple datasets, and a reproducible high-performance kernel implementation strengthens the contribution and enables direct adoption or extension.

minor comments (4)
  1. The abstract states that sigmoid attention yields '25% higher cell-type separation' and 'better cell-type cohesion metrics' across six datasets, but does not specify the exact metrics (e.g., silhouette score, ARI), baseline values, or statistical tests used to establish the improvement; adding these details in the results section would strengthen the empirical claims.
  2. The stress-test description reports gradient explosion 'by four orders of magnitude' for softmax but does not provide the precise gradient-norm values, sequence-length distribution, or number of training steps at which divergence occurred; including these in a dedicated subsection or table would improve reproducibility.
  3. The TritonSigmoid kernel is reported to reach 515 TFLOPS and outperform FlashAttention-2 and FlashSigmoid, yet no table compares throughput, memory usage, or padding overhead across different sequence lengths and batch sizes; such a comparison would clarify the practical advantage for biological data.
  4. The manuscript mentions 'six diverse single-cell datasets' but provides no table listing dataset names, sizes, or preprocessing steps; adding this information would help readers assess the generality of the reported gains.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, significance assessment, and recommendation for minor revision. The report accurately captures the core contributions of sigmoid attention as a theoretically grounded and empirically effective replacement for softmax in single-cell foundation models. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's claims rest on explicit mathematical properties of the sigmoid and softmax functions (globally bounded derivatives ≤0.25 and diagonal vs. dense Jacobian) plus independent empirical measurements across six datasets and controlled stress tests on 160M-parameter models. No step reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the theoretical distinctions are standard calculus results and the performance numbers are externally measured outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard mathematical properties of sigmoid and softmax functions for its stability arguments and does not introduce new free parameters, axioms beyond basic calculus, or invented entities; all claims build on existing neural network components.

axioms (1)
  • standard math: the derivative of the sigmoid function is globally bounded by 0.25, and, applied elementwise, its Jacobian is diagonal.
    Invoked to explain reduced training instability compared to softmax.
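
Written out, the calculus this axiom leans on, in our own notation (standard results, not quoted from the paper):

```latex
% Sigmoid derivative: maximized at x = 0, where sigma(0) = 1/2.
\[
\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) \le \tfrac{1}{4}
\]
% Jacobian of a softmax row p = softmax(s), s in R^n: dense coupling.
\[
\frac{\partial p_i}{\partial s_j} = p_i\,(\delta_{ij} - p_j)
\]
% Jacobian of elementwise sigmoid weights: diagonal, every entry <= 1/4.
\[
\frac{\partial\,\sigma(s_i)}{\partial s_j} = \delta_{ij}\,\sigma(s_i)\bigl(1-\sigma(s_i)\bigr)
\]
```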

pith-pipeline@v0.9.0 · 5551 in / 1391 out tokens · 70443 ms · 2026-05-07T09:08:50.259244+00:00 · methodology

