AdaSplash-2: Faster Differentiable Sparse Attention
Pith reviewed 2026-05-10 11:15 UTC · model grok-4.3
The pith
AdaSplash-2 computes the normalizer for alpha-entmax attention in 1-2 iterations using on-the-fly histograms, matching FlashAttention-2 training speed at high block sparsity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaSplash-2 addresses the overhead of computing the normalizer tau in alpha-entmax attention by constructing a coarse histogram of the attention scores on the fly and storing it in SRAM. The histogram supplies an accurate starting point that reduces the root-finding procedure to one or two iterations in practice. When this technique is paired with a sparsity-aware GPU implementation that skips zero blocks, both the forward and backward passes become competitive with or faster than FlashAttention-2 under moderate-to-high block sparsity.
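To make the mechanism concrete, the sketch below works through the α = 1.5 case in NumPy: the normalizer τ is the root of sum_i max(z_i / 2 - τ, 0)^2 = 1, a coarse histogram of the scores supplies a starting point near that root, and a safeguarded Newton iteration finishes the solve. This is a minimal illustration of the idea only, not the paper's fused Triton/SRAM kernel; the function name, bin count, and solver details are assumptions made for the sketch.

```python
# Minimal NumPy sketch (illustrative, not the paper's kernel) of
# histogram-initialized root-finding for the 1.5-entmax normalizer tau:
#   p_i = max(z_i / 2 - tau, 0)^2, with tau chosen so that sum_i p_i = 1.
import numpy as np

def entmax15_tau(z, n_bins=32, tol=1e-6, max_iter=50):
    s = z / 2.0
    lo, hi = s.max() - 1.0, s.max()              # tau always lies in [lo, hi]

    # Coarse histogram of the scores over the feasible interval; in the paper
    # this is built on the fly and kept in on-chip SRAM.
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(s, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Cheap residual estimate at every bin edge from counts and bin centers,
    # then start at the right-most edge where the estimate is still >= 0.
    gaps = np.maximum(centers[None, :] - edges[:, None], 0.0)
    approx = (counts * gaps ** 2).sum(axis=1) - 1.0
    candidates = edges[approx >= 0.0]
    tau = candidates.max() if candidates.size else lo

    def f(t):                                    # exact residual: decreasing, convex
        return np.sum(np.maximum(s - t, 0.0) ** 2) - 1.0

    def fprime(t):
        return -2.0 * np.sum(np.maximum(s - t, 0.0))

    if f(tau) < 0.0:                             # histogram start overshot the root,
        tau = lo                                 # fall back to the safe bracket end

    # Newton's method: starting left of the root it increases monotonically,
    # and with a good histogram start it needs only a few iterations.
    for it in range(1, max_iter + 1):
        r = f(tau)
        if abs(r) < tol:
            return tau, it
        tau -= r / fprime(tau)
    return tau, max_iter

z = np.random.default_rng(0).standard_normal(4096) * 4.0
tau, iters = entmax15_tau(z)
p = np.maximum(z / 2.0 - tau, 0.0) ** 2
print(iters, round(p.sum(), 6), int((p > 0).sum()))  # few iterations, sum ~ 1, sparse support
```

Calling the same solver with a trivial histogram (for example n_bins=1) effectively starts from the full bracket [max(s) - 1, max(s)] and typically needs several extra Newton steps, which is the overhead the initialization is meant to remove.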
What carries the argument
Histogram-based initialization of the normalizer tau, stored in on-chip SRAM and paired with a block-skipping GPU kernel.
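The sketch below illustrates the other half of that sentence: once τ is known for each query row, any key/value block whose scores all sit below τ receives exactly zero 1.5-entmax weight, so its matmul and accumulation can be skipped. Again this is a NumPy illustration rather than the paper's fused kernel; the block size, the 1/sqrt(d) scaling, and the assumption that taus comes from the same scaled scores (for example via entmax15_tau above) are choices made for the example.

```python
# Illustrative block-skipping sketch for 1.5-entmax attention (NumPy, single head).
# taus holds the per-row normalizer; in the real method it is found in the same
# streaming pass, here it is precomputed from the full logits for simplicity.
# Assumes entmax15_tau from the sketch above is in scope.
import numpy as np

def blockwise_entmax15_attention(Q, K, V, taus, block=64):
    n, d = Q.shape
    out = np.zeros((n, d))
    skipped = 0
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = (Q @ Kb.T) / np.sqrt(d) / 2.0        # s = z / 2 for 1.5-entmax
        alive = s.max(axis=1) > taus             # rows with any nonzero weight in this block
        if not alive.any():                      # whole block is zero: skip its work
            skipped += 1
            continue
        w = np.maximum(s[alive] - taus[alive, None], 0.0) ** 2
        out[alive] += w @ Vb                     # entmax weights already sum to 1 per row
    return out, skipped

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
Z = Q @ K.T / np.sqrt(64)                        # full logits, used here only to get tau
taus = np.array([entmax15_tau(z)[0] for z in Z])
out, skipped = blockwise_entmax15_attention(Q, K, V, taus)
```

The paper's kernel makes an analogous skip decision per tile with low overhead; the sketch only shows why dropping an all-zero block leaves the output unchanged.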
If this is right
- Per-step training time matches or beats FlashAttention-2 once block sparsity exceeds 60 percent.
- Models match softmax baselines on short-context tasks and improve over them on long-context downstream tasks.
- Input-dependent sparsity becomes practical for training without incurring quadratic cost penalties.
- Longer sequence lengths become more feasible because higher natural sparsity amplifies the speedup.
Where Pith is reading between the lines
- The histogram trick may transfer to other iterative normalizers used in sparse or entropic attention variants.
- Further gains could appear on hardware with larger on-chip memory or when sparsity patterns stabilize over training.
- The method could be combined with block-sparse kernels from other frameworks to widen the sparsity range where it wins.
Load-bearing premise
The histogram of attention scores stays accurate enough throughout training that the root-finding for the normalizer always converges in one or two iterations.
What would settle it
A timing experiment on long-context training runs in which the iteration count for tau routinely exceeds three and overall step time becomes slower than FlashAttention-2.
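A toy version of that probe can be run offline with the solver sketched under the core claim: drift the score distribution across simulated steps, record the iteration count of each histogram-initialized solve, and flag any stretch where it exceeds three. The drift schedule, sizes, and threshold below are arbitrary stand-ins; the real test would also time full training steps against FlashAttention-2.

```python
# Toy falsification probe: does the histogram-initialized solver ever need more
# than ~3 iterations as the score distribution drifts?  Assumes entmax15_tau
# from the earlier sketch is in scope; the drift schedule is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(2)
bad_steps = []
for step in range(200):
    scale = 1.0 + 0.05 * step               # stand-in for evolving score statistics
    z = rng.standard_normal(8192) * scale
    _, iters = entmax15_tau(z, n_bins=32)
    if iters > 3:
        bad_steps.append((step, iters))
print(f"steps needing more than 3 iterations: {len(bad_steps)} / 200")
```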
Original abstract
Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is $\alpha$-entmax attention, a differentiable sparse alternative to softmax that enables input-dependent sparsity yet has lagged behind softmax due to the computational overhead necessary to compute the normalizer $\tau$. In this paper, we introduce AdaSplash-2, which addresses this limitation through a novel histogram-based initialization that reduces the number of iterations needed to compute $\tau$ to typically 1--2. The key idea is to compute a coarse histogram of attention scores on the fly and store it in on-chip SRAM, yielding a more accurate initialization that enables fast forward and backward computation. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves per-step training time relative to FlashAttention-2 when block sparsity is moderate-to-high (e.g., $>$60\%), which often occurs at long-context lengths. On downstream tasks, models trained with our efficient $\alpha$-entmax attention match softmax baselines at short-context lengths and achieve substantial gains in long-context settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AdaSplash-2, which accelerates differentiable α-entmax sparse attention by using a novel on-the-fly histogram-based initialization of the normalizer τ. This reduces the iterative solver to typically 1-2 iterations in forward and backward passes by storing a coarse histogram of attention scores in SRAM. Combined with a sparsity-aware GPU kernel that skips zero blocks at low overhead, AdaSplash-2 is claimed to match or beat FlashAttention-2 wall-clock time per training step when block sparsity exceeds 60% (common at long contexts), while downstream models match softmax baselines at short contexts and show gains at long contexts.
Significance. If the speed and convergence claims hold under the reported conditions, this provides a practical path to input-dependent sparse attention that closes the efficiency gap with softmax, particularly for long-sequence training. The SRAM-resident histogram technique is a concrete engineering advance that could apply to other iterative normalizers; the work also supplies reproducible GPU kernels and downstream task results that strengthen its utility.
major comments (2)
- §3.2 (Histogram Initialization): The central claim that the coarse histogram yields 1-2 iterations 'on the fly' for both passes lacks any error bound, convergence-rate analysis, or scaling of approximation error with attention-score variance or sparsity level. This assumption is load-bearing for the headline result that per-step time matches or beats FlashAttention-2 above 60% block sparsity; without it the iteration count could rise and erase the reported parity.
- §4.3 (Ablation and Robustness): No experiments test iteration counts or wall-clock time when attention-score distributions exhibit high variance or evolve rapidly during training, nor is there a sensitivity analysis on histogram bin count. These omissions leave the 'typically 1-2 iterations across encountered distributions' assertion unverified and directly affect the reliability of the long-context speedup claim.
minor comments (3)
- Abstract and §1: The phrase 'substantial gains in long-context settings' is not accompanied by concrete task names or metrics; a cross-reference to the relevant tables and figures would improve clarity.
- Figure 3 caption: The sparsity levels and sequence lengths used for the timing curves should be stated explicitly rather than left to the main text.
- Notation in §2.2: The coarse histogram bin width is introduced without an explicit symbol; adding one would aid readability when the initialization is referenced later.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the two major comments point by point below. Both concerns are valid and we will revise the manuscript accordingly to strengthen the theoretical grounding and empirical validation of the histogram initialization.
Point-by-point responses
- Referee: §3.2 (Histogram Initialization): The central claim that the coarse histogram yields 1-2 iterations 'on the fly' for both passes lacks any error bound, convergence-rate analysis, or scaling of approximation error with attention-score variance or sparsity level. This assumption is load-bearing for the headline result that per-step time matches or beats FlashAttention-2 above 60% block sparsity; without it the iteration count could rise and erase the reported parity.
Authors: We agree that a formal analysis is needed to support the iteration-count claim. The current manuscript relies on empirical measurements across models and lengths, but does not derive error bounds. In the revised version we will add a new paragraph in §3.2 that (i) shows the histogram approximation error is bounded by O(1/B) for B bins under a Lipschitz assumption on the score distribution, (ii) provides a simple convergence-rate argument for the Newton solver initialized by the histogram, and (iii) discusses how the bound scales with score variance and block sparsity. These additions will directly address the load-bearing nature of the claim. revision: yes
- Referee: §4.3 (Ablation and Robustness): No experiments test iteration counts or wall-clock time when attention-score distributions exhibit high variance or evolve rapidly during training, nor is there a sensitivity analysis on histogram bin count. These omissions leave the 'typically 1-2 iterations across encountered distributions' assertion unverified and directly affect the reliability of the long-context speedup claim.
Authors: We acknowledge the gap in robustness testing. The existing ablations cover standard training regimes but do not explicitly stress high-variance or rapidly changing distributions. In the revised §4.3 we will add three new experiments: (1) controlled synthetic attention-score distributions with increasing variance, reporting iteration counts and wall-clock time; (2) iteration-count traces recorded every 100 steps during long-context training to verify stability as distributions evolve; and (3) a sensitivity sweep over bin counts (8–64) with corresponding iteration and runtime statistics. These results will be presented alongside the existing ablations. revision: yes
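As a rough offline preview of proposed experiments (1) and (3), the toy solver from the earlier sketch can be swept over score variances and the 8-64 bin range; the variance grid and sample counts are illustrative assumptions, not the revision's actual protocol.

```python
# Toy preview of the proposed robustness sweep: iteration counts of the
# histogram-initialized solver across score variances and bin counts.
# Assumes entmax15_tau from the earlier sketch is in scope.
import numpy as np

rng = np.random.default_rng(3)
for n_bins in (8, 16, 32, 64):
    for scale in (0.5, 2.0, 8.0):
        iters = [entmax15_tau(rng.standard_normal(4096) * scale, n_bins=n_bins)[1]
                 for _ in range(20)]
        print(f"bins={n_bins:2d}  score std={scale:4.1f}  "
              f"mean iters={np.mean(iters):.1f}  max={max(iters)}")
```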
Circularity Check
No significant circularity: algorithmic contribution with empirical timing claims
Full rationale
The paper introduces a histogram-based initialization for the α-entmax normalizer τ and a sparsity-aware GPU kernel. The central performance claim (matching FlashAttention-2 wall-clock time at >60% block sparsity) is presented as an empirical outcome of the new initialization reducing iterations to 1-2, not as a quantity derived by construction from fitted parameters or prior self-citations. No equations in the provided abstract or description reduce the reported speedups to inputs by definition, and the initialization method is described as an independent algorithmic technique rather than a renaming or self-referential fit. The speedup claims are therefore measured against an external baseline, FlashAttention-2, rather than following from the paper's own definitions.