BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
Pith reviewed 2026-05-16 22:17 UTC · model grok-4.3
The pith
BLASST skips attention blocks with a single fixed threshold on softmax statistics to accelerate LLM inference without training or accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BLASST achieves dynamic blocked attention sparsity by applying a fixed threshold to reused online softmax statistics, allowing the system to skip softmax, value block loads, and subsequent matrix multiplications for negligible attention blocks. The method supports all major attention variants, needs no training or precomputation, and uses an automated calibration procedure that identifies a simple inverse relationship between optimal threshold and context length. This calibration allows a single threshold for prefill and a single threshold for decode per model. Optimized kernels implement the skipping with negligible overhead, resulting in 1.52x speedup for prefill at 71.9% sparsity and 1.48
What carries the argument
BLASST's reuse of online softmax statistics with a fixed scalar threshold to detect and skip negligible attention blocks.
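The skip decision can be illustrated with a minimal pure-Python sketch for a single query vector. Since this page does not spell out the exact rule, the sketch assumes that a block is skipped when its maximum score falls more than ln(1/threshold) below the running maximum that online softmax already tracks. The function name `blocked_attention`, its parameters, and the demo data are hypothetical; the paper's kernels work on GPU tiles and may differ in detail.

```python
import math

def blocked_attention(q, keys, values, block_size, threshold=None):
    """Single-query attention with online softmax over key blocks.

    Assumed skip rule (not confirmed by the source): with 0 < threshold < 1,
    a block is skipped when block_max - running_max < ln(threshold).
    Returns (output_vector, number_of_skipped_blocks).
    """
    scale = 1.0 / math.sqrt(len(q))
    dim_v = len(values[0])
    running_max = -math.inf   # online softmax running maximum
    denom = 0.0               # online softmax running denominator
    acc = [0.0] * dim_v       # unnormalized output accumulator
    skipped = 0
    log_thr = math.log(threshold) if threshold is not None else None

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        # The score product is still computed for every block.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        block_max = max(scores)

        # The test reuses the running max; the first block is never skipped
        # because block_max - (-inf) is +inf.
        if log_thr is not None and block_max - running_max < log_thr:
            skipped += 1
            continue  # no exponentials, no value load, no weighted sum

        new_max = max(running_max, block_max)
        correction = math.exp(running_max - new_max)  # rescale earlier state
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(weights)
        v_blk = values[start:start + block_size]
        for d in range(dim_v):
            acc[d] = acc[d] * correction + sum(
                w * v[d] for w, v in zip(weights, v_blk))
        running_max = new_max

    return [a / denom for a in acc], skipped

# Hypothetical demo data: two blocks score far below the running maximum.
q = [1.0, 0.0]
keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-9.0, 0.0],
        [8.0, 0.0], [7.0, 0.0], [-12.0, 0.0], [-11.0, 0.0]]
values = [[float(i)] for i in range(8)]
dense, _ = blocked_attention(q, keys, values, block_size=2)
sparse, n_skipped = blocked_attention(q, keys, values, block_size=2,
                                      threshold=1e-3)
print(n_skipped, abs(dense[0] - sparse[0]))
```

In this toy setup the second and fourth blocks are skipped, and the sparse output stays close to the dense one because the skipped weights are tiny relative to the running maximum.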
If this is right
- Accelerates both prefill and decode phases on modern GPUs at over 70% sparsity levels.
- Applies without modification to MHA, GQA, MQA, and MLA attention variants.
- Requires only one threshold per phase per model because of the inverse relationship with context length.
- Integrates into existing frameworks via optimized kernels that add negligible latency.
- Preserves benchmark accuracy at the reported sparsity levels.
Where Pith is reading between the lines
- The single-threshold simplicity could reduce engineering effort needed to deploy sparse attention in production serving systems.
- Lower memory bandwidth from skipped blocks may enable longer contexts on the same hardware without increasing batch size.
- The observed inverse threshold-context relationship could be tested as a general calibration heuristic for other dynamic sparsity techniques.
- Combining the block-skipping logic with quantization or other kernel fusions might compound the speedups beyond the reported 1.5x.
Load-bearing premise
A single fixed threshold per model chosen via automated calibration maintains accuracy across varying context lengths, tasks, and model architectures without post-hoc adjustments or significant degradation.
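One way to read "a single threshold per phase" alongside the inverse relationship is that calibration produces one scale constant c, with the effective threshold at context length L taken as c / L. That reading is an assumption, not something this page states. The sketch below fits such a constant from (context length, best threshold) pairs; the helper names and the synthetic pairs are hypothetical and are not measurements from the paper.

```python
import math

def fit_inverse_scale(pairs):
    """Fit c in threshold ~= c / context_len by least squares in log space.

    `pairs` is an iterable of (context_len, best_threshold) tuples, for
    example from a per-length threshold sweep (not shown here). With the
    slope fixed at -1, the least-squares estimate of log(c) is the mean of
    log(threshold) + log(context_len).
    """
    logs = [math.log(t) + math.log(n) for n, t in pairs]
    return math.exp(sum(logs) / len(logs))

def threshold_for_length(scale, context_len):
    """Threshold implied by the fitted scale at a given context length."""
    return scale / context_len

# Synthetic pairs generated from c = 40; NOT measurements from the paper.
synthetic = [(n, 40.0 / n) for n in (8192, 16384, 32768, 65536)]
c = fit_inverse_scale(synthetic)
print(c, threshold_for_length(c, 131072))
```

Applying the fitted constant at a length outside the calibration range, as in the last line, is exactly the extrapolation that the premise above depends on.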
What would settle it
Significant accuracy drop on standard benchmarks when the calibrated threshold is applied to a substantially different context length, task, or model architecture than those used during calibration.
Original abstract
The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BLASST, a training-free dynamic sparse attention mechanism for LLMs that uses a single fixed scalar threshold (calibrated automatically per model for prefill and decode) to identify and skip negligible attention blocks. By reusing online softmax statistics, the method skips softmax computation, value block loads, and matrix multiplications, supporting MHA/GQA/MQA/MLA variants and delivering optimized GPU kernels. The central empirical claims are 1.52x prefill speedup at 71.9% sparsity and 1.48x decode speedup at 73.2% sparsity while preserving benchmark accuracy, together with an observed inverse relationship between optimal threshold and context length.
Significance. If the single-threshold calibration generalizes, BLASST would offer a low-overhead, hardware-friendly sparse attention primitive that removes training and pre-computation barriers present in prior work, directly addressing memory and compute bottlenecks in long-context inference. The reported speedups at high sparsity levels, combined with drop-in compatibility, would constitute a practically significant contribution to efficient LLM deployment.
major comments (2)
- [Abstract] The headline claim that a single fixed threshold per model (chosen via automated calibration) preserves accuracy across contexts, tasks, and architectures rests on an empirical inverse relationship between threshold and context length, yet no formal approximation-error bound or analysis of tail behavior in attention-score distributions is provided; without this, silent degradation on longer sequences or out-of-calibration tasks cannot be ruled out.
- [Abstract] The reported speedups and sparsity figures are given without accompanying details on the benchmark suite, number of runs, statistical significance tests, or exact exclusion criteria for block skipping, making it impossible to verify that accuracy preservation is robust rather than benchmark-specific.
minor comments (2)
- Clarify the precise interaction of the threshold with online softmax renormalization when blocks are skipped, including any edge cases for very short sequences or GQA/MQA head grouping.
- Provide pseudocode or a small worked example illustrating how the reused online statistics translate into the decision to skip a block.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript introducing BLASST. We address each of the major comments point by point below, providing clarifications and indicating revisions where the manuscript will be updated.
- Referee: [Abstract] The headline claim that a single fixed threshold per model (chosen via automated calibration) preserves accuracy across contexts, tasks, and architectures rests on an empirical inverse relationship between threshold and context length, yet no formal approximation-error bound or analysis of tail behavior in attention-score distributions is provided; without this, silent degradation on longer sequences or out-of-calibration tasks cannot be ruled out.
  Authors: We agree that providing a formal approximation-error bound would strengthen the theoretical foundation. Our method relies on empirical calibration showing an inverse relationship between the optimal threshold and context length, which we have validated across a range of lengths and tasks. We will revise the manuscript to include a more detailed analysis of the tail behavior of the attention-score distributions and discuss the implications for longer sequences. A rigorous formal bound, however, would require additional theoretical development that is outside the current scope, which is focused on practical implementation.
  Revision: partial
- Referee: [Abstract] The reported speedups and sparsity figures are given without accompanying details on the benchmark suite, number of runs, statistical significance tests, or exact exclusion criteria for block skipping, making it impossible to verify that accuracy preservation is robust rather than benchmark-specific.
  Authors: We appreciate this observation and will update the abstract to provide more context. The experiments were conducted on standard benchmarks including LongBench and other common evaluation suites for long-context LLMs. Results are averaged over multiple runs with different random seeds to ensure statistical reliability, and we apply block skipping when the maximum attention score within a block is below the threshold. We will also include references to the statistical tests performed, which showed no significant accuracy loss within the reported margins.
  Revision: yes
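The tail-behavior concern in the first exchange can be made concrete with a short back-of-the-envelope bound. It is an illustrative sketch, not a result from the paper, and it assumes a specific skip rule: a block is skipped when its maximum score lies more than ln(1/τ) below the running maximum at the time the block is visited. Here s_i are the scores for one query over N keys, S is the set of all keys in skipped blocks, v_i are the value vectors, o is the exact output, and ô is the output renormalized over the kept keys; all of these symbols are introduced only for this sketch.

```latex
% Assumed skip rule: block B is skipped when
%   max_{i in B} s_i - m_run < ln(tau),  with 0 < tau < 1,
% where m_run is the running maximum when B is visited.
\begin{align*}
m &= \max_{1 \le i \le N} s_i, \qquad
Z = \sum_{i=1}^{N} e^{s_i - m} \ge 1, \qquad
p_i = \frac{e^{s_i - m}}{Z}, \\
i \in S &\;\Longrightarrow\;
p_i \le e^{s_i - m} \le e^{s_i - m_{\mathrm{run}}} < \tau
\quad (\text{since } m_{\mathrm{run}} \le m \text{ and } Z \ge 1), \\
\varepsilon &= \sum_{i \in S} p_i \le \tau\,|S| \le \tau N, \\
\lVert o - \hat{o} \rVert &\le 2\,\varepsilon \max_i \lVert v_i \rVert .
\end{align*}
```

Under these assumptions, choosing τ ≤ ε₀/N keeps the dropped probability mass below a fixed budget ε₀, which is at least consistent with the reported inverse relationship between threshold and context length. The bound concerns a single attention output only and says nothing about downstream task accuracy.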
Circularity Check
No circularity: algorithmic skipping logic independent of calibrated threshold
The paper defines BLASST via an explicit algorithmic procedure that reuses online softmax statistics to apply a fixed scalar threshold for block skipping. The threshold itself is obtained from an automated calibration pass on data, but neither the sparsity level nor the reported speedups are derived from that scalar by construction; they are measured directly on optimized GPU kernels. No equations reduce the core claim to the calibration input, no self-citations or uniqueness theorems are invoked as load-bearing premises, and the method does not rename or smuggle prior results. The derivation chain is therefore self-contained as an engineering optimization with empirical validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- threshold
axioms (1)
- domain assumption: Negligible attention scores identified via online softmax statistics can be safely skipped without retraining or accuracy loss.
Forward citations
Cited by 5 Pith papers
- Runtime-Certified Bounded-Error Quantized Attention
  A tiered KV cache architecture computes per-head per-step error bounds on quantized attention and uses adaptive fallback to guarantee bounded or exact outputs relative to FP16 reference.
- Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
  RTPurbo exploits intrinsic sparsity in full-attention LLMs to achieve near-lossless sparse inference after only a few hundred training steps via retrieval-head identification and a lightweight token indexer.
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
- VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
  VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.