VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3
The pith
VFA pre-computes an approximation of the global maximum from key blocks to reduce vector operations in Flash Attention's online softmax.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for the remaining blocks to avoid repeated reductions and rescaling. This entirely avoids the conditional rescale operation in the update stage, and, when integrated with block-sparse skipping, forms VSA, which reduces both block count and per-block overhead while preserving exact attention computation.
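In pseudocode terms, the core claim amounts to making the rowmax reduction and rescale conditional on a freeze point in the online-softmax loop. The sketch below is an illustrative NumPy reconstruction for a single query row, not the paper's kernel; `m_init` and `freeze_after` are hypothetical parameter names standing in for VFA's pre-computed initialization and frozen-max cutoff.

```python
import numpy as np

def online_softmax_av(score_blocks, value_blocks, m_init=-np.inf, freeze_after=None):
    """FlashAttention-style streaming softmax(s) @ V for one query row.

    m_init: starting running maximum; VFA would pre-compute a cheap
            approximation of the global max here (hypothetical parameter).
    freeze_after: block index from which the running max is frozen, skipping
                  the rowmax reduction and conditional rescale (the VFA idea);
                  None reproduces the standard FlashAttention update.
    """
    m = m_init                       # running maximum
    l = 0.0                          # running normalizer
    acc = np.zeros(value_blocks[0].shape[1])
    for i, (s, v) in enumerate(zip(score_blocks, value_blocks)):
        if freeze_after is None or i < freeze_after:
            m_new = max(m, s.max())  # per-block rowmax reduction
            if m_new > m:            # conditional rescale chain
                scale = np.exp(m - m_new)
                acc *= scale
                l *= scale
                m = m_new
        p = np.exp(s - m)            # exponentials against (possibly frozen) max
        l += p.sum()
        acc += p @ v
    return acc / l
```

Note that in exact arithmetic even an underestimated frozen maximum leaves the output unchanged, since softmax is shift-invariant and the factor `exp(-m)` cancels between `acc` and `l`; the risk freezing introduces is purely finite-precision overflow, which is what m-initialization for middle-block maxima guards against.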
What carries the argument
The global maximum pre-computation from key-block summaries combined with sink-and-local block reordering and early freezing of the running maximum.
If this is right
- Stabilizes the running maximum early via sink and local reordering, reducing rowmax and rowsum reductions.
- Completely avoids the conditional rescale operation in the attention update stage.
- Achieves nearly two times speedup on the C8V32, C4V32, and C4V16 configurations relative to the C16V32 baseline.
- Projects up to six times speedup on C4V16, contingent on future hardware improvements to exponent capacity.
- Integrates with block-sparse skipping to reduce both block count and per-block overhead in VSA.
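The first two bullets can be made concrete with a toy traversal. The exact ordering below is an assumption; only "sink and local blocks first" comes from the paper. Visiting those blocks first lets the running maximum stabilize in a short prefix, so the middle blocks trigger no rowmax update or rescale. A minimal sketch with deterministic synthetic scores:

```python
import numpy as np

def sink_local_order(n_blocks):
    """Hypothetical traversal: visit the sink (first) and local (last) key
    blocks before the middle blocks; the precise order VFA uses may differ."""
    return [0, n_blocks - 1] + list(range(1, n_blocks - 1))

def rowmax_updates(score_blocks, order):
    """Count blocks that raise the running maximum, and how many of those
    land outside the two-block sink/local prefix (i.e., force a rescale
    after the max should already have stabilized)."""
    m, total, late = -np.inf, 0, 0
    for pos, i in enumerate(order):
        b = score_blocks[i].max()
        if b > m:
            m = b
            total += 1
            late += pos >= 2
    return total, late

# Deterministic toy scores: the sink block (index 0) and the local block
# (index 5) carry the largest logits, as attention-sink statistics suggest.
scores = [np.full(8, 5.0), np.full(8, 1.0), np.full(8, 2.0),
          np.full(8, 1.5), np.full(8, 0.5), np.full(8, 6.0)]
```

Under the natural order, the late update at the final block forces a rescale deep in the loop; under the sink/local order every update falls inside the prefix, so the middle blocks can run against a frozen maximum.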
Where Pith is reading between the lines
- Similar pre-computation of summary statistics could reduce vector overhead in other online normalization procedures used in large-scale training.
- Block reordering heuristics based on sink and local impact may extend to improve efficiency in alternative sparse attention patterns.
- Hardware designs that increase on-chip exponent handling would amplify the benefits of freezing the maximum early.
- Attention statistics showing intra-block heterogeneity suggest that finer-grained block summaries could further improve approximation accuracy.
Load-bearing premise
That the cheap approximation from key-block representations is accurate enough to initialize the running maximum and that reordering sink and local blocks allows the maximum to stabilize early without changing the final attention output.
What would settle it
Observing cases where the true maximum occurs in a middle block and m-initialization is disabled, producing attention scores that diverge from standard Flash Attention.
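This failure mode can be probed numerically: in exact arithmetic a frozen maximum only shifts the softmax, so divergence shows up through finite-precision overflow. A minimal sketch with synthetic scores (not the paper's data) in fp16, where a middle-block maximum frozen out without m-initialization overflows the exponentials:

```python
import numpy as np

# One query row, four key positions in two blocks. The true maximum (20.0)
# sits in the second ("middle") block; the running max is frozen after the
# first block WITHOUT m-initialization. Synthetic numbers, fp16 precision.
s = np.array([1.0, 2.0, 20.0, 3.0], dtype=np.float16)

m_frozen = s[:2].max()                  # 2.0 -- frozen too early
with np.errstate(over="ignore", invalid="ignore"):
    p_bad = np.exp(s - m_frozen)        # exp(18) overflows fp16 -> inf
    w_bad = p_bad / p_bad.sum()         # inf/inf -> nan attention weights

m_true = s.max()                        # what m-initialization would supply
p_ok = np.exp(s - m_true)               # all exponentials <= 1, finite
w_ok = p_ok / p_ok.sum()
```

The frozen run yields non-finite weights, while initializing to the true maximum keeps every exponential bounded by 1, which is the behavior the paper's m-initialization condition is meant to guarantee.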
Original abstract
FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Vector Relieved Flash Attention (VFA) that initializes the running maximum in online softmax using a cheap approximation derived from key-block representations, reorders key-block traversal to prioritize sink and local blocks, and freezes the maximum for subsequent blocks to reduce rowmax reductions and rescaling. It further combines this with block-sparse skipping (e.g., BLASST) to form VSA. The method claims to completely avoid the conditional rescale of FA4.0 while preserving exact attention, with empirical support from MMLU, MATH500, and attention statistics showing early stabilization of the maximum, failure of simple summaries due to intra-block heterogeneity, necessity of m-initialization for middle-block maxima, and speedups of nearly 2x (with potential 6x on future hardware) without accuracy loss.
Significance. If the empirical verification of exactness holds across diverse workloads, VFA/VSA would be a meaningful optimization for attention kernels on modern accelerators where vector/SIMD operations increasingly limit throughput as tensor-core matmul rates rise. The hardware-aware focus on avoiding rescale chains and the integration with existing sparse methods are practical strengths.
major comments (2)
- [Abstract] The claim of retaining exact online-softmax semantics while 'completely avoid[ing] the conditional rescale operation' rests on the key-block approximation plus sink/local reordering always capturing the global maximum or correctly triggering m-initialization for middle blocks. No derivation or bound is provided showing that this combination guarantees the true global max (as opposed to relying on an unproven statistical property of attention scores), and the frequency/overhead of m-initialization is not quantified, which directly affects whether vector operations are relieved in practice.
- [Abstract, evaluations] While attention statistics are cited to verify that 'sink and local reordering stabilizes the running maximum early' and that 'm-initialization is required when maxima appear in middle blocks,' the manuscript does not report concrete metrics (e.g., fraction of blocks triggering m-init, distribution of max locations across heads/layers, or per-configuration error in the initial approximation) that would allow assessment of whether the vector-relief benefit is robust or merely an average-case phenomenon.
minor comments (2)
- The projected 6x speedup for C4V16 under 'upcoming architecture improvements' is speculative and should be presented with clearer caveats or removed from the main claims.
- The baseline configurations (C16V32, C8V32, etc.) and hardware platform should be defined more explicitly in the experimental section to allow reproduction of the reported speedups.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the practical importance of addressing vector bottlenecks in attention kernels. We address each major comment below and have made revisions to strengthen the manuscript.
Point-by-point responses
- Referee: The claim of retaining exact online-softmax semantics while 'completely avoid[ing] the conditional rescale operation' rests on the key-block approximation plus sink/local reordering always capturing the global maximum or correctly triggering m-initialization for middle blocks. No derivation or bound is provided showing that this combination guarantees the true global max (as opposed to relying on an unproven statistical property of attention scores), and the frequency/overhead of m-initialization is not quantified, which directly affects whether vector operations are relieved in practice.
  Authors: Our method preserves exact semantics because the m-initialization step is triggered specifically for cases where the global maximum occurs in middle blocks, allowing the running maximum to be set to the true value before freezing. The key-block approximation and reordering are used to minimize updates, but exactness does not depend on the approximation being perfect. Attention statistics in the paper show early stabilization with reordering, and we have added quantification of m-initialization frequency and overhead in the revision, demonstrating that it is infrequent and that vector operations are still relieved in practice. Revision: yes.
- Referee: While attention statistics are cited to verify that 'sink and local reordering stabilizes the running maximum early' and that 'm-initialization is required when maxima appear in middle blocks,' the manuscript does not report concrete metrics (e.g., fraction of blocks triggering m-init, distribution of max locations across heads/layers, or per-configuration error in the initial approximation) that would allow assessment of whether the vector-relief benefit is robust or merely an average-case phenomenon.
  Authors: We concur that providing these concrete metrics would better demonstrate robustness. Accordingly, we have revised the manuscript to include the requested metrics: the fraction of blocks triggering m-init (reported per model and layer), the distribution of max locations, and approximation errors. These are now detailed in Section 4 and a new table, showing consistent behavior across heads and layers and confirming that the benefit is robust. Revision: yes.
- Still outstanding: a theoretical derivation or bound proving that the approximation and reordering always lead to correct triggering of m-initialization for the true global max, without relying on statistical properties of attention scores.
Circularity Check
No circularity: algorithmic modification with measured empirical outcomes
full rationale
The paper presents VFA as a set of concrete algorithmic changes (block-summary initialization of the running max, sink/local reordering, and subsequent freezing of the max) that are intended to reduce vector operations while preserving the online-softmax structure. All performance claims are stated as measured speedups on hardware, with accuracy checked on external benchmarks (MMLU, MATH500) and supported by attention statistics, rather than as derived predictions. No equations appear that equate a claimed result to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The approach is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Online softmax with running maximum and normalizer computes exact attention with linear memory
- ad hoc to paper Key-block representations provide a sufficiently accurate cheap approximation for the global maximum
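The first axiom is the standard online-normalizer identity of Milakov and Gimelshein [17], and it can be checked directly. A minimal single-row sketch:

```python
import numpy as np

def streaming_max_norm(x, block=4):
    """Stream x in blocks, maintaining a running maximum m and normalizer l
    such that l == sum(exp(x - m)) at every step (the recurrence of [17])."""
    m, l = -np.inf, 0.0                 # running maximum and normalizer
    for i in range(0, len(x), block):
        s = x[i:i + block]
        m_new = max(m, s.max())
        # rescale the old normalizer to the new max, then extend it
        l = l * np.exp(m - m_new) + np.exp(s - m_new).sum()
        m = m_new
    return m, l
```

The returned pair matches the one-shot maximum and softmax normalizer exactly, which is the linear-memory exactness property the paper's design inherits.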
Reference graph
Works this paper leans on
- [1] F. Abecassis, A. Agrusa, D. Ahn, J. Alben, S. Alborghetti, M. Andersch, S. Arayandi, A. Bjorlin, A. Blakeman, E. Briones, et al. Pretraining large language models with NVFP4. arXiv preprint arXiv:2509.25149, 2025.
- [2] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- [3] K. Alexandridis, V. Titopoulos, and G. Dimitrakopoulos. Flash-D: FlashAttention with hidden softmax division. arXiv preprint arXiv:2505.14201, 2025.
- [4] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [5] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
- [6] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [7] T. Dao et al. flash-attention: Fast and memory-efficient exact attention with IO-awareness. GitHub repository. https://github.com/Dao-AILab/flash-attention
- [8] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- [9] A. Hatamizadeh, G. Heinrich, H. Yin, A. Tao, J. M. Alvarez, J. Kautz, and P. Molchanov. FasterViT: Fast vision transformers with hierarchical attention. arXiv preprint arXiv:2306.06189, 2023.
- [10]
- [11] H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y. Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024.
- [12] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- [13] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
- [14] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [15]
- [16]
- [17] M. Milakov and N. Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
- [18]
- [19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [20] A. Roy, M. Saffar, A. Vaswani, and D. Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
- [21] J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.
- [22] N. Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
- [23]
- [24] A. Tseng, T. Yu, and Y. Park. Training LLMs with MXFP4. arXiv preprint arXiv:2502.20586, 2025.
- [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [26] J. Yuan, C. Shinn, K. Xu, J. Cui, G. Klimiashvili, G. Xiao, P. Zheng, B. Li, Y. Zhou, Z. Ye, et al. BLASST: Dynamic blocked attention sparsity via softmax thresholding. arXiv preprint arXiv:2512.12087, 2025.
- [27] T. Zadouri, M. Hoehnerbach, J. Shah, T. Liu, V. Thakkar, and T. Dao. FlashAttention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling. arXiv preprint arXiv:2603.05451, 2026.
- [28] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- [29] J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen. SpargeAttention: Accurate and training-free sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025.