pith. machine review for the scientific record.

arxiv: 2604.12798 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.AI

Recognition: unknown

VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Flash Attention · online softmax · vector operations · attention optimization · GPU kernels · sparse attention · kernel optimization

The pith

VFA pre-computes an approximation of the global maximum from key blocks to reduce vector operations in Flash Attention's online softmax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Vector Relieved Flash Attention (VFA) to address the growing bottleneck of vector reductions and rescalings in Flash Attention as matrix multiplications approach peak throughput. By initializing the running maximum with a cheap approximation from key-block representations, reordering the traversal to prioritize sink and local blocks, and freezing the maximum afterward, VFA avoids most per-tile rowmax updates and the conditional rescale operation. This preserves the exact online-softmax structure while cutting non-matmul latency. When combined with sparse methods into VSA, it further reduces both block count and per-block overhead. Evaluations on language model benchmarks confirm that the maximum stabilizes early with this approach, enabling nearly 2x speedups on modern hardware configurations.
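To make the bottleneck concrete, here is a minimal NumPy sketch of the standard FlashAttention-style online softmax that VFA modifies; the vector operations the paper targets are marked in comments, and the shapes and block size are illustrative rather than the paper's kernel configuration.

```python
# Minimal sketch of standard FlashAttention-style online softmax for one
# query block; illustrative only, not the paper's kernel.
import numpy as np

def online_softmax_attention(Q, K, V, block=64):
    """Exact attention softmax(Q K^T) V, streaming key/value blocks."""
    n = K.shape[0]
    m = np.full(Q.shape[0], -np.inf)           # running row maximum
    l = np.zeros(Q.shape[0])                   # running normalizer
    acc = np.zeros((Q.shape[0], V.shape[1]))   # unnormalized output
    for j in range(0, n, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T                            # matmul: tensor-core work
        m_new = np.maximum(m, S.max(axis=1))    # vector op: per-tile rowmax
        scale = np.exp(m - m_new)               # vector op: rescale chain
        p = np.exp(S - m_new[:, None])          # vector op: exponentials
        l = l * scale + p.sum(axis=1)           # vector op: rowsum reduction
        acc = acc * scale[:, None] + p @ Vj
        m = m_new
    return acc / l[:, None]
```

Every key block pays for the rowmax reduction and the rescale even once the running maximum has stabilized; that recurring per-tile vector cost is what VFA amortizes into a one-time pre-computation.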

Core claim

VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. This completely avoids the conditional rescale operation in the update stage, and when integrated with block-sparse skipping, forms VSA that reduces both block count and per-block overhead while maintaining exact attention computation.
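A sketch of how the three moves could compose, reconstructed from the abstract alone: the summary function (a blockwise absolute maximum standing in for the paper's sabsmax), the warm-up policy, and the sink/local counts are assumptions for illustration, not the paper's kernel.

```python
import numpy as np

def vfa_attention(Q, K, V, block=64, n_sink=1, n_local=2):
    """VFA-style online softmax: summary-initialized, reordered, frozen max."""
    n = K.shape[0]
    nb = (n + block - 1) // block
    # Cheap pre-computation: bound the row maximum from per-block key
    # summaries. The paper extracts k_repr_j = sabsmax(K_j); the blockwise
    # absolute maximum used here is our stand-in for that summary.
    k_repr = np.stack([np.abs(K[j * block:(j + 1) * block]).max(axis=0)
                       for j in range(nb)])              # (nb, d)
    m = (np.abs(Q) @ k_repr.T).max(axis=1)               # init running max
    # Reorder traversal: sink blocks first, then local blocks, then the rest.
    head = list(dict.fromkeys(list(range(n_sink))
                              + list(range(max(0, nb - n_local), nb))))
    order = head + [j for j in range(nb) if j not in head]
    l = np.zeros(Q.shape[0])
    acc = np.zeros((Q.shape[0], V.shape[1]))
    for step, j in enumerate(order):
        Kj, Vj = K[j * block:(j + 1) * block], V[j * block:(j + 1) * block]
        S = Q @ Kj.T
        if step < len(head):                 # warm-up: maximum may still move
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)        # rescale only during warm-up
            l *= scale
            acc *= scale[:, None]
            m = m_new
        # After warm-up m is frozen: no rowmax reduction, no rescale.
        p = np.exp(S - m[:, None])
        l += p.sum(axis=1)
        acc += p @ Vj
    return acc / l[:, None]
```

In exact arithmetic the output does not depend on the frozen m at all; its accuracy only matters for keeping exponents in range in low-precision kernels, which is exactly the load-bearing premise examined below.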

What carries the argument

The global maximum pre-computation from key-block summaries combined with sink-and-local block reordering and early freezing of the running maximum.

If this is right

  • Stabilizes the running maximum early via sink and local reordering, reducing rowmax and rowsum reductions.
  • Completely avoids the conditional rescale operation in the attention update stage.
  • Achieves nearly 2x speedup on the C8V32, C4V32, and C4V16 configurations relative to the C16V32 baseline.
  • Projects up to a 6x speedup on C4V16, contingent on future hardware improvements to exponent capacity.
  • Integrates with block-sparse skipping to reduce both block count and per-block overhead in VSA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar pre-computation of summary statistics could reduce vector overhead in other online normalization procedures used in large-scale training.
  • Block reordering heuristics based on sink and local impact may extend to improve efficiency in alternative sparse attention patterns.
  • Hardware designs that increase on-chip exponent handling would amplify the benefits of freezing the maximum early.
  • Attention statistics showing intra-block heterogeneity suggest that finer-grained block summaries could further improve approximation accuracy.

Load-bearing premise

That the cheap approximation from key-block representations is accurate enough to initialize the running maximum and that reordering sink and local blocks allows the maximum to stabilize early without changing the final attention output.
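The premise can be unpacked in two lines: softmax is algebraically invariant to the subtracted shift, so freezing m cannot change the output in exact arithmetic; the danger is numerical, since an m below the true row maximum pushes exponentials out of the representable range in low precision. A minimal demonstration, assuming this numerical-stability reading is the intended exactness argument:

```python
import numpy as np

def softmax_shifted(x, shift):
    e = np.exp(x - shift)
    return e / e.sum()

rng = np.random.default_rng(0)
s = rng.normal(size=8)

# Any shift yields identical probabilities in exact arithmetic.
print(np.allclose(softmax_shifted(s, s.max()),
                  softmax_shifted(s, s.max() - 5.0)))     # True

# In low precision, a frozen m below the true rowmax overflows:
# float16 tops out near 6.5e4, and exp(12) already exceeds that.
s16 = (s + 12.0).astype(np.float16)
print(np.exp(s16 - np.float16(0.0)))  # inf entries where the exponent overflows
```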

What would settle it

Observing cases where the true maximum occurs in a middle block while m-initialization is not triggered, producing attention scores that diverge from standard Flash Attention.
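One concrete way to run that test, in the spirit of Figures 5a-5c: compute the paper's block maximum m̃_i^(j) = rowmax(S_ij) for every key block and count the query rows whose global maximum lands outside the sink and local blocks. A hypothetical diagnostic (not the paper's code), which also yields the m-init frequency metric the referee requests below:

```python
import numpy as np

def middle_max_fraction(Q, K, block=64, n_sink=1, n_local=2):
    """Fraction of query rows whose block maximum m~_i^(j) = rowmax(S_ij)
    peaks in a middle key block, outside the sink and local blocks."""
    nb = (K.shape[0] + block - 1) // block
    block_max = np.stack([(Q @ K[j * block:(j + 1) * block].T).max(axis=1)
                          for j in range(nb)], axis=1)    # (rows, nb)
    peak = block_max.argmax(axis=1)                       # where the max lives
    protected = list(range(n_sink)) + list(range(max(0, nb - n_local), nb))
    return (~np.isin(peak, protected)).mean()
```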

Figures

Figures reproduced from arXiv: 2604.12798 by Bai Du, Gaoyige Fan, Hui Dong, Hui Wang, Yanzhao Li, Yupeng Sun, Zhiqiang Zou, Zhiyuan Zhang.

Figure 1. Latency ratio normalized to Tensor on C16V32, comparing the operator-level computation procedures of VFA and standard FA at the granularity of a single (Qi, Kj, Vj) block interaction; VFA adds only a lightweight preprocessing stage, and the extraction of k_repr_j = sabsmax(Kj) can be fused into earlier computation.
Figure 2. Empirical evidence supporting the sink+local reordering strategy. (a) Block similarity of Q measured by cosSim(X). (b) Block similarity of K measured by cosSim(X).
Figure 3. Intra-block similarity for Q and K blocks using the SpargeAttention metric. (a) ℓ2-norm statistics of Q (token/row level). (b) ℓ2-norm statistics of K (token/row level).
Figure 4. Magnitude variation within Q and K blocks visualized by ℓ2-norm statistics.
Figure 5. Representative cases of block-maximum location along the key-block index j, motivating the need for m-initialization; the block maximum is defined as m̃_i^(j) ≜ rowmax(S_ij), where S_ij = Q_i K_j^⊤.
Original abstract

FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Vector Relieved Flash Attention (VFA) that initializes the running maximum in online softmax using a cheap approximation derived from key-block representations, reorders key-block traversal to prioritize sink and local blocks, and freezes the maximum for subsequent blocks to reduce rowmax reductions and rescaling. It further combines this with block-sparse skipping (e.g., BLASST) to form VSA. The method claims to completely avoid the conditional rescale of FA4.0 while preserving exact attention, with empirical support from MMLU, MATH500, and attention statistics showing early stabilization of the maximum, failure of simple summaries due to intra-block heterogeneity, necessity of m-initialization for middle-block maxima, and speedups of nearly 2x (with potential 6x on future hardware) without accuracy loss.

Significance. If the empirical verification of exactness holds across diverse workloads, VFA/VSA would be a meaningful optimization for attention kernels on modern accelerators where vector/SIMD operations increasingly limit throughput as tensor-core matmul rates rise. The hardware-aware focus on avoiding rescale chains and the integration with existing sparse methods are practical strengths.

major comments (2)
  1. [Abstract] The claim of retaining exact online-softmax semantics while 'completely avoid[ing] the conditional rescale operation' rests on the key-block approximation plus sink/local reordering always capturing the global maximum or correctly triggering m-initialization for middle blocks. No derivation or bound is provided showing that this combination guarantees the true global max (as opposed to relying on an unproven statistical property of attention scores), and the frequency/overhead of m-initialization is not quantified, which directly affects whether vector operations are relieved in practice.
  2. [Abstract, evaluations] While attention statistics are cited to verify that 'sink and local reordering stabilizes the running maximum early' and that 'm-initialization is required when maxima appear in middle blocks,' the manuscript does not report concrete metrics (e.g., fraction of blocks triggering m-init, distribution of max locations across heads/layers, or per-configuration error in the initial approximation) that would allow assessment of whether the vector-relief benefit is robust or merely an average-case phenomenon.
minor comments (2)
  1. The projected 6x speedup for C4V16 under 'upcoming architecture improvements' is speculative and should be presented with clearer caveats or removed from the main claims.
  2. The baseline configurations (C16V32, C8V32, etc.) and hardware platform should be defined more explicitly in the experimental section to allow reproduction of the reported speedups.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful review and for recognizing the practical importance of addressing vector bottlenecks in attention kernels. We address each major comment below and have made revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: The claim of retaining exact online-softmax semantics while 'completely avoid[ing] the conditional rescale operation' rests on the key-block approximation plus sink/local reordering always capturing the global maximum or correctly triggering m-initialization for middle blocks. No derivation or bound is provided showing that this combination guarantees the true global max (as opposed to relying on an unproven statistical property of attention scores), and the frequency/overhead of m-initialization is not quantified, which directly affects whether vector operations are relieved in practice.

    Authors: Our method preserves exact semantics because the m-initialization step is triggered specifically for cases where the global maximum occurs in middle blocks, allowing the running maximum to be set to the true value before freezing. The key-block approximation and reordering are used to minimize updates, but exactness does not depend on the approximation being perfect. Attention statistics in the paper show early stabilization with reordering, and we have added quantification of m-initialization frequency and overhead in the revision, demonstrating that it is infrequent and the vector operations are still relieved in practice. revision: yes

  2. Referee: While attention statistics are cited to verify that 'sink and local reordering stabilizes the running maximum early' and that 'm-initialization is required when maxima appear in middle blocks,' the manuscript does not report concrete metrics (e.g., fraction of blocks triggering m-init, distribution of max locations across heads/layers, or per-configuration error in the initial approximation) that would allow assessment of whether the vector-relief benefit is robust or merely an average-case phenomenon.

    Authors: We concur that providing these concrete metrics would better demonstrate robustness. Accordingly, we have revised the manuscript to include the requested metrics: the fraction of blocks triggering m-init (reported per model and layer), the distribution of max locations, and approximation errors. These are now detailed in Section 4 and a new table, showing consistent behavior across heads and layers, confirming the benefit is robust. revision: yes
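For concreteness, the promised per-row approximation error could be measured along these lines; the summary proxy reuses the assumption from the VFA sketch above and may differ from the paper's actual k_repr.

```python
import numpy as np

def init_gap(Q, K, block=64):
    """Per-row gap between the summary-based max estimate and the true
    rowmax of Q K^T. Reuses the |Q| @ blockwise-absmax(K) proxy from the
    VFA sketch; a gap >= 0 means the proxy never undershoots."""
    nb = (K.shape[0] + block - 1) // block
    k_repr = np.stack([np.abs(K[j * block:(j + 1) * block]).max(axis=0)
                       for j in range(nb)])
    m_init = (np.abs(Q) @ k_repr.T).max(axis=1)
    m_true = (Q @ K.T).max(axis=1)
    return m_init - m_true
```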

standing simulated objections not resolved
  • A theoretical derivation or bound proving that the approximation and reordering always lead to correct triggering of m-initialization for the true global max without relying on properties of attention scores.

Circularity Check

0 steps flagged

No circularity: algorithmic modification with measured empirical outcomes

full rationale

The paper presents VFA as a set of concrete algorithmic changes (block-summary initialization of the running max, sink/local reordering, and subsequent freezing of the max) that are intended to reduce vector operations while preserving the online-softmax structure. All performance claims are stated as measured speedups on hardware, with accuracy checked against external benchmarks (MMLU, MATH500) and attention statistics, rather than as derived predictions. No equations equate a claimed result to its own inputs by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The claims are therefore grounded in external benchmarks rather than in the method's own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into full assumptions; the method rests on standard FlashAttention properties plus paper-specific claims about block heterogeneity and maximum distribution.

axioms (2)
  • domain assumption Online softmax with running maximum and normalizer computes exact attention with linear memory
    Core property of FlashAttention-style methods invoked throughout the description
  • ad hoc to paper Key-block representations provide a sufficiently accurate cheap approximation for the global maximum
    Central to the initialization step proposed in VFA
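The first axiom is directly checkable: the online normalizer of Milakov and Gimelshein [17] recovers the dense softmax exactly from one streaming pass with constant extra state. A self-contained check:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

# Dense reference: needs all scores at once.
ref = np.exp(x - x.max())
ref /= ref.sum()

# Online pass: constant extra memory (m, l), one streaming scan.
m, l = -np.inf, 0.0
for xi in x:
    m_new = max(m, xi)
    l = l * np.exp(m - m_new) + np.exp(xi - m_new)
    m = m_new
online = np.exp(x - m) / l
print(np.allclose(online, ref))   # True: exact softmax from a single stream
```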

pith-pipeline@v0.9.0 · 5659 in / 1553 out tokens · 52051 ms · 2026-05-10T16:05:59.298119+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 19 canonical work pages · 7 internal anchors

  1. [1] F. Abecassis, A. Agrusa, D. Ahn, J. Alben, S. Alborghetti, M. Andersch, S. Arayandi, A. Bjorlin, A. Blakeman, E. Briones, et al. Pretraining large language models with NVFP4. arXiv preprint arXiv:2509.25149, 2025.
  2. [2] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
  3. [3] K. Alexandridis, V. Titopoulos, and G. Dimitrakopoulos. Flash-D: FlashAttention with hidden softmax division. arXiv preprint arXiv:2505.14201, 2025.
  4. [4] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  5. [5] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
  6. [6] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  7. [7] T. Dao et al. flash-attention: Fast and memory-efficient exact attention with IO-awareness. GitHub repository. https://github.com/Dao-AILab/flash-attention
  8. [8] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  9. [9] A. Hatamizadeh, G. Heinrich, H. Yin, A. Tao, J. M. Alvarez, J. Kautz, and P. Molchanov. FasterViT: Fast vision transformers with hierarchical attention. arXiv preprint arXiv:2306.06189, 2023.
  10. [10] K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, Y. Dong, and Y. Wang. FlashDecoding++: Faster large language model inference on GPUs. arXiv preprint arXiv:2311.01282, 2023.
  11. [11] H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y. Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024.
  12. [12] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
  13. [13] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
  14. [14] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
  15. [15] E. Lu, Z. Jiang, J. Liu, Y. Du, T. Jiang, C. Hong, S. Liu, W. He, E. Yuan, Y. Wang, et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025.
  16. [16] Y. Luo, J. Huang, Y. Cheng, Z. Yu, K. Zhang, K. Hong, X. Ma, X. Wang, A. Tong, G. Hu, et al. HiFloat4 format for language model inference. arXiv preprint arXiv:2602.11287, 2026.
  17. [17] M. Milakov and N. Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
  18. [18] P. Nawrot, R. Li, R. Huang, S. Ruder, K. Marchisio, and E. M. Ponti. The sparse frontier: Sparse attention trade-offs in transformer LLMs. arXiv preprint arXiv:2504.17768, 2025.
  19. [19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  20. [20] A. Roy, M. Saffar, A. Vaswani, and D. Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
  21. [21] J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.
  22. [22] N. Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
  23. [23] Y. Sun, Z. Li, Y. Zhang, T. Pan, B. Dong, Y. Guo, and J. Wang. Efficient attention mechanisms for large language models: A survey. arXiv preprint arXiv:2507.19595, 2025.
  24. [24] A. Tseng, T. Yu, and Y. Park. Training LLMs with MXFP4. arXiv preprint arXiv:2502.20586, 2025.
  25. [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  26. [26] J. Yuan, C. Shinn, K. Xu, J. Cui, G. Klimiashvili, G. Xiao, P. Zheng, B. Li, Y. Zhou, Z. Ye, et al. BLASST: Dynamic blocked attention sparsity via softmax thresholding. arXiv preprint arXiv:2512.12087, 2025.
  27. [27] T. Zadouri, M. Hoehnerbach, J. Shah, T. Liu, V. Thakkar, and T. Dao. FlashAttention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling. arXiv preprint arXiv:2603.05451, 2026.
  28. [28] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
  29. [29] J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen. SpargeAttention: Accurate and training-free sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025.