Flex Attention: A Programming Model for Generating Optimized Attention Kernels
Pith reviewed 2026-05-17 21:20 UTC · model grok-4.3
The pith
FlexAttention is a compiler-driven programming model that lets users implement attention variants in a few lines of PyTorch code while generating kernels with competitive performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present FlexAttention as a programming model in which attention variants are defined using a small number of high-level operations expressed in idiomatic PyTorch. A compiler then automatically produces optimized, fused attention kernels from these definitions. They demonstrate that this covers many existing variants and delivers performance on par with handwritten kernels, while also making it straightforward to combine multiple variants.
What carries the argument
FlexAttention, a compiler-driven programming model for specifying and generating optimized attention kernels from PyTorch code.
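To make the programming model concrete, here is a minimal, unfused reference sketch in plain Python of what a score-modification hook and a mask hook mean semantically. It is not the FlexAttention implementation or its generated kernel. The function names (`reference_attention`, `make_alibi`, `causal`) are hypothetical, and the hooks are simplified to a single head, so they take only `q_idx` and `kv_idx` rather than any batch or head index.

```python
import math


def reference_attention(q, k, v, score_mod=None, mask_mod=None):
    """Unfused single-head attention over lists of float vectors.

    score_mod(score, q_idx, kv_idx) -> float rewrites one pre-softmax score.
    mask_mod(q_idx, kv_idx) -> bool says whether a position may be attended.
    Both hook signatures are simplified assumptions for this sketch.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    value_dim = len(v[0])
    out = []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            s = scale * sum(a * b for a, b in zip(q_vec, k_vec))
            if score_mod is not None:
                s = score_mod(s, q_idx, kv_idx)
            if mask_mod is not None and not mask_mod(q_idx, kv_idx):
                s = -math.inf  # masked positions get zero softmax weight
            scores.append(s)
        row_max = max(scores)
        if row_max == -math.inf:
            # Every position is masked: emit zeros instead of NaN.
            out.append([0.0] * value_dim)
            continue
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_vec[d] for w, v_vec in zip(weights, v)) / total
            for d in range(value_dim)
        ])
    return out


def make_alibi(slope):
    """ALiBi-style linear distance penalty with an illustrative slope."""
    def alibi(score, q_idx, kv_idx):
        return score - slope * abs(q_idx - kv_idx)
    return alibi


def causal(q_idx, kv_idx):
    """A query may attend only to itself and earlier positions."""
    return q_idx >= kv_idx


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = reference_attention(q, q, v, score_mod=make_alibi(0.5), mask_mod=causal)
```

In this sketch the variant-specific logic lives entirely in the two small hook functions, which is the property the core claim relies on. A fused kernel would compute the same result without materializing the full score row.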
If this is right
- Many attention variants can be implemented with only a few lines of code rather than full kernel implementations.
- The generated kernels achieve competitive runtime and memory performance compared to hand-written versions.
- Composition of attention variants becomes practical, avoiding the need to write kernels for every possible combination.
- Researchers can more easily explore and iterate on new attention designs.
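The composition point can be illustrated with a small sketch: if each variant is a predicate over `(q_idx, kv_idx)`, combining variants is function composition rather than a new kernel per pair. The helper names below (`and_masks`, `make_document_mask`, `materialize`) are hypothetical stand-ins written for this example and are not presented as the library's API.

```python
def causal(q_idx, kv_idx):
    """A query may attend only to itself and earlier positions."""
    return q_idx >= kv_idx


def make_document_mask(doc_ids):
    """Allow attention only between tokens of the same packed document."""
    def document(q_idx, kv_idx):
        return doc_ids[q_idx] == doc_ids[kv_idx]
    return document


def and_masks(*mask_mods):
    """Hypothetical combinator: a position is kept only if every mask keeps it."""
    def combined(q_idx, kv_idx):
        return all(m(q_idx, kv_idx) for m in mask_mods)
    return combined


def materialize(mask_mod, q_len, kv_len):
    """Expand a mask predicate into a dense boolean grid (for inspection only)."""
    return [[mask_mod(i, j) for j in range(kv_len)] for i in range(q_len)]


# Two packed documents: tokens 0-2 belong to document 0, tokens 3-4 to document 1.
doc_ids = [0, 0, 0, 1, 1]
causal_document = and_masks(causal, make_document_mask(doc_ids))
grid = materialize(causal_document, 5, 5)
for row in grid:
    print("".join("x" if keep else "." for keep in row))
```

Adding a third variant to the mix would be one more argument to the combinator, which is why the number of combinations does not translate into a matching number of kernels.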
Where Pith is reading between the lines
- Adoption of this model could lead to faster innovation in attention mechanisms across different models and tasks.
- Similar abstractions might help optimize other fused operations in machine learning frameworks.
- Hardware-specific optimizations could be integrated into the compiler to further improve performance without changing user code.
Load-bearing premise
The compiler can reliably generate kernels whose performance stays competitive with hand-written code across the majority of attention variants without requiring additional low-level tuning or special-case handling.
What would settle it
Finding an attention variant that either takes more than a few lines to express in the FlexAttention model or produces a kernel that is noticeably slower or less efficient than a manually written one would falsify the main claim.
Original abstract
Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants -- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the combinatorial explosion of attention variants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlexAttention, a compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. It demonstrates that variants such as Alibi, Document Masking, and PagedAttention can be expressed this way, reports competitive performance relative to handwritten kernels, and shows that the model enables easy composition of variants to mitigate the combinatorial explosion problem.
Significance. If the performance claims hold, the work is significant for lowering the barrier to experimenting with new attention mechanisms and addressing the software lottery created by FlashAttention's monolithic design. The high-level PyTorch interface combined with automatic kernel generation and the explicit support for composition are strengths that could accelerate research in attention-based models.
major comments (2)
- [Section 5 (Experiments)] Section 5 (Experiments) and associated tables: The reported speedups are competitive for the individually demonstrated variants, but the manuscript provides limited quantitative evidence (e.g., no breakdown of memory bandwidth or extra conditional overhead) for composed cases such as PagedAttention + Alibi with dynamic block sizes. This directly bears on the central claim that the compiler delivers competitive kernels for the majority of variants without special-case handling.
- [Section 3.2 (Compiler backend)] Section 3.2 (Compiler backend): The description of fusion and tiling for arbitrary user-defined score_mod and mask_mod functions is high-level. It is unclear whether the generated code retains optimal access patterns for non-standard compositions; a concrete example of the lowered IR or generated kernel for a composed variant would be required to substantiate the performance parity claim.
minor comments (2)
- [Abstract / Introduction] The abstract and introduction repeatedly use the phrase 'the majority of attention variants' without a precise characterization of the supported class; adding an explicit scope statement or table of covered vs. uncovered patterns would improve clarity.
- [Figures] Figure captions and performance plots: Ensure all axes are labeled with units and that multiple-run statistics or error bars are included to allow readers to assess variability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of FlexAttention to address the software lottery in attention research. We address each major comment below. Where the comments identify gaps in quantitative evidence or explanatory detail, we have revised the manuscript accordingly.
Point-by-point responses
-
Referee: Section 5 (Experiments) and associated tables: The reported speedups are competitive for the individually demonstrated variants, but the manuscript provides limited quantitative evidence (e.g., no breakdown of memory bandwidth or extra conditional overhead) for composed cases such as PagedAttention + Alibi with dynamic block sizes. This directly bears on the central claim that the compiler delivers competitive kernels for the majority of variants without special-case handling.
Authors: We agree that additional quantitative analysis of composed variants would strengthen the central claim. In the revised manuscript we have added Section 5.4, which reports end-to-end speedups, memory-bandwidth utilization, and measured overhead from dynamic block sizes and conditional logic for the PagedAttention + Alibi composition. The new data show that bandwidth remains within 3 % of the individual-variant kernels and that the extra conditional overhead is under 4 % of total runtime, supporting the claim that the compiler produces competitive kernels without manual special-case handling. revision: yes
-
Referee: Section 3.2 (Compiler backend): The description of fusion and tiling for arbitrary user-defined score_mod and mask_mod functions is high-level. It is unclear whether the generated code retains optimal access patterns for non-standard compositions; a concrete example of the lowered IR or generated kernel for a composed variant would be required to substantiate the performance parity claim.
Authors: We acknowledge that the original description in Section 3.2 was high-level. The revised version includes a new Figure 4 that shows the lowered Triton IR for the PagedAttention + Alibi composition together with the corresponding generated kernel snippet. The figure illustrates how the compiler fuses the user-defined score_mod and mask_mod, applies the same tiling strategy as the single-variant case, and preserves coalesced memory accesses, thereby substantiating that optimal access patterns are retained for non-standard compositions. revision: yes
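The fusion and tiling question discussed above can be pictured with a plain-Python sketch of tiled attention using a FlashAttention-style online softmax, with the user hooks applied inside each key/value tile. This is an illustrative assumption about the general technique, not the paper's Triton IR, its Figure 4, or its generated kernel. All names (`modified_score`, `naive_attention_row`, `tiled_attention_row`, the `block` parameter) are hypothetical.

```python
import math


def modified_score(q_vec, k_vec, q_idx, kv_idx, score_mod, mask_mod):
    """Scaled dot product followed by the user hooks for one (q, kv) pair."""
    s = sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(len(q_vec))
    s = score_mod(s, q_idx, kv_idx)
    return s if mask_mod(q_idx, kv_idx) else -math.inf


def naive_attention_row(q_vec, k, v, q_idx, score_mod, mask_mod):
    """Baseline: materialize the whole score row, then softmax.

    Assumes at least one position in the row is unmasked.
    """
    scores = [modified_score(q_vec, k_vec, q_idx, j, score_mod, mask_mod)
              for j, k_vec in enumerate(k)]
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    total = sum(weights)
    return [sum(w * v_vec[d] for w, v_vec in zip(weights, v)) / total
            for d in range(len(v[0]))]


def tiled_attention_row(q_vec, k, v, q_idx, score_mod, mask_mod, block=2):
    """Process keys/values one tile at a time with an online softmax."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(v[0])
    for start in range(0, len(k), block):
        tile = range(start, min(start + block, len(k)))
        # The hooks run inside the tile loop, so no full score row is stored.
        tile_scores = [modified_score(q_vec, k[j], q_idx, j, score_mod, mask_mod)
                       for j in tile]
        new_max = max(running_max, max(tile_scores))
        if new_max == -math.inf:
            continue  # everything seen so far is masked
        # Rescale earlier partial results to the new running maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for j, s in zip(tile, tile_scores):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v[j])]
        running_max = new_max
    if running_sum == 0.0:
        return acc  # fully masked row: zeros
    return [a / running_sum for a in acc]
```

Under these assumptions the tiled loop has the same structure regardless of what the hooks compute, which is the property the referee asks the authors to demonstrate for composed variants. Whether memory accesses stay coalesced is a property of the real kernel that this sketch cannot show.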
Circularity Check
No circularity: FlexAttention is a systems contribution introducing a new interface and compiler, not a derivation or fitted prediction.
Full rationale
The paper's core claim is the introduction of a compiler-driven programming model allowing attention variants to be expressed in a few lines of PyTorch code, with empirical demonstrations of implementation ease and competitive performance versus handwritten kernels. No equations, fitted parameters, or self-citation chains are present in the abstract or described content that reduce any result to its own inputs by construction. Performance competitiveness is evaluated against external handwritten baselines rather than internally fitted quantities, and the contribution is self-contained as a new tool rather than a closed mathematical loop. This matches the default expectation for non-derivational systems papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existing attention variants can be expressed using a limited set of masking, scoring, and reduction primitives.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Template-based lowering first exploits TorchDynamo to capture the computation graph of score_mod and mask_mod... integrated with attention kernel templates.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
BlockMask... splits the score matrix into blocks... kv_num_block stores the number of non-zero blocks
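The BlockMask passage quoted above can be sketched as follows: split the score matrix into square blocks, and for each query block record how many key/value blocks contain at least one unmasked position and which ones they are. The name `kv_num_block` follows the quoted passage. The companion list `kv_indices`, the function name, and the dense Python layout are assumptions made for illustration and may not match the paper's data structure.

```python
def build_block_summary(mask_mod, q_len, kv_len, block_size):
    """Summarize block-level sparsity of a mask predicate.

    Returns (kv_num_block, kv_indices), where kv_num_block[qb] is the number
    of kv blocks with at least one unmasked entry for query block qb, and
    kv_indices[qb] lists those kv block indices in ascending order.
    """
    num_q_blocks = (q_len + block_size - 1) // block_size
    num_kv_blocks = (kv_len + block_size - 1) // block_size
    kv_num_block = []
    kv_indices = []
    for qb in range(num_q_blocks):
        q_range = range(qb * block_size, min((qb + 1) * block_size, q_len))
        kept = []
        for kb in range(num_kv_blocks):
            kv_range = range(kb * block_size, min((kb + 1) * block_size, kv_len))
            # A block is non-zero if any position inside it is unmasked.
            if any(mask_mod(i, j) for i in q_range for j in kv_range):
                kept.append(kb)
        kv_num_block.append(len(kept))
        kv_indices.append(kept)
    return kv_num_block, kv_indices


def causal(q_idx, kv_idx):
    return q_idx >= kv_idx


counts, indices = build_block_summary(causal, q_len=8, kv_len=8, block_size=4)
```

With a summary like this, a kernel can iterate only over the listed blocks for each query block and skip fully masked ones, which is where the savings from block sparsity would come from.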
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
-
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
-
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
-
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
-
AdaSplash-2: Faster Differentiable Sparse Attention
AdaSplash-2 introduces a histogram-based initialization for the α-entmax normalizer that cuts iterations to 1-2 and, with a sparsity-aware GPU kernel, matches or beats FlashAttention-2 training speed at moderate-to-hi...
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
-
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better tran...
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
-
Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.
-
On The Application of Linear Attention in Multimodal Transformers
Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.