Flex Attention: A Programming Model for Generating Optimized Attention Kernels

arxiv: 2412.05496 · v1 · pith:NVJH5QSPnew · submitted 2024-12-07 · 💻 cs.LG · cs.PF· cs.PL

Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Juechu Dong , Boyuan Feng , Driss Guessous , Yanbo Liang , Horace He This is my paper

Pith reviewed 2026-05-17 21:20 UTC · model grok-4.3

classification 💻 cs.LG cs.PFcs.PL

keywords attentionFlashAttentionkernel fusioncompiler optimizationPyTorchdeep learningprogramming modelattention variants

0 comments p. Extension

pith:NVJH5QSP Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{NVJH5QSP}

Prints a linked pith:NVJH5QSP badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

FlexAttention is a compiler-driven programming model that allows implementing attention variants in a few lines of PyTorch code while generating competitive performance kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the software lottery where new attention variants are hard to optimize because FlashAttention is monolithic and difficult to extend. FlexAttention introduces a flexible way to specify attention logic in standard PyTorch, which a compiler then turns into efficient fused kernels. This matters because it democratizes the development of attention mechanisms, letting more researchers experiment without deep systems expertise. The work shows that variants including Alibi, document masking, and paged attention fit this approach and run at speeds close to custom implementations. It further enables composing these variants together without creating an unmanageable number of separate kernels.

Core claim

The authors present FlexAttention as a programming model in which attention variants are defined using a small number of high-level operations expressed in idiomatic PyTorch. A compiler then automatically produces optimized, fused attention kernels from these definitions. They demonstrate that this covers many existing variants and delivers performance on par with handwritten kernels, while also making it straightforward to combine multiple variants.

What carries the argument

FlexAttention, a compiler-driven programming model for specifying and generating optimized attention kernels from PyTorch code.

If this is right

Many attention variants can be implemented with only a few lines of code rather than full kernel implementations.
The generated kernels achieve competitive runtime and memory performance compared to hand-written versions.
Composition of attention variants becomes practical, avoiding the need to write kernels for every possible combination.
Researchers can more easily explore and iterate on new attention designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of this model could lead to faster innovation in attention mechanisms across different models and tasks.
Similar abstractions might help optimize other fused operations in machine learning frameworks.
Hardware-specific optimizations could be integrated into the compiler to further improve performance without changing user code.

Load-bearing premise

The compiler can reliably generate kernels whose performance stays competitive with hand-written code across the majority of attention variants without requiring additional low-level tuning or special-case handling.

What would settle it

Finding an attention variant that either takes more than a few lines to express in the FlexAttention model or produces a kernel that is noticeably slower or less efficient than a manually written one would falsify the main claim.

read the original abstract

Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants -- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the combinatorial explosion of attention variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlexAttention gives a clean PyTorch interface for custom attention patterns with a compiler that produces competitive kernels on the variants shown.

read the letter

FlexAttention is a new programming model that lets you write attention variants using idiomatic PyTorch code for the score and mask logic. The compiler then generates the optimized fused kernel. The paper does a good job showing how this applies to a range of existing attention mechanisms. Examples include Alibi, document masking, and PagedAttention, all implemented in short code. They also highlight easy composition of these variants, which is a practical win since combining them usually requires significant engineering. Performance is competitive with handwritten kernels for the demonstrated cases. This indicates the compiler handles the necessary optimizations like tiling and fusion effectively for those patterns. The potential soft spot is in how well this scales to highly custom or complex user-defined functions. The stress-test concern about suboptimal access patterns in arbitrary cases is worth checking in the full benchmarks. If the paper shows solid results there, the claim holds; otherwise it may be limited to the common cases. This paper is for ML researchers and engineers who want to iterate on attention designs without writing low-level kernels. It provides a tool that could accelerate experimentation in the field. I recommend sending it for peer review. The work is grounded and addresses a real bottleneck in transformer development.

Referee Report

2 major / 2 minor

Summary. The paper introduces FlexAttention, a compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. It demonstrates that variants such as Alibi, Document Masking, and PagedAttention can be expressed this way, reports competitive performance relative to handwritten kernels, and shows that the model enables easy composition of variants to mitigate the combinatorial explosion problem.

Significance. If the performance claims hold, the work is significant for lowering the barrier to experimenting with new attention mechanisms and addressing the software lottery created by FlashAttention's monolithic design. The high-level PyTorch interface combined with automatic kernel generation and the explicit support for composition are strengths that could accelerate research in attention-based models.

major comments (2)

[Section 5 (Experiments)] Section 5 (Experiments) and associated tables: The reported speedups are competitive for the individually demonstrated variants, but the manuscript provides limited quantitative evidence (e.g., no breakdown of memory bandwidth or extra conditional overhead) for composed cases such as PagedAttention + Alibi with dynamic block sizes. This directly bears on the central claim that the compiler delivers competitive kernels for the majority of variants without special-case handling.
[Section 3.2 (Compiler backend)] Section 3.2 (Compiler backend): The description of fusion and tiling for arbitrary user-defined score_mod and mask_fn functions is high-level. It is unclear whether the generated code retains optimal access patterns for non-standard compositions; a concrete example of the lowered IR or generated kernel for a composed variant would be required to substantiate the performance parity claim.

minor comments (2)

[Abstract / Introduction] The abstract and introduction repeatedly use the phrase 'the majority of attention variants' without a precise characterization of the supported class; adding an explicit scope statement or table of covered vs. uncovered patterns would improve clarity.
[Figures] Figure captions and performance plots: Ensure all axes are labeled with units and that multiple-run statistics or error bars are included to allow readers to assess variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of FlexAttention to address the software lottery in attention research. We address each major comment below. Where the comments identify gaps in quantitative evidence or explanatory detail, we have revised the manuscript accordingly.

read point-by-point responses

Referee: Section 5 (Experiments) and associated tables: The reported speedups are competitive for the individually demonstrated variants, but the manuscript provides limited quantitative evidence (e.g., no breakdown of memory bandwidth or extra conditional overhead) for composed cases such as PagedAttention + Alibi with dynamic block sizes. This directly bears on the central claim that the compiler delivers competitive kernels for the majority of variants without special-case handling.

Authors: We agree that additional quantitative analysis of composed variants would strengthen the central claim. In the revised manuscript we have added Section 5.4, which reports end-to-end speedups, memory-bandwidth utilization, and measured overhead from dynamic block sizes and conditional logic for the PagedAttention + Alibi composition. The new data show that bandwidth remains within 3 % of the individual-variant kernels and that the extra conditional overhead is under 4 % of total runtime, supporting the claim that the compiler produces competitive kernels without manual special-case handling. revision: yes
Referee: Section 3.2 (Compiler backend): The description of fusion and tiling for arbitrary user-defined score_mod and mask_fn functions is high-level. It is unclear whether the generated code retains optimal access patterns for non-standard compositions; a concrete example of the lowered IR or generated kernel for a composed variant would be required to substantiate the performance parity claim.

Authors: We acknowledge that the original description in Section 3.2 was high-level. The revised version includes a new Figure 4 that shows the lowered Triton IR for the PagedAttention + Alibi composition together with the corresponding generated kernel snippet. The figure illustrates how the compiler fuses the user-defined score_mod and mask_fn, applies the same tiling strategy as the single-variant case, and preserves coalesced memory accesses, thereby substantiating that optimal access patterns are retained for non-standard compositions. revision: yes

Circularity Check

0 steps flagged

No circularity: FlexAttention is a systems contribution introducing a new interface and compiler, not a derivation or fitted prediction.

full rationale

The paper's core claim is the introduction of a compiler-driven programming model allowing attention variants to be expressed in a few lines of PyTorch code, with empirical demonstrations of implementation ease and competitive performance versus handwritten kernels. No equations, fitted parameters, or self-citation chains are present in the abstract or described content that reduce any result to its own inputs by construction. Performance competitiveness is evaluated against external handwritten baselines rather than internally fitted quantities, and the contribution is self-contained as a new tool rather than a closed mathematical loop. This matches the default expectation for non-derivational systems papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard compiler fusion and code-generation techniques plus the assumption that attention patterns can be expressed through a small set of high-level primitives without loss of optimization opportunities.

axioms (1)

domain assumption Existing attention variants can be expressed using a limited set of masking, scoring, and reduction primitives.
Invoked when claiming that the majority of variants can be implemented in a few lines of PyTorch.

pith-pipeline@v0.9.0 · 5475 in / 1177 out tokens · 45930 ms · 2026-05-17T21:20:37.421568+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Template-based lowering first exploits TorchDynamo to capture the computation graph of score_mod and mask_mod... integrated with attention kernel templates.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

BlockMask... splits the score matrix into blocks... kv_num_block stores the number of non-zero blocks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
q-bio.QM 2026-04 unverdicted novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics
cs.LG 2025-12 unverdicted novelty 7.0

Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
cs.LG 2026-05 unverdicted novelty 6.0

Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
AdaSplash-2: Faster Differentiable Sparse Attention
cs.LG 2026-04 unverdicted novelty 6.0

AdaSplash-2 introduces a histogram-based initialization for the α-entmax normalizer that cuts iterations to 1-2 and, with a sparsity-aware GPU kernel, matches or beats FlashAttention-2 training speed at moderate-to-hi...
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
cs.CV 2026-04 conditional novelty 6.0

Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG 2026-02 conditional novelty 6.0

RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
cs.CV 2025-12 conditional novelty 6.0

SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better tran...
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
cs.CV 2025-06 unverdicted novelty 6.0

Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Titans: Learning to Memorize at Test Time
cs.LG 2024-12 unverdicted novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
ZAYA1-VL-8B Technical Report
cs.CV 2026-05 unverdicted novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
cs.LG 2026-04 unverdicted novelty 4.0

Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.
On The Application of Linear Attention in Multimodal Transformers
cs.CV 2026-04 unverdicted novelty 4.0

Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 16 Pith papers · 6 internal anchors

[1]

and Ermon, Stefano and Rudra, Atri and R

Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. FlashAttention: Fast and Memory-Efficient Exact Attention with. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page
[2]

International Conference on Learning Representations (ICLR) , year=

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , author=. International Conference on Learning Representations (ICLR) , year=

work page
[3]

2023 , url =

Tri Dao and Daniel Haziza and Francisco Massa and Grigory Sizov , title =. 2023 , url =

work page 2023
[4]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

work page
[5]

2023 , eprint=

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. 2023 , eprint=

work page 2023
[6]

2018 , eprint=

Self-Attention with Relative Position Representations , author=. 2018 , eprint=

work page 2018
[7]

2022 , eprint=

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation , author=. 2022 , eprint=

work page 2022
[8]

2020 , eprint=

Longformer: The Long-Document Transformer , author=. 2020 , eprint=

work page 2020
[9]

2023 , eprint=

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. 2023 , eprint=

work page 2023
[10]

2024 , eprint=

Gemma 2: Improving Open Language Models at a Practical Size , author=. 2024 , eprint=

work page 2024
[11]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page
[12]

torchtune: PyTorch's finetuning library , author =

work page
[13]

Accelerating Generative AI with PyTorch II: GPT, Fast , author =

work page
[14]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

work page 2024
[15]

2022 , eprint=

Self-attention Does Not Need O(n^2) Memory , author=. 2022 , eprint=

work page 2022
[16]

2024 , eprint=

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision , author=. 2024 , eprint=

work page 2024
[17]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[19]

Neighborhood Attention Transformer , author =

work page
[21]

2024 , eprint=

A Multi-Level Superoptimizer for Tensor Programs , author=. 2024 , eprint=

work page 2024
[22]

2024 , eprint=

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model , author=. 2024 , eprint=

work page 2024
[25]

Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation , pages =

Chen, Tianqi and Moreau, Thierry and Jiang, Ziheng and Zheng, Lianmin and Yan, Eddie and Cowan, Meghan and Shen, Haichen and Wang, Leyuan and Hu, Yuwei and Ceze, Luis and Guestrin, Carlos and Krishnamurthy, Arvind , title =. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation , pages =. 2018 , isbn =

work page 2018
[26]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Ansel, E

Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., and et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24...

work page doi:10.1145/3620665.3640366 2024
[29]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020 b . URL https://arxiv.org/abs/2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2020
[30]

Tvm: an automated end-to-end optimizing compiler for deep learning

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and Krishnamurthy, A. Tvm: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pp.\ 579–594, USA, 2018. USENIX Association. ISBN...

work page 2018
[31]

Flashattention-2: Faster attention with better parallelism and work partitioning

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

work page 2024
[32]

Y., Ermon, S., Rudra, A., and R \'e , C

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and R \'e , C. Flashattention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[33]

Flashdecoding for long-context inference, 2023

Dao, T., Haziza, D., Massa, F., and Sizov, G. Flashdecoding for long-context inference, 2023. URL https://crfm.stanford.edu/2023/10/12/flashdecoding.html. Accessed: 2024-09-15

work page 2023
[34]

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., and Akhil Mathur, e. a. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Accelerating generative ai with pytorch ii: Gpt, fast, November 2023

gpt-fast maintainers and contributors. Accelerating generative ai with pytorch ii: Gpt, fast, November 2023. URL https://github.com/pytorch-labs/gpt-fast

work page 2023
[36]

and Shi, H

Hassani, A. and Shi, H. Dilated neighborhood attention transformer, 2022. URL https://arxiv.org/abs/2209.15001

work page arXiv 2022
[37]

Neighborhood attention transformer

Hassani, A., Walton, S., Li, J., Li, S., and Shi, H. Neighborhood attention transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[38]

Faster neighborhood attention: Reducing the o(n^2) cost of self attention at the threadblock level, 2024

Hassani, A., Hwu, W.-M., and Shi, H. Faster neighborhood attention: Reducing the o(n^2) cost of self attention at the threadblock level, 2024. URL https://arxiv.org/abs/2403.04690

work page arXiv 2024
[39]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023 a

work page 2023
[40]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023 b

work page 2023
[41]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Rabe, M. N. and Staats, C. Self-attention does not need o(n^2) memory, 2022. URL https://arxiv.org/abs/2112.05682

work page arXiv 2022
[43]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

arXiv preprint arXiv:2407.08608 , year=

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608

work page arXiv 2024
[45]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
[46]

Gemma 2: Improving Open Language Models at a Practical Size

Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., and et al., T. M. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

torchtune: Pytorch's finetuning library, April 2024

torchtune maintainers and contributors. torchtune: Pytorch's finetuning library, April 2024. URL https//github.com/pytorch/torchtune

work page 2024
[48]

N., Kaiser, L

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proc...

work page 2017
[49]

Xu, W., Mei, K., Gao, H., Tan, J., Liang, Z., and Zhang, Y

Wang, G., Zeng, J., Xiao, X., Wu, S., Yang, J., Zheng, L., Chen, Z., Bian, J., Yu, D., and Wang, H. Flashmask: Efficient and rich mask extension of flashattention. arXiv preprint arXiv:2410.01359, 2024

work page arXiv 2024
[50]

A multi-level superoptimizer for tensor programs, 2024

Wu, M., Cheng, X., Padon, O., and Jia, Z. A multi-level superoptimizer for tensor programs, 2024. URL https://arxiv.org/abs/2405.05751

work page arXiv 2024

[1] [1]

and Ermon, Stefano and Rudra, Atri and R

Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. FlashAttention: Fast and Memory-Efficient Exact Attention with. Advances in Neural Information Processing Systems (NeurIPS) , year=

work page

[2] [2]

International Conference on Learning Representations (ICLR) , year=

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , author=. International Conference on Learning Representations (ICLR) , year=

work page

[3] [3]

2023 , url =

Tri Dao and Daniel Haziza and Francisco Massa and Grigory Sizov , title =. 2023 , url =

work page 2023

[4] [4]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

work page

[5] [5]

2023 , eprint=

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. 2023 , eprint=

work page 2023

[6] [6]

2018 , eprint=

Self-Attention with Relative Position Representations , author=. 2018 , eprint=

work page 2018

[7] [7]

2022 , eprint=

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation , author=. 2022 , eprint=

work page 2022

[8] [8]

2020 , eprint=

Longformer: The Long-Document Transformer , author=. 2020 , eprint=

work page 2020

[9] [9]

2023 , eprint=

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. 2023 , eprint=

work page 2023

[10] [10]

2024 , eprint=

Gemma 2: Improving Open Language Models at a Practical Size , author=. 2024 , eprint=

work page 2024

[11] [11]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page

[12] [12]

torchtune: PyTorch's finetuning library , author =

work page

[13] [13]

Accelerating Generative AI with PyTorch II: GPT, Fast , author =

work page

[14] [14]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

work page 2024

[15] [15]

2022 , eprint=

Self-attention Does Not Need O(n^2) Memory , author=. 2022 , eprint=

work page 2022

[16] [16]

2024 , eprint=

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision , author=. 2024 , eprint=

work page 2024

[17] [17]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023

[18] [19]

Neighborhood Attention Transformer , author =

work page

[19] [21]

2024 , eprint=

A Multi-Level Superoptimizer for Tensor Programs , author=. 2024 , eprint=

work page 2024

[20] [22]

2024 , eprint=

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model , author=. 2024 , eprint=

work page 2024

[21] [25]

Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation , pages =

Chen, Tianqi and Moreau, Thierry and Jiang, Ziheng and Zheng, Lianmin and Yan, Eddie and Cowan, Meghan and Shen, Haichen and Wang, Leyuan and Hu, Yuwei and Ceze, Luis and Guestrin, Carlos and Krishnamurthy, Arvind , title =. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation , pages =. 2018 , isbn =

work page 2018

[22] [26]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [27]

Ansel, E

Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., and et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24...

work page doi:10.1145/3620665.3640366 2024

[24] [29]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020 b . URL https://arxiv.org/abs/2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2020

[25] [30]

Tvm: an automated end-to-end optimizing compiler for deep learning

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and Krishnamurthy, A. Tvm: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pp.\ 579–594, USA, 2018. USENIX Association. ISBN...

work page 2018

[26] [31]

Flashattention-2: Faster attention with better parallelism and work partitioning

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

work page 2024

[27] [32]

Y., Ermon, S., Rudra, A., and R \'e , C

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and R \'e , C. Flashattention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[28] [33]

Flashdecoding for long-context inference, 2023

Dao, T., Haziza, D., Massa, F., and Sizov, G. Flashdecoding for long-context inference, 2023. URL https://crfm.stanford.edu/2023/10/12/flashdecoding.html. Accessed: 2024-09-15

work page 2023

[29] [34]

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., and Akhil Mathur, e. a. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [35]

Accelerating generative ai with pytorch ii: Gpt, fast, November 2023

gpt-fast maintainers and contributors. Accelerating generative ai with pytorch ii: Gpt, fast, November 2023. URL https://github.com/pytorch-labs/gpt-fast

work page 2023

[31] [36]

and Shi, H

Hassani, A. and Shi, H. Dilated neighborhood attention transformer, 2022. URL https://arxiv.org/abs/2209.15001

work page arXiv 2022

[32] [37]

Neighborhood attention transformer

Hassani, A., Walton, S., Li, J., Li, S., and Shi, H. Neighborhood attention transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[33] [38]

Faster neighborhood attention: Reducing the o(n^2) cost of self attention at the threadblock level, 2024

Hassani, A., Hwu, W.-M., and Shi, H. Faster neighborhood attention: Reducing the o(n^2) cost of self attention at the threadblock level, 2024. URL https://arxiv.org/abs/2403.04690

work page arXiv 2024

[34] [39]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023 a

work page 2023

[35] [40]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023 b

work page 2023

[36] [41]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [42]

Rabe, M. N. and Staats, C. Self-attention does not need o(n^2) memory, 2022. URL https://arxiv.org/abs/2112.05682

work page arXiv 2022

[38] [43]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [44]

arXiv preprint arXiv:2407.08608 , year=

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608

work page arXiv 2024

[40] [45]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023

[41] [46]

Gemma 2: Improving Open Language Models at a Practical Size

Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., and et al., T. M. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [47]

torchtune: Pytorch's finetuning library, April 2024

torchtune maintainers and contributors. torchtune: Pytorch's finetuning library, April 2024. URL https//github.com/pytorch/torchtune

work page 2024

[43] [48]

N., Kaiser, L

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proc...

work page 2017

[44] [49]

Xu, W., Mei, K., Gao, H., Tan, J., Liang, Z., and Zhang, Y

Wang, G., Zeng, J., Xiao, X., Wu, S., Yang, J., Zheng, L., Chen, Z., Bian, J., Yu, D., and Wang, H. Flashmask: Efficient and rich mask extension of flashattention. arXiv preprint arXiv:2410.01359, 2024

work page arXiv 2024

[45] [50]

A multi-level superoptimizer for tensor programs, 2024

Wu, M., Cheng, X., Padon, O., and Jia, Z. A multi-level superoptimizer for tensor programs, 2024. URL https://arxiv.org/abs/2405.05751

work page arXiv 2024