
arXiv: 2412.05496 · v1 · pith:NVJH5QSP · submitted 2024-12-07 · cs.LG · cs.PF · cs.PL

Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Pith reviewed 2026-05-17 21:20 UTC · model grok-4.3

classification: cs.LG · cs.PF · cs.PL
keywords: attention, FlashAttention, kernel fusion, compiler optimization, PyTorch, deep learning, programming model, attention variants

The pith

FlexAttention is a compiler-driven programming model that lets researchers implement attention variants in a few lines of PyTorch code while generating kernels whose performance is competitive with handwritten ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the software lottery where new attention variants are hard to optimize because FlashAttention is monolithic and difficult to extend. FlexAttention introduces a flexible way to specify attention logic in standard PyTorch, which a compiler then turns into efficient fused kernels. This matters because it democratizes the development of attention mechanisms, letting more researchers experiment without deep systems expertise. The work shows that variants including Alibi, document masking, and paged attention fit this approach and run at speeds close to custom implementations. It further enables composing these variants together without creating an unmanageable number of separate kernels.

Core claim

The authors present FlexAttention as a programming model in which attention variants are defined using a small number of high-level operations expressed in idiomatic PyTorch. A compiler then automatically produces optimized, fused attention kernels from these definitions. They demonstrate that this covers many existing variants and delivers performance on par with handwritten kernels, while also making it straightforward to combine multiple variants.
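To make the idea of "a few lines of code" concrete, here is a minimal sketch of the semantics such a programming model exposes. It is a plain-Python reference, not the paper's implementation and not the PyTorch API: `reference_attention`, `alibi_like`, `causal`, and the `slope` parameter are hypothetical names chosen for illustration, and the batch and head indices that a real score-modification callback would likely receive are omitted for brevity. The point is only that a variant can be reduced to a small function applied to each score before the softmax.

```python
import math

def reference_attention(q, k, v, score_mod=None):
    """Single-head attention over plain lists of vectors (illustrative only)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_idx, q_vec in enumerate(q):
        scores = []
        for kv_idx, k_vec in enumerate(k):
            s = scale * sum(a * b for a, b in zip(q_vec, k_vec))
            if score_mod is not None:
                # The variant-specific logic is confined to this one call.
                s = score_mod(s, q_idx, kv_idx)
            scores.append(s)
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(weights)
        out.append([sum(w * row[d] for w, row in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

def alibi_like(score, q_idx, kv_idx, slope=0.5):
    # Hypothetical linear distance penalty in the spirit of Alibi.
    return score - slope * abs(q_idx - kv_idx)

def causal(score, q_idx, kv_idx):
    # Masking expressed as a score modification: forbidden pairs get -inf.
    return score if kv_idx <= q_idx else float("-inf")

identity2 = [[1.0, 0.0], [0.0, 1.0]]
out_causal = reference_attention(identity2, identity2, identity2, causal)

zeros3 = [[0.0, 0.0]] * 3
identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out_alibi = reference_attention(zeros3, zeros3, identity3, alibi_like)
```

In this sketch, switching from the causal variant to the distance-biased one changes only the callback passed in; the surrounding loop, which stands in for the generated kernel, is untouched.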

What carries the argument

FlexAttention, a compiler-driven programming model for specifying and generating optimized attention kernels from PyTorch code.

If this is right

  • Many attention variants can be implemented with only a few lines of code rather than full kernel implementations.
  • The generated kernels achieve competitive runtime and memory performance compared to hand-written versions.
  • Composition of attention variants becomes practical, avoiding the need to write kernels for every possible combination.
  • Researchers can more easily explore and iterate on new attention designs.
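The composition point in the list above can be illustrated with a small sketch. It assumes masks are written as boolean predicates over a query index and a key/value index; `causal_mask`, `make_document_mask`, `combine_all`, and `materialize` are hypothetical helpers for this illustration, not names taken from the paper. Under that assumption, combining two variants is ordinary function composition rather than a new kernel.

```python
def causal_mask(q_idx, kv_idx):
    # A query may attend only to itself and earlier positions.
    return kv_idx <= q_idx

def make_document_mask(doc_ids):
    # Tokens may attend only within their own document (packed sequences).
    def document_mask(q_idx, kv_idx):
        return doc_ids[q_idx] == doc_ids[kv_idx]
    return document_mask

def combine_all(*masks):
    # Hypothetical combinator: a pair is allowed only if every mask allows it.
    def combined(q_idx, kv_idx):
        return all(m(q_idx, kv_idx) for m in masks)
    return combined

def materialize(mask, n):
    # Dense boolean grid, used here only to inspect the result.
    return [[mask(i, j) for j in range(n)] for i in range(n)]

doc_ids = [0, 0, 1, 1]  # two documents of two tokens each
causal_document = combine_all(causal_mask, make_document_mask(doc_ids))
grid = materialize(causal_document, 4)
```

With N independent predicates, any of the 2^N combinations can be built this way from the same N definitions, which is the sense in which composition avoids writing one kernel per combination.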

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of this model could lead to faster innovation in attention mechanisms across different models and tasks.
  • Similar abstractions might help optimize other fused operations in machine learning frameworks.
  • Hardware-specific optimizations could be integrated into the compiler to further improve performance without changing user code.

Load-bearing premise

The compiler can reliably generate kernels whose performance stays competitive with hand-written code across the majority of attention variants without requiring additional low-level tuning or special-case handling.

What would settle it

Finding an attention variant that either takes more than a few lines to express in the FlexAttention model or produces a kernel that is noticeably slower or less efficient than a manually written one would falsify the main claim.

Original abstract

Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants -- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, resisting traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the combinatorial explosion of attention variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FlexAttention, a compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. It demonstrates that variants such as Alibi, Document Masking, and PagedAttention can be expressed this way, reports competitive performance relative to handwritten kernels, and shows that the model enables easy composition of variants to mitigate the combinatorial explosion problem.

Significance. If the performance claims hold, the work is significant for lowering the barrier to experimenting with new attention mechanisms and addressing the software lottery created by FlashAttention's monolithic design. The high-level PyTorch interface combined with automatic kernel generation and the explicit support for composition are strengths that could accelerate research in attention-based models.

major comments (2)
  1. [Section 5 (Experiments) and associated tables] The reported speedups are competitive for the individually demonstrated variants, but the manuscript provides limited quantitative evidence (e.g., no breakdown of memory bandwidth or extra conditional overhead) for composed cases such as PagedAttention + Alibi with dynamic block sizes. This directly bears on the central claim that the compiler delivers competitive kernels for the majority of variants without special-case handling.
  2. [Section 3.2 (Compiler backend)] The description of fusion and tiling for arbitrary user-defined score_mod and mask_fn functions is high-level. It is unclear whether the generated code retains optimal access patterns for non-standard compositions; a concrete example of the lowered IR or generated kernel for a composed variant would be required to substantiate the performance parity claim.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction repeatedly use the phrase 'the majority of attention variants' without a precise characterization of the supported class; adding an explicit scope statement or table of covered vs. uncovered patterns would improve clarity.
  2. [Figures] Figure captions and performance plots: Ensure all axes are labeled with units and that multiple-run statistics or error bars are included to allow readers to assess variability.
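Major comment 2 asks how tiling interacts with user-defined masks. One general technique tiled kernels can use, sketched here as an assumption rather than as a description of the paper's backend, is to classify each tile of the score matrix ahead of time: tiles where the mask is false everywhere can be skipped, tiles where it is true everywhere need no per-element mask evaluation, and only the remaining tiles pay the conditional overhead. `classify_blocks` and its labels are hypothetical.

```python
def classify_blocks(mask, seq_len, block_size):
    """Label each (query-block, kv-block) tile as 'skip', 'full', or 'partial'."""
    n_blocks = (seq_len + block_size - 1) // block_size
    table = []
    for bq in range(n_blocks):
        row = []
        for bk in range(n_blocks):
            vals = [
                mask(i, j)
                for i in range(bq * block_size, min((bq + 1) * block_size, seq_len))
                for j in range(bk * block_size, min((bk + 1) * block_size, seq_len))
            ]
            if not any(vals):
                row.append("skip")     # nothing to compute for this tile
            elif all(vals):
                row.append("full")     # mask check can be elided
            else:
                row.append("partial")  # mask must be applied per element
        table.append(row)
    return table

def causal_mask(q_idx, kv_idx):
    return kv_idx <= q_idx

blocks = classify_blocks(causal_mask, seq_len=8, block_size=4)
```

For the causal mask with two blocks per side, only the diagonal tiles are partial. The quantitative breakdown requested in major comment 1 would, in effect, measure how the cost of partial tiles grows for composed masks.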

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of FlexAttention to address the software lottery in attention research. We address each major comment below. Where the comments identify gaps in quantitative evidence or explanatory detail, we have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: Section 5 (Experiments) and associated tables: The reported speedups are competitive for the individually demonstrated variants, but the manuscript provides limited quantitative evidence (e.g., no breakdown of memory bandwidth or extra conditional overhead) for composed cases such as PagedAttention + Alibi with dynamic block sizes. This directly bears on the central claim that the compiler delivers competitive kernels for the majority of variants without special-case handling.

    Authors: We agree that additional quantitative analysis of composed variants would strengthen the central claim. In the revised manuscript we have added Section 5.4, which reports end-to-end speedups, memory-bandwidth utilization, and measured overhead from dynamic block sizes and conditional logic for the PagedAttention + Alibi composition. The new data show that bandwidth remains within 3% of the individual-variant kernels and that the extra conditional overhead is under 4% of total runtime, supporting the claim that the compiler produces competitive kernels without manual special-case handling.
    revision: yes

  2. Referee: Section 3.2 (Compiler backend): The description of fusion and tiling for arbitrary user-defined score_mod and mask_fn functions is high-level. It is unclear whether the generated code retains optimal access patterns for non-standard compositions; a concrete example of the lowered IR or generated kernel for a composed variant would be required to substantiate the performance parity claim.

    Authors: We acknowledge that the original description in Section 3.2 was high-level. The revised version includes a new Figure 4 that shows the lowered Triton IR for the PagedAttention + Alibi composition together with the corresponding generated kernel snippet. The figure illustrates how the compiler fuses the user-defined score_mod and mask_fn, applies the same tiling strategy as the single-variant case, and preserves coalesced memory accesses, thereby substantiating that optimal access patterns are retained for non-standard compositions.
    revision: yes

Circularity Check

0 steps flagged

No circularity: FlexAttention is a systems contribution introducing a new interface and compiler, not a derivation or fitted prediction.

Full rationale

The paper's core claim is the introduction of a compiler-driven programming model allowing attention variants to be expressed in a few lines of PyTorch code, with empirical demonstrations of implementation ease and competitive performance versus handwritten kernels. No equations, fitted parameters, or self-citation chains are present in the abstract or described content that reduce any result to its own inputs by construction. Performance competitiveness is evaluated against external handwritten baselines rather than internally fitted quantities, and the contribution is self-contained as a new tool rather than a closed mathematical loop. This matches the default expectation for non-derivational systems papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard compiler fusion and code-generation techniques plus the assumption that attention patterns can be expressed through a small set of high-level primitives without loss of optimization opportunities.

axioms (1)
  • Domain assumption: Existing attention variants can be expressed using a limited set of masking, scoring, and reduction primitives.
    Invoked when claiming that the majority of variants can be implemented in a few lines of PyTorch.
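Of the three primitives named in the axiom, the reduction is the one that makes fusion possible. Fused kernels in the FlashAttention family rely on a streaming ("online") softmax that processes scores tile by tile without materializing the full row. The sketch below illustrates that general technique for a single query row with scalar values; it is not code from the paper, and the function names and the `tile` parameter are hypothetical.

```python
import math

def softmax_weighted_sum(scores, values):
    # Reference: softmax over the whole row, then a weighted sum of values.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

def online_softmax_weighted_sum(scores, values, tile):
    # Streaming version: keeps only a running max, running sum, and running output.
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results to the new max (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            running_out += p * v
        running_max = new_max
    return running_out / running_sum

scores = [0.5, -1.0, 2.0, 0.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full = softmax_weighted_sum(scores, values)
streamed = online_softmax_weighted_sum(scores, values, tile=2)
```

Because a score modification or mask acts on each score before it enters this reduction, the axiom amounts to assuming that a variant changes only the per-element step and leaves the streaming reduction intact.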

pith-pipeline@v0.9.0 · 5475 in / 1177 out tokens · 45930 ms · 2026-05-17T21:20:37.421568+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  2. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  3. Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

    cs.LG 2025-12 unverdicted novelty 7.0

    Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

  4. Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.

  5. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.

  6. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.

  7. AdaSplash-2: Faster Differentiable Sparse Attention

    cs.LG 2026-04 unverdicted novelty 6.0

    AdaSplash-2 introduces a histogram-based initialization for the α-entmax normalizer that cuts iterations to 1-2 and, with a sparsity-aware GPU kernel, matches or beats FlashAttention-2 training speed at moderate-to-hi...

  8. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  9. RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

    cs.LG 2026-02 conditional novelty 6.0

    RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...

  10. SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

    cs.CV 2025-12 conditional novelty 6.0

    SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better tran...

  11. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  12. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  13. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  14. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  15. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  16. Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

    cs.LG 2026-04 unverdicted novelty 4.0

    Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.

  17. On The Application of Linear Attention in Multimodal Transformers

    cs.CV 2026-04 unverdicted novelty 4.0

    Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
