Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Abhinav Jangda; Angélica Moreira; Bozhi You; Divya Mahajan; Irene Wang; Keshav Pingali; Roshan Dathathri; Zelal Su Mustafaoglu

arxiv: 2511.02043 · v4 · pith:NP2MUTJYnew · submitted 2025-11-03 · 💻 cs.LG · cs.PF

Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali

Pith reviewed 2026-05-22 11:35 UTC · model grok-4.3

classification 💻 cs.LG cs.PF

keywords: attention, PyTorch, kernel fusion, FlashAttention, compiler, large language models, attention variants

The pith

Flashlight automatically generates fused high-performance kernels for arbitrary attention computations from native PyTorch code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Flashlight as a compiler-native framework inside PyTorch that turns attention code written in standard Python into optimized, fused kernels similar to FlashAttention. It handles every variant covered by prior template-based tools plus additional data-dependent patterns that those tools cannot express. Developers therefore write their attention logic once in plain PyTorch and receive competitive or better runtime performance without manual kernel tuning or static specializations. If the approach holds, new attention designs can move from research notebook to efficient training or inference far more quickly.

Core claim

Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, automatically producing FlashAttention-style kernels for arbitrary attention-based programs without static templates or predefined kernel specializations, supporting all FlexAttention variants as well as more general data-dependent formulations while delivering competitive or superior performance.

What carries the argument

PyTorch's compilation workflow that fuses and tiles arbitrary attention computations, including data-dependent ones, without templates or specializations.
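To make "fuses and tiles" concrete, here is a minimal pure-Python sketch of the FlashAttention-style idea for a single query vector: keys and values are consumed one tile at a time while a running maximum and a running normalizer are maintained (the online softmax of Milakov and Gimelshein). This illustrates the general technique only. It is not Flashlight's generated code, and the function names and the `tile` parameter are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, K, V):
    """Reference: materialize every score for query q, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / denom
            for j in range(len(V[0]))]

def tiled_attention(q, K, V, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax normalizer relative to running_max
    acc = [0.0] * len(V[0])       # unnormalized weighted sum of values
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [dot(q, k) * scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because the rescaling step keeps the partial sums consistent, the tiled result matches the reference for any tile size, which is what allows a compiler to pick the tile size for performance alone.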

If this is right

All attention variants expressible with FlexAttention templates are supported, plus additional general data-dependent formulations.
Kernels achieve performance competitive with or better than FlexAttention on supported cases.
New attention models can be explored and executed efficiently using only native PyTorch code.
Diverse attention patterns receive automatic fusion and tiling for efficient execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Research groups could test novel attention patterns on real workloads without first writing CUDA.
The same compilation path might later apply to other fused operators beyond attention.
Existing PyTorch model codebases could adopt new attention blocks with minimal performance loss.

Load-bearing premise

PyTorch's compilation can reliably fuse and tile any attention computation, even when the pattern depends on the data itself.

What would settle it

A data-dependent attention variant that produces either incorrect outputs or substantially slower kernels under Flashlight compared with a hand-written implementation.
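A check of this kind can be sketched in a few lines. The example below uses a hypothetical data-dependent variant, not one taken from the paper: a key is dropped whenever its score falls below a threshold, so the mask depends on the values of the query and keys rather than on their positions. It compares a tiled evaluation against a reference that materializes all scores. All names here (`reference_pruned`, `tiled_pruned`, `max_abs_error`) and the convention that a fully pruned row yields zeros are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reference_pruned(q, K, V, threshold):
    """Materialize all scores, drop keys below threshold, then softmax."""
    kept = [(dot(q, k), v) for k, v in zip(K, V) if dot(q, k) >= threshold]
    dim = len(V[0])
    if not kept:
        return [0.0] * dim  # convention chosen here: empty row yields zeros
    top = max(s for s, _ in kept)
    weights = [math.exp(s - top) for s, _ in kept]
    denom = sum(weights)
    return [sum(w * v[j] for w, (_, v) in zip(weights, kept)) / denom
            for j in range(dim)]

def tiled_pruned(q, K, V, threshold, tile=2):
    """Same variant, one tile at a time with running softmax statistics."""
    dim = len(V[0])
    running_max, running_sum, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(K), tile):
        pairs = zip(K[start:start + tile], V[start:start + tile])
        kept = [(dot(q, k), v) for k, v in pairs if dot(q, k) >= threshold]
        if not kept:
            continue  # whole tile pruned: nothing to accumulate
        new_max = max(running_max, max(s for s, _ in kept))
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in kept:
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    if running_sum == 0.0:
        return [0.0] * dim
    return [a / running_sum for a in acc]

def max_abs_error(a, b):
    """Largest elementwise difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A real test of the paper's claim would run the same comparison against the compiled GPU kernel and add a runtime measurement. This sketch covers only the correctness half, on the CPU.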

Figures

Figures reproduced from arXiv: 2511.02043 by Abhinav Jangda, Angélica Moreira, Bozhi You, Divya Mahajan, Irene Wang, Keshav Pingali, Roshan Dathathri, Zelal Su Mustafaoglu.

**Figure 1.** FLASHLIGHT extends TorchInductor within the torch.compile stack, adding structural and semantic fusion passes with dimension demotion, algebraic transformation, and tiling-aware dimension elimination to generate optimized Triton kernels.

**Figure 2.** Runtimes of FLASHLIGHT and FlexAttention on H100 for attention variants that are supported by the FlexAttention template.

**Figure 3.** Runtimes of FLASHLIGHT and FlexAttention on A100 for attention variants that are supported by the FlexAttention template.

**Figure 4.** Runtimes of FLASHLIGHT and torch.compile on H100/A100 for attention variants that are not supported by FlexAttention.

Original abstract

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Flashlight tries to let PyTorch's compiler handle FlashAttention-style tiling for data-dependent attention variants that templates can't cover, but the performance edge depends on whether the compiler actually delivers without fallbacks.

Flashlight is a compiler-native approach in PyTorch that generates fused and tiled kernels for attention computations, including data-dependent patterns that go past FlexAttention's static templates. The core idea is to let users write their attention logic in ordinary Python and have torch.compile plus inductor take care of the optimizations automatically. This is a practical step because it removes the need for hand-tuned kernels or restricted template languages when experimenting with new attention variants. The paper does a reasonable job laying out why template methods hit limits on generality and how leaning on the existing compilation stack could widen the set of supported programs without extra developer effort. If the implementation works as described, it lowers the cost of trying out ideas that might improve model quality or efficiency.

The soft spots are around the data-dependent cases that form the main extension. Data-dependent indexing and control flow often defeat the static analyses needed for efficient tiling, and the paper needs to show concrete evidence that the generated kernels keep the memory and compute savings in those settings rather than falling back to generic paths. The abstract asserts competitive or better performance, so the results section should include direct comparisons on specific variants, with enough detail on the test cases and measurements to check for hidden specializations. Without that, the generality claim stays plausible but unproven.

This work is aimed at PyTorch users who build or tune attention layers in LLMs and want to move faster than writing CUDA. A reader who needs to prototype new patterns would get value from the framework if the benchmarks hold up. It has enough substance and testable claims to deserve a serious referee who can examine the code and performance data.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Flashlight, a PyTorch compiler extension that automatically generates fused, FlashAttention-style kernels for arbitrary attention programs written in native Python. It claims to support all variants expressible in FlexAttention plus more general data-dependent formulations, without static templates, while delivering competitive or superior performance to FlexAttention.

Significance. If the compiler-based approach can reliably deliver FlashAttention-level efficiency on data-dependent attention without hidden fallbacks or specialization, the work would meaningfully lower the barrier to exploring new attention mechanisms by allowing researchers to prototype in standard PyTorch rather than hand-tuned kernels.

major comments (2)

Abstract: the central performance claim ('Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention') is asserted without any benchmarks, error bars, or implementation details. This is load-bearing for the primary contribution and must be substantiated with concrete measurements.
Abstract: the claim that Flashlight handles 'more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention' is central to the novelty argument, yet the text supplies no concrete examples of such formulations nor any analysis showing that torch.compile + inductor successfully applies tiling and fusion without conservative codegen or runtime dispatch that would lose the O(N) memory savings.

minor comments (1)

Abstract: the description of the compilation workflow would be clearer if it explicitly named the PyTorch components (torch.compile, inductor) relied upon for fusion and tiling.
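The O(N) memory savings mentioned in the second major comment can be made concrete with a small back-of-the-envelope calculation. The sketch below compares the size of a fully materialized N x N score matrix with the per-row running statistics (one maximum and one normalizer per query) that a tiled kernel keeps instead. The sequence length and element sizes are assumptions chosen for illustration, not figures from the paper, and the count is per attention head and per batch element. The inputs Q, K, V and the output are needed in both cases and are left out.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to store every attention score (N x N, fp16 assumed)."""
    return seq_len * seq_len * bytes_per_element

def running_stats_bytes(seq_len, bytes_per_element=4):
    """Bytes for one running max and one running sum per query row (fp32 assumed)."""
    return 2 * seq_len * bytes_per_element

n = 16384  # assumed sequence length
print(score_matrix_bytes(n) // 2**20, "MiB for the full score matrix")   # 512 MiB
print(running_stats_bytes(n) // 2**10, "KiB for running max and sum")    # 128 KiB
```

Doubling the sequence length quadruples the first number but only doubles the second, which is why a silent fallback to an unfused path would matter so much at long context lengths.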

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on our work. We respond to each major comment in detail below, indicating where revisions will be made to the manuscript.

Referee: Abstract: the central performance claim ('Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention') is asserted without any benchmarks, error bars, or implementation details. This is load-bearing for the primary contribution and must be substantiated with concrete measurements.

Authors: We thank the referee for this observation. The performance claims in the abstract are supported by the experimental results detailed in Section 4 of the full manuscript, which includes benchmarks on various attention variants with runtime comparisons, memory profiling, and standard error bars from multiple runs. Implementation details regarding the compiler extensions and kernel generation are described in Section 3 and the supplementary material. To make the abstract self-contained and address this point, we will revise it to briefly mention the experimental setup and key findings.

revision: yes

Referee: Abstract: the claim that Flashlight handles 'more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention' is central to the novelty argument, yet the text supplies no concrete examples of such formulations nor any analysis showing that torch.compile + inductor successfully applies tiling and fusion without conservative codegen or runtime dispatch that would lose the O(N) memory savings.

Authors: We agree that providing concrete examples would strengthen the abstract's novelty claim. The manuscript elaborates on data-dependent formulations in Section 3.1, with examples including attention mechanisms where the attention weights depend on runtime data values in ways that static templates cannot capture, such as content-based dynamic pruning. Furthermore, Section 5 presents compiler analysis and empirical evidence that our extensions allow inductor to perform the necessary tiling and fusion, maintaining the memory-efficient O(N) scaling without resorting to conservative code generation or runtime dispatch. We will incorporate a concise example and a reference to this analysis in the revised abstract.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; implementation leverages external PyTorch compiler

The paper presents Flashlight as a compiler-native framework that automatically generates fused kernels by leveraging PyTorch's existing torch.compile and inductor workflow for arbitrary attention programs. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance results are empirical comparisons to FlexAttention rather than reductions to the paper's own inputs by construction. The central premise relies on the transparency of PyTorch's compilation for data-dependent cases, which is an external assumption open to independent verification rather than a self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that PyTorch's existing compilation infrastructure can be extended to perform transparent fusion and tiling for arbitrary attention patterns.

axioms (1)

Domain assumption: the PyTorch compiler can automatically fuse and tile attention computations for arbitrary patterns. Invoked when claiming transparent support without static templates or hand-tuned kernels.

pith-pipeline@v0.9.0 · 5777 in / 1053 out tokens · 36028 ms · 2026-05-22T11:35:59.896746+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 8 internal anchors

[2] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Confere... (2016)

[3] Ahdritz, G., Bouatta, N., Floristean, C., Kadyan, S., Xia, Q., Gerecke, W., O'Donnell, T. J., Berenberg, D., Fisk, I., Zanichelli, N., Zhang, B., Nowaczynski, A., Wang, B., Stepniewska-Dziubinska, M. M., Zhang, S., Ojewole, A., Guney, M. E., Biderman, S., Watkins, A. M., Ra, S., Lorenzo, P. R., Nivon, L., Weitzner, B., Ban, Y.-E. A., Sorger, P. K., Mostaq... doi:10.1101/2022.11.20.517210 (2022)

[4] Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., Hirsh, B., Huang, S., Kalambarkar, K., Kirsch, L., Lazos, M., Lezcano, M., Liang, Y., Liang, J., Lu, Y., Luk, C. K., Maher, ... doi:10.1145/3620665.3640366 (2024)

[5] Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020. URL https://arxiv.org/abs/2004.05150

[6] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, 2018

[7] Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691

[8] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022. URL https://arxiv.org/abs/2205.14135

[9] Dong, J., Feng, B., Guessous, D., Liang, Y., and He, H. Flex attention: A programming model for generating optimized attention kernels, 2024. URL https://arxiv.org/abs/2412.05496

[10] He, H., Guessous, D., Liang, Y., and Dong, J. Flexattention: The flexibility of pytorch with the performance of flashattention, Aug 2024. URL https://pytorch.org/blog/flexattention/

[11] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU Math Compiler in Python. In Stéfan van der Walt and Jarrod Millman (eds.), Proceedings of the 9th Python in Science Conference, pp. ... doi:10.25080/majora-92bf1922-003 (2010)

[12] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021. doi:10.1038/s41586-021-03819-2

[13] Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax, 2018. URL https://arxiv.org/abs/1805.02867

[14] Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

[15] Reed, J., DeVito, Z., He, H., Ussery, A., and Ansel, J. torch.fx: Practical program capture and transformation for deep learning in python. In Marculescu, D., Chi, Y., and Wu, C. (eds.), Proceedings of Machine Learning and Systems, volume 4, pp. 638–651, 2022. URL https://proceedings.mlsys.org/paper_files/paper/2022/file/7c98f9c7ab2df90911da23f9ce72ed6e...

[16] Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608

[17] Spector, B. F., Arora, S., Singhal, A., Fu, D. Y., and Ré, C. Thunderkittens: Simple, fast, and adorable ai kernels, 2024. URL https://arxiv.org/abs/2410.20399

[18] Sun, Y., Ye, T., Dong, L., Xia, Y., Chen, J., Gao, Y., Cao, S., Wang, J., and Wei, F. Rectified sparse attention, 2025. URL https://arxiv.org/abs/2506.04108

[19] Tillet, P., Kung, H. T., and Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, pp. 10–19, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367196. doi:10.1145/3315508.3329973

[20] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedi...

[21] Wu, M., Cheng, X., Liu, S., Shi, C., Ji, J., Ao, M. K., Velliengiri, P., Miao, X., Padon, O., and Jia, Z. Mirage: A Multi-Level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pp. 21–38, 2025

[22] Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. Differential transformer, 2024. URL https://arxiv.org/abs/2410.05258

[23] Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., and Ceze, L. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URL https://arxiv.org/abs/2501.01005
