arxiv: 2511.02043 · v4 · pith:NP2MUTJYnew · submitted 2025-11-03 · 💻 cs.LG · cs.PF

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Pith reviewed 2026-05-22 11:35 UTC · model grok-4.3

classification 💻 cs.LG cs.PF
keywords attention · PyTorch · kernel fusion · FlashAttention · compiler · large language models · attention variants

The pith

Flashlight automatically generates fused high-performance kernels for arbitrary attention computations from native PyTorch code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Flashlight as a compiler-native framework inside PyTorch that turns attention code written in standard Python into optimized, fused kernels similar to FlashAttention. It handles every variant covered by prior template-based tools plus additional data-dependent patterns that those tools cannot express. Developers therefore write their attention logic once in plain PyTorch and receive competitive or better runtime performance without manual kernel tuning or static specializations. If the approach holds, new attention designs can move from research notebook to efficient training or inference far more quickly.

Core claim

Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, automatically producing FlashAttention-style kernels for arbitrary attention-based programs without static templates or predefined kernel specializations, supporting all FlexAttention variants as well as more general data-dependent formulations while delivering competitive or superior performance.

What carries the argument

PyTorch's compilation workflow that fuses and tiles arbitrary attention computations, including data-dependent ones, without templates or specializations.
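
To make "fuses and tiles" concrete, here is a minimal sketch of the FlashAttention-style idea in pure Python: keys and values are consumed block by block while only a running maximum, a running softmax denominator, and a running output are kept per query row. This illustrates the general online-softmax technique, not the code Flashlight generates. The function names and the `block` parameter are assumptions made for this example, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math

def naive_attention(q, k, v):
    """Reference: build each full score row, apply softmax, then weight V."""
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block=2):
    """Consume K/V in blocks, keeping only running statistics per query row."""
    dv = len(v[0])
    out = []
    for qi in q:
        m = -math.inf        # running maximum of the scores seen so far
        z = 0.0              # running softmax denominator
        acc = [0.0] * dv     # running unnormalized output
        for start in range(0, len(k), block):
            kb = k[start:start + block]
            vb = v[start:start + block]
            s = [sum(a * b for a, b in zip(qi, kj)) for kj in kb]
            m_new = max(m, max(s))
            scale = math.exp(m - m_new)  # rescales earlier partial results
            p = [math.exp(x - m_new) for x in s]
            z = z * scale + sum(p)
            acc = [a * scale + sum(pj * vj[d] for pj, vj in zip(p, vb))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / z for a in acc])
    return out
```

Both functions compute the same result up to floating-point rounding. The tiled version never holds more than `block` scores per row at once, which is the property a fused kernel relies on to avoid materializing the full score matrix.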

If this is right

  • All attention variants expressible with FlexAttention templates are supported, plus additional general data-dependent formulations.
  • Kernels achieve performance competitive with or better than FlexAttention on supported cases.
  • New attention models can be explored and executed efficiently using only native PyTorch code.
  • Diverse attention patterns receive automatic fusion and tiling for efficient execution.
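
As a rough picture of what "variants" means in the list above, many of them can be written as a small function that edits each attention score from its value and its query/key positions. The sketch below uses pure Python so it runs without dependencies. The three-argument `score_mod(score, q_idx, kv_idx)` signature is a simplification assumed for this example and is not the exact FlexAttention or Flashlight interface; `causal` and `alibi_style` are likewise illustrative names.

```python
import math

def attention(q, k, v, score_mod=None):
    """Plain attention; score_mod(score, q_idx, kv_idx) edits each score."""
    out = []
    for i, qi in enumerate(q):
        s = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        if score_mod is not None:
            s = [score_mod(x, i, j) for j, x in enumerate(s)]
        m = max(s)  # assumes at least one key per row is not masked out
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    """Causal masking: a query may only attend to keys at or before it."""
    return score if kv_idx <= q_idx else -math.inf

def alibi_style(slope):
    """ALiBi-style linear bias: penalize scores by query/key distance."""
    def mod(score, q_idx, kv_idx):
        return score - slope * abs(q_idx - kv_idx)
    return mod
```

For example, `attention(q, k, v, score_mod=causal)` yields causal attention, and `attention(q, k, v, score_mod=alibi_style(0.5))` biases each query toward nearby keys. Both modifications depend only on positions and the individual score, which is the kind of pattern a template can capture.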

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Research groups could test novel attention patterns on real workloads without first writing CUDA.
  • The same compilation path might later apply to other fused operators beyond attention.
  • Existing PyTorch model codebases could adopt new attention blocks with minimal performance loss.

Load-bearing premise

PyTorch's compilation can reliably fuse and tile any attention computation, even when the pattern depends on the data itself.

What would settle it

A data-dependent attention variant that produces either incorrect outputs or substantially slower kernels under Flashlight compared with a hand-written implementation.

Figures

Figures reproduced from arXiv: 2511.02043 by Abhinav Jangda, Angélica Moreira, Bozhi You, Divya Mahajan, Irene Wang, Keshav Pingali, Roshan Dathathri, Zelal Su Mustafaoglu.

Figure 1: FLASHLIGHT extends TorchInductor within the torch.compile stack, adding structural and semantic fusion passes with dimension demotion, algebraic transformation, and tiling-aware dimension elimination to generate optimized Triton kernels. Torch 2.0 compiler stack (Ansel et al., 2024), used via the torch.compile() API, addresses this limitation by using two key components: TorchDynamo, a Python-level graph …
Figure 2: Runtimes of FLASHLIGHT and FlexAttention on H100 for attention variants that are supported by FlexAttention template
Figure 3: Runtimes of FLASHLIGHT and FlexAttention on A100 for attention variants that are supported by FlexAttention template
Figure 4: Runtimes of FLASHLIGHT and torch.compile on H100/A100 for attention variants that are not supported by FlexAttention
Original abstract

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Flashlight, a PyTorch compiler extension that automatically generates fused, FlashAttention-style kernels for arbitrary attention programs written in native Python. It claims to support all variants expressible in FlexAttention plus more general data-dependent formulations, without static templates, while delivering competitive or superior performance to FlexAttention.

Significance. If the compiler-based approach can reliably deliver FlashAttention-level efficiency on data-dependent attention without hidden fallbacks or specialization, the work would meaningfully lower the barrier to exploring new attention mechanisms by allowing researchers to prototype in standard PyTorch rather than hand-tuned kernels.

major comments (2)
  1. Abstract: the central performance claim ('Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention') is asserted without any benchmarks, error bars, or implementation details. This is load-bearing for the primary contribution and must be substantiated with concrete measurements.
  2. Abstract: the claim that Flashlight handles 'more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention' is central to the novelty argument, yet the text supplies no concrete examples of such formulations nor any analysis showing that torch.compile + inductor successfully applies tiling and fusion without conservative codegen or runtime dispatch that would lose the O(N) memory savings.
minor comments (1)
  1. Abstract: the description of the compilation workflow would be clearer if it explicitly named the PyTorch components (torch.compile, inductor) relied upon for fusion and tiling.
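
The memory stakes behind major comment 2 can be put in numbers with a short back-of-the-envelope calculation. The sketch below compares the bytes needed to materialize a full N×N score matrix for one attention head against the per-row running statistics plus one score tile that a tiled kernel keeps. The 128×128 tile, fp16 scores, and fp32 statistics are assumptions chosen for illustration and are not taken from the paper. The Q, K, V inputs and the output are O(N) in both cases and are not counted.

```python
def score_matrix_bytes(n, bytes_per_elem=2):
    """Bytes to store the full n x n score matrix (fp16 assumed)."""
    return n * n * bytes_per_elem

def tiled_state_bytes(n, block_q=128, block_k=128, bytes_per_stat=4):
    """Bytes for per-row running max and sum, plus one score tile (fp32 assumed)."""
    stats = 2 * n * bytes_per_stat               # one max and one sum per query row
    tile = block_q * block_k * bytes_per_stat    # a single tile of scores in flight
    return stats + tile

n = 8192
full = score_matrix_bytes(n)     # 134217728 bytes = 128 MiB
tiled = tiled_state_bytes(n)     # 131072 bytes = 128 KiB
```

Under these assumptions the gap at N = 8192 is a factor of 1024, and doubling N quadruples the first figure while adding only 64 KiB to the second. A fallback that silently materializes the score matrix would therefore erase the benefit the referee is asking the authors to demonstrate.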

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on our work. We respond to each major comment in detail below, indicating where revisions will be made to the manuscript.

  1. Referee: Abstract: the central performance claim ('Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention') is asserted without any benchmarks, error bars, or implementation details. This is load-bearing for the primary contribution and must be substantiated with concrete measurements.

    Authors: We thank the referee for this observation. The performance claims in the abstract are supported by the experimental results detailed in Section 4 of the full manuscript, which includes benchmarks on various attention variants with runtime comparisons, memory profiling, and standard error bars from multiple runs. Implementation details regarding the compiler extensions and kernel generation are described in Section 3 and the supplementary material. To make the abstract self-contained and address this point, we will revise it to briefly mention the experimental setup and key findings. revision: yes

  2. Referee: Abstract: the claim that Flashlight handles 'more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention' is central to the novelty argument, yet the text supplies no concrete examples of such formulations nor any analysis showing that torch.compile + inductor successfully applies tiling and fusion without conservative codegen or runtime dispatch that would lose the O(N) memory savings.

    Authors: We agree that providing concrete examples would strengthen the abstract's novelty claim. The manuscript elaborates on data-dependent formulations in Section 3.1, with examples including attention mechanisms where the attention weights depend on runtime data values in ways not capturable by static templates, such as content-based dynamic pruning. Furthermore, Section 5 presents compiler analysis and empirical evidence that our extensions allow inductor to perform the necessary tiling and fusion, maintaining the memory-efficient O(N) scaling without resorting to conservative code generation or runtime dispatches. We will incorporate a concise example and reference to this analysis in the revised abstract. revision: yes
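
For readers unfamiliar with the term, the following is a hypothetical illustration of what a data-dependent formulation such as "content-based dynamic pruning" could look like: the set of keys each query attends to is decided from the score values themselves rather than from positions. The function name, the `margin` parameter, and the pruning rule are assumptions made for this sketch. The page does not confirm that this matches the example in the paper's Section 3.1.

```python
import math

def pruned_attention(q, k, v, margin=2.0):
    """Attend only to keys whose score is within `margin` of the row's best score."""
    out, kept = [], []
    for qi in q:
        s = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        best = max(s)
        # The mask is computed from the scores, so it is only known at run time.
        keep = [x >= best - margin for x in s]
        w = [math.exp(x - best) if flag else 0.0 for x, flag in zip(s, keep)]
        z = sum(w)  # never zero: the best-scoring key is always kept
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
        kept.append(keep)
    return out, kept
```

Because the mask depends on a row-wide statistic of the scores, it cannot be written down before the scores are computed, which is one way to read the referee's concern about whether tiling and fusion survive such patterns.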

Circularity Check

0 steps flagged

No significant circularity; implementation leverages external PyTorch compiler

The paper presents Flashlight as a compiler-native framework that automatically generates fused kernels by leveraging PyTorch's existing torch.compile and inductor workflow for arbitrary attention programs. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance results are empirical comparisons to FlexAttention rather than reductions to the paper's own inputs by construction. The central premise relies on the transparency of PyTorch's compilation for data-dependent cases, which is an external assumption open to independent verification rather than a self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that PyTorch's existing compilation infrastructure can be extended to perform transparent fusion and tiling for arbitrary attention patterns.

axioms (1)
  • domain assumption: PyTorch compiler can automatically fuse and tile attention computations for arbitrary patterns
    Invoked when claiming transparent support without static templates or hand-tuned kernels.

pith-pipeline@v0.9.0 · 5777 in / 1053 out tokens · 36028 ms · 2026-05-22T11:35:59.896746+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 8 internal anchors

  2. [2]

    TensorFlow: A system for large-scale machine learning

    Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Confere...

  3. [3]

    Ahdritz, G., Bouatta, N., Floristean, C., Kadyan, S., Xia, Q., Gerecke, W., O Donnell, T. J., Berenberg, D., Fisk, I., Zanichelli, N., Zhang, B., Nowaczynski, A., Wang, B., Stepniewska-Dziubinska, M. M., Zhang, S., Ojewole, A., Guney, M. E., Biderman, S., Watkins, A. M., Ra, S., Lorenzo, P. R., Nivon, L., Weitzner, B., Ban, Y.-E. A., Sorger, P. K., Mostaq...

  4. [4]

    Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., Hirsh, B., Huang, S., Kalambarkar, K., Kirsch, L., Lazos, M., Lezcano, M., Liang, Y., Liang, J., Lu, Y., Luk, C. K., Maher, ...

  5. [5]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020. URL https://arxiv.org/abs/2004.05150

  6. [6]

    TVM: An automated End-to-End optimizing compiler for deep learning

    Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594, 2018

  7. [7]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691

  8. [8]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022. URL https://arxiv.org/abs/2205.14135

  9. [9]

    Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    Dong, J., Feng, B., Guessous, D., Liang, Y., and He, H. Flex attention: A programming model for generating optimized attention kernels, 2024. URL https://arxiv.org/abs/2412.05496

  10. [10]

    Flexattention: The flexibility of pytorch with the performance of flashattention, Aug 2024

    He, H., Guessous, D., Liang, Y., and Dong, J. Flexattention: The flexibility of pytorch with the performance of flashattention, Aug 2024. URL https://pytorch.org/blog/flexattention/

  11. [11]

    Theano: A CPU and GPU Math Compiler in Python

    James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU Math Compiler in Python. In Stéfan van der Walt and Jarrod Millman (eds.), Proceedings of the 9th Python in Science Conference, pp. ...

  12. [12]

    Highly accurate protein structure prediction with AlphaFold

    Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583-589, 2021. doi:10.1038/s41586-021-03819-2

  13. [13]

    Online normalizer calculation for softmax

    Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax, 2018. URL https://arxiv.org/abs/1805.02867

  14. [14]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

  15. [15]

    torch.fx: Practical program capture and transformation for deep learning in python

    Reed, J., DeVito, Z., He, H., Ussery, A., and Ansel, J. torch.fx: Practical program capture and transformation for deep learning in python. In Marculescu, D., Chi, Y., and Wu, C. (eds.), Proceedings of Machine Learning and Systems, volume 4, pp. 638-651, 2022. URL https://proceedings.mlsys.org/paper_files/paper/2022/file/7c98f9c7ab2df90911da23f9ce72ed6e...

  16. [16]

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608

  17. [17]

    Thunderkittens: Simple, fast, and adorable ai kernels

    Spector, B. F., Arora, S., Singhal, A., Fu, D. Y., and Ré, C. Thunderkittens: Simple, fast, and adorable ai kernels, 2024. URL https://arxiv.org/abs/2410.20399

  18. [18]

    Rectified sparse attention, 2025

    Sun, Y., Ye, T., Dong, L., Xia, Y., Chen, J., Gao, Y., Cao, S., Wang, J., and Wei, F. Rectified sparse attention, 2025. URL https://arxiv.org/abs/2506.04108

  19. [19]

    Triton: an intermediate language and compiler for tiled neural network computations

    Tillet, P., Kung, H. T., and Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, pp. 10–19, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367196. doi:10.1145/3315508.33...

  20. [20]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedi...

  21. [21]

    Mirage: A Multi-Level superoptimizer for tensor programs

    Wu, M., Cheng, X., Liu, S., Shi, C., Ji, J., Ao, M. K., Velliengiri, P., Miao, X., Padon, O., and Jia, Z. Mirage: A Multi-Level superoptimizer for tensor programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pp. 21-38, 2025

  22. [22]

    Differential transformer, 2024

    Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. Differential transformer, 2024. URL https://arxiv.org/abs/2410.05258

  23. [23]

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

    Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., and Ceze, L. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URL https://arxiv.org/abs/2501.01005