pith. sign in

arxiv: 2506.06095 · v4 · pith:JRGJDV2Nnew · submitted 2025-06-06 · 💻 cs.LG

Accelerating Sparse Transformer Inference on GPU

Pith reviewed 2026-05-22 01:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse transformerGPU kernel optimizationoperator fusionmulti-head attentioninference accelerationcompilation templatesanalytical modeling
0
0 comments X

The pith

STOF accelerates sparse Transformer inference on GPUs by mapping attention to row-wise or blockwise kernels and searching fusion templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STOF as a framework to optimize inference for sparse Transformers on GPUs, where mask layers reduce computations but require careful handling for performance. It uses analytical modeling to decide between row-wise and blockwise kernel mappings for multi-head attention, each with tailored storage formats. For other operators it applies two-stage search over compilation templates to select fusion schemes that fit the scenario. This combination supports flexible masking while delivering measured speedups. Readers care because LLMs rely on Transformers and sparsity offers efficiency gains only if the GPU execution is tuned to the sparse pattern.

Core claim

STOF is a framework that incorporates optimizations for Sparse Transformer that enables flexible masking and Operator Fusion on GPU. For multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

What carries the argument

Analytical modeling to select row-wise versus blockwise kernels for MHA plus two-stage search over compilation templates for downstream operator fusion.

If this is right

  • MHA computation in sparse Transformers runs faster when the kernel layout matches the sparsity pattern through modeling.
  • End-to-end inference time decreases when fusion schemes are chosen by searching compilation templates rather than using static rules.
  • Flexible masking becomes practical on GPUs because the framework adapts storage and execution to the mask without manual rewriting.
  • Speedups of 1.6x in attention and 1.4x overall hold when the two-stage search converges on good templates for the target hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modeling and search approach could be applied to other sparse operators beyond attention to compound gains in full model inference.
  • Extending the analytical model to predict energy use in addition to latency would help deployment decisions for large-scale LLM serving.
  • Testing the framework on emerging GPU architectures would reveal whether the row-versus-block decision rules remain stable or need recalibration.

Load-bearing premise

Analytical modeling of GPU kernel performance for row-wise versus blockwise mappings, together with two-stage template search, will reliably identify optimal configurations across diverse sparse masks and hardware.

What would settle it

Running STOF on a new sparse mask pattern or different GPU model and finding that a hand-tuned or alternative kernel choice runs faster than the automatically selected configuration.

Figures

Figures reproduced from arXiv: 2506.06095 by Fangxin Liu, Hailong Yang, Haodong Deng, Hongyu Liu, Mengfei Rong, Qianwen Cao, Qingxiao Sun, Wenhao Dai, Xinyu Yang.

Figure 2
Figure 2. Figure 2: Kernel fusion for MHA computation. Early works focus on the manual fusion of dense attention with￾out the mask layer. TurboTransformer [19] processes element-wise operations in embarrassingly parallel. ByteTransformer [65] imple￾ments a set of hand-written kernels. For short sequences, the inter￾mediate matrix is completely held in shared memory (SMEM) and registers. For relatively long sequences, the grou… view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparison of detached operators and fused operator under different configurations. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance comparison of fused operators using parameter settings from individual tuning and post-fusion tuning. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The design overview of STOF [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Block-wise computation with sparse storage format. [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Bank conflict-free wmma warp scheduling. [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The workflow of fusion scheme converter. [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The workflow of hierarchical search engine. [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The MHA performance of the methods normalized to that of PyTorch Native on NVIDIA RTX 4090 GPU. [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The MHA performance of the methods normalized to that of PyTorch Native on NVIDIA A100 GPU. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The end-to-end performance of the methods normalized to that of PyTorch Native on RTX 4090 and A100 GPUs. [PITH_FULL_IMAGE:figures/full_fig_p009_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Time breakdown of the STOF overhead normalized [PITH_FULL_IMAGE:figures/full_fig_p010_14.png] view at source ↗
Figure 13
Figure 13. Figure 13: The speedup of STOF with only MHA module or [PITH_FULL_IMAGE:figures/full_fig_p010_13.png] view at source ↗
read the original abstract

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer that enables flexible masking and Operator Fusion on GPU. For multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that compared to the stateof-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes STOF, a framework for accelerating sparse Transformer inference on GPUs. It enables flexible masking and operator fusion by mapping MHA computations to row-wise or blockwise kernels (with tailored storage formats) via analytical performance modeling, and by mapping downstream operator fusion to compilation templates whose optimal configurations are selected via two-stage search. Experiments are reported to yield maximum speedups of 1.6x in MHA and 1.4x end-to-end versus prior state-of-the-art work.

Significance. If the modeling and search reliably generalize, the work could deliver practical speedups for sparse Transformer deployments on GPUs. The combination of analytical modeling with lightweight search is a potentially efficient alternative to exhaustive autotuning or purely static fusion, provided the predictions hold across varied sparsity patterns and hardware.

major comments (2)
  1. The central claim that analytical modeling of row-wise versus blockwise kernel performance, together with two-stage search, identifies configurations that generalize to diverse sparse masks and hardware (abstract and STOF design description) is load-bearing for the adaptability and speedup assertions. The manuscript provides no explicit validation of the modeling equations or search procedure on held-out mask families, alternate densities, or different GPU architectures; without such evidence the reported 1.6x/1.4x gains risk being instance-specific rather than framework-driven.
  2. Experimental results section: the maximum speedups are stated without accompanying details on sparsity densities, mask generation methods, number of runs, error bars, or the precise set of baselines and datasets. This absence prevents assessment of whether the gains are robust or sensitive to post-hoc configuration choices.
minor comments (2)
  1. Abstract: 'stateof-the-art' is missing a hyphen.
  2. Notation for the unique storage formats associated with row-wise and blockwise mappings should be introduced explicitly when first used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We provide detailed responses to each major comment below and indicate the revisions we plan to make to address the concerns raised.

read point-by-point responses
  1. Referee: The central claim that analytical modeling of row-wise versus blockwise kernel performance, together with two-stage search, identifies configurations that generalize to diverse sparse masks and hardware (abstract and STOF design description) is load-bearing for the adaptability and speedup assertions. The manuscript provides no explicit validation of the modeling equations or search procedure on held-out mask families, alternate densities, or different GPU architectures; without such evidence the reported 1.6x/1.4x gains risk being instance-specific rather than framework-driven.

    Authors: We recognize that the manuscript does not present explicit validation on held-out data or different hardware. The modeling equations are derived from analytical performance models based on GPU memory hierarchy and arithmetic intensity, which are intended to be general. However, to fully substantiate the generalization claim, we will add experiments validating the model predictions on additional mask families and report results on a second GPU architecture in the revised manuscript. We have also clarified in the design section how the two-stage search adapts to new configurations. revision: yes

  2. Referee: Experimental results section: the maximum speedups are stated without accompanying details on sparsity densities, mask generation methods, number of runs, error bars, or the precise set of baselines and datasets. This absence prevents assessment of whether the gains are robust or sensitive to post-hoc configuration choices.

    Authors: We agree with the referee that additional experimental details are necessary. The revised manuscript now includes: sparsity densities used in experiments (e.g., 25%, 50%, 75% sparsity), mask generation methods (random masking and block-sparse patterns), number of runs (10 runs per configuration with mean and standard deviation reported), error bars in all figures, the full list of baselines with citations, and the specific datasets for end-to-end inference (e.g., language modeling tasks). These additions will allow readers to better evaluate the robustness of the reported speedups. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering framework with independent experimental validation

full rationale

The paper describes STOF as a GPU optimization framework that applies analytical modeling to select row-wise or blockwise kernels for MHA and uses two-stage search over compilation templates for operator fusion. No equations, fitted parameters presented as predictions, or self-citation chains are invoked to derive the reported speedups. The 1.6x MHA and 1.4x end-to-end gains are measured against external SOTA baselines on concrete hardware and masks, making the contribution self-contained rather than tautological. This is a standard engineering result with no load-bearing derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the framework rests on domain assumptions about GPU kernel behavior and sparsity patterns; no free parameters or invented entities are explicitly introduced in the provided text.

axioms (2)
  • domain assumption Analytical modeling can accurately predict and select between row-wise and blockwise kernel performance for sparse MHA on GPU hardware.
    Invoked when mapping computations to kernels according to analytical modeling.
  • domain assumption Two-stage search over compilation templates will identify near-optimal fusion configurations for downstream operators.
    Used to determine the optimal running configuration.

pith-pipeline@v0.9.0 · 5730 in / 1372 out tokens · 51710 ms · 2026-05-22T01:12:21.020797+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 7 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Solovey- chik, and Purushotham Kamath. 2024. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. In Conference on Machine Learning and Systems. MIT Press, 114–127

  3. [3]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. DeepSpeed Inference: Enabling efficient inference of Transformer models at unprecedented scale. In International Conference on High Performance Computing, Networking, Storage and Analysis . IEEE, 1–15

  4. [4]

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al . 2024. PyTorch 2: Faster machine learning through dynamic python bytecode trans- formation and graph compilation. In International Conference on Architectural Support for Programming Languages and Operating...

  5. [5]

    Zhenyu Bai, Pranav Dangi, Huize Li, and Tulika Mitra. 2024. SWAT: Scalable and efficient window attention-based Transformers acceleration on FPGAs. In Design Automation Conference. ACM, 1–6

  6. [6]

    Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document Transformer. arXiv preprint arXiv:2004.05150 (2020)

  7. [7]

    Lefaudeux Benjamin, Massa Francisco, Liskovich Diana, Xiong Wenhan, Caggiano Vittorio, Naren Sean, Xu Min, Hu Jieru, Tintore Marta, Zhang Su- san, Labatut Patrick, Haziza Daniel, Wehrstedt Luca, Reizenstein Jeremy, and Sizov Grigory. 2022. xFormers: A modular and hackable Transformer modelling library. https://github.com/facebookresearch/xformers

  8. [8]

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al . 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45

  9. [9]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In USENIX Symposium on Operating Systems Design and Implementations . USENIX, 579–594

  10. [10]

    Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie. 2024. EVT: Accelerating deep learning training with Epilogue Visitor Tree. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 301–316

  11. [11]

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509 (2019)

  12. [12]

    Younghyun Cho, James W Demmel, Jacob King, Xiaoye S Li, Yang Liu, and Hengrui Luo. 2023. Harnessing the crowd for autotuning high-performance com- puting applications. In International Parallel & Distributed Processing Symposium . IEEE, 635–645

  13. [13]

    Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)

  14. [14]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAt- tention: Fast and memory-efficient exact attention with IO-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

  15. [15]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Annual Conference of the North American chapter of the association for computa- tional linguistics: human language technologies . 4171–4186

  16. [16]

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex Attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496 (2024)

  17. [17]

    Jack Dongarra, Mark Gates, Jakub Kurzak, Piotr Luszczek, and Yaohung M Tsai

  18. [18]

    Autotuning numerical dense linear algebra for batched computation with GPU hardware accelerators. Proc. IEEE 106, 11 (2018), 2040–2055

  19. [19]

    Hongxiang Fan, Thomas Chau, Stylianos I Venieris, Royson Lee, Alexandros Kouris, Wayne Luk, Nicholas D Lane, and Mohamed S Abdelfattah. 2022. Adapt- able butterfly accelerator for attention-based NNs via hardware and algorithm co-design. In IEEE/ACM International Symposium on Microarchitecture . IEEE, 599–615

  20. [20]

    Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: An efficient GPU serving system for Transformer models. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming . ACM, 389–402

  21. [21]

    Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3, 1 (2021), 1–23

  22. [22]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  23. [23]

    Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. 2020. 𝐴3: Accelerating attention mechanisms in neural networks with approximation. In High Performance Computer Architecture. IEEE, 328–341

  24. [24]

    Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W Lee. 2021. ELSA: Hardware-software co-design for efficient, light- weight self-attention mechanism in neural networks. In International Symposium on Computer Architecture. ACM/IEEE, 692–705

  25. [25]

    Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2023. FLAT: An optimized dataflow for mitigating attention bottlenecks. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 295–310

  26. [26]

    Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan

  27. [27]

    In Inter- national Conference on Supercomputing

    Accelerated auto-tuning of GPU kernels for tensor computations. In Inter- national Conference on Supercomputing . ACM, 549–561

  28. [28]

    Mingzhen Li, Hailong Yang, Shanjun Zhang, Fengwei Yu, Ruihao Gong, Yi Liu, Zhongzhi Luan, and Depei Qian. 2023. Exploiting subgraph similarities for efficient auto-tuning of tensor programs. In International Conference on Parallel Processing. ACM, 786–796

  29. [29]

    Shutao Li, Weiwei Song, Leyuan Fang, Yushi Chen, Pedram Ghamisi, and Jon Atli Benediktsson. 2019. Deep learning for hyperspectral image classification: An overview. IEEE Transactions on Geoscience and Remote Sensing 57, 9 (2019), 6690–6709

  30. [30]

    Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of Transformers. AI open 3 (2022), 111–132

  31. [31]

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshu- mali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning . ACM, 22137–22176

  32. [32]

    Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang. 2020. Hardware accelerator for multi-head attention and position-wise feed-forward in the Transformer. In International System-on-Chip Conference. IEEE, 84–89

  33. [33]

    Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenx- iang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. InUSENIX Symposium on Operating Systems Design and Implementation . USENIX, 881–897

  34. [34]

    Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. 2019. PaddlePaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing 1, 1 (2019), 105–115

  35. [35]

    Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. DNN- Fusion: Accelerating deep neural networks execution with advanced operator fusion. In International Conference on Programming Language Design and Imple- mentation. ACM, 883–898

  36. [36]

    Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu

  37. [37]

    In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

    TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 90–106

  38. [38]

    NVIDIA. 2022. https://github.com/NVIDIA/FasterTransformer

  39. [39]

    NVIDIA. 2022. https://github.com/NVIDIA/cutlass

  40. [40]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Con- ference on Neural Information Processing Systems . MIT Press, 27730–27744

  41. [41]

    Philip Pfaffe, Tobias Grosser, and Martin Tillmann. 2019. Efficient hierarchical online-autotuning: A case study on polyhedral accelerator mapping. In Interna- tional Conference on Supercomputing . ACM, 354–366

  42. [42]

    Yubin Qin, Yang Wang, Dazheng Deng, Zhiren Zhao, Xiaolong Yang, Leibo Liu, Shaojun Wei, Yang Hu, and Shouyi Yin. 2023. FACT: FFN-attention co-optimized Transformer architecture with eager correlation prediction. In International Sym- posium on Computer Architecture. ACM/IEEE, 1–14

  43. [43]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9

  44. [44]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, ichael MMatena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. 11 Conference acronym ’XX, June 03–05, 2028, Woodstock, NY Dai et al

  45. [45]

    Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM Sigplan Notices 48, 6 (2013), 519–530

  46. [46]

    Thomas Randall, Jaehoon Koo, Brice Videau, Michael Kruse, Xingfu Wu, Paul Hovland, Mary Hall, Rong Ge, and Prasanna Balaprakash. 2023. Transfer-learning- based autotuning using Gaussian copula. In International Conference on Super- computing. ACM, 37–49

  47. [47]

    Francisco Romero, Mark Zhao, Neeraja J Yadwadkar, and Christos Kozyrakis

  48. [48]

    In ACM Symposium on Cloud Computing

    Llama: A heterogeneous & serverless framework for auto-tuning video analytics pipelines. In ACM Symposium on Cloud Computing . ACM, 1–17

  49. [49]

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Effi- cient content-based sparse attention with routing Transformers. Transactions of the Association for Computational Linguistics 9 (2021), 53–68

  50. [50]

    Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling deep learning memory access via tile-graph. In USENIX Symposium on Operating Systems Design and Implementations. USENIX, 701–718

  51. [51]

    Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research 69 (2020), 343–418

  52. [52]

    Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Xiaoyan Liu, Ming Dun, Zhongzhi Luan, and Depei Qian. 2021. csTuner: Scalable auto-tuning framework for complex stencil computation on GPUs. In IEEE International Conference on Cluster Computing. IEEE, 192–203

  53. [53]

    Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Zhongzhi Luan, and Depei Qian. 2024. Adaptive auto-tuning framework for global exploration of stencil optimization on GPUs. IEEE Transactions on Parallel and Distributed Systems 35, 1 (2024), 20–33

  54. [54]

    Qi Sun, Xinyun Zhang, Hao Geng, Yuxuan Zhao, Yang Bai, Haisheng Zheng, and Bei Yu. 2022. GTuner: Tuning DNN computations on GPU via graph attention network. In Asia and South Pacific Design Automation Conference . ACM/IEEE, 1045–1050

  55. [55]

    Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al

  56. [56]

    The International journal of robotics research 37, 4-5 (2018), 405–420

    The limits and potentials of deep learning for robotics. The International journal of robotics research 37, 4-5 (2018), 405–420

  57. [57]

    Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud. 2024. Seer: Predictive runtime kernel selection for irregular problems. In Code Generation and Optimization. IEEE, 133–142

  58. [58]

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine 29, 8 (2023), 1930–1940

  59. [59]

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: An intermediate language and compiler for tiled neural network computations. In ACM SIGPLAN International Workshop on Machine Learning and Programming Languages . ACM, 10–19

  60. [60]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems . MIT Press, 6000–6010

  61. [61]

    Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, and Haifeng Wang. 2024. FlashMask: Efficient and rich mask extension of FlashAttention.arXiv preprint arXiv:2410.01359 (2024)

  62. [62]

    Haoran Wang, Haobo Xu, Ying Wang, and Yinhe Han. 2023. CTA: Hardware- software co-design for compressed token attention mechanism. In High Perfor- mance Computer Architecture. IEEE, 429–441

  63. [63]

    Hulin Wang, Donglin Yang, Yaqi Xia, Zheng Zhang, Qigang Wang, Jianping Fan, Xiaobo Zhou, and Dazhao Cheng. 2024. Raptor-T: A fused and memory-efficient sparse Transformer for long and variable-length sequences. IEEE Trans. Comput. 73, 7 (2024), 1852–1865

  64. [64]

    Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, and Lei Li. 2022. LightSeq2: Accelerated training for Transformer-based models on GPUs. In International Conference on High Perfor- mance Computing, Networking, Storage and Analysis . IEEE, 1–14

  65. [65]

    Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. 2022. Bolt: Bridging the gap between auto-tuners and hardware-native performance. Proceedings of Machine Learning and Systems 4 (2022), 204–216

  66. [66]

    Jiaming Xu, Shan Huang, Jinhao Li, Guyue Huang, Yuan Xie, Yu Wang, and Guo- hao Dai. 2024. Enabling efficient sparse multiplications on GPUs with heuristic adaptability. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP (2024), 1–1

  67. [67]

    Zhiying Xu, Jiafan Xu, Hongding Peng, Wei Wang, Xiaoliang Wang, Haoran Wan, Haipeng Dai, Yixu Xu, Hao Cheng, Kun Wang, et al. 2023. ALT: Breaking the wall between data layout and loop optimizations for deep learning compilation. In European Conference on Computer Systems . ACM, 199–214

  68. [68]

    Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, and Yingyan Lin. 2023. ViTCoD: Vision Transformer accel- eration via dedicated algorithm and accelerator co-design. In High Performance Computer Architecture. IEEE, 273–286

  69. [69]

    Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. In Conference on Neural Information Processing Systems . MIT Press, 1–15

  70. [70]

    Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. 2023. ByteTransformer: A high-performance Trans- former boosted for variable-length inputs. In International Parallel & Distributed Processing Symposium. IEEE, 344–355

  71. [71]

    Zheng Zhang, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2024. MCFuser: High-performance and rapid fusion of memory-bound compute-intensive opera- tors. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

  72. [72]

    Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, et al. 2021. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In ACM SIGPLAN Conference on Programming Language Design and Implementation . ACM, 1233– 1248

  73. [73]

    Jieru Zhao, Pai Zeng, Guan Shen, Quan Chen, and Minyi Guo. 2024. Hard- ware–software co-design enabling static and dynamic sparse attention mecha- nisms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 9 (2024), 2783–2796

  74. [74]

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023)

  75. [75]

    Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating high-performance tensor programs for deep learning. In USENIX Symposium on Operating Systems Design and Implementation . USENIX, 863–879

  76. [76]

    Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E Gonzalez, Ion Stoica, and Ameer Haj Ali. 2021. TenSet: A large-scale program performance dataset for learned tensor compilers. In Conference on Neural Information Pro- cessing Systems. MIT Press, 1–14

  77. [77]

    Size Zheng, Renze Chen, Yicheng Jin, Anjiang Wei, Bingyang Wu, Xiuhong Li, Shengen Yan, and Yun Liang. 2021. NeoFlow: A flexible framework for enabling efficient compilation for high performance DNN training. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2021), 3220–3232

  78. [78]

    Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. 2023. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. InHigh Performance Computer Architecture. ACM, 1113–1126

  79. [79]

    Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020. Flex- Tensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 859–873

  80. [80]

    Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, et al. 2022. AStitch: Enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In International Conference on Architectural Support for Programming Languages and...

Showing first 80 references.