Accelerating Sparse Transformer Inference on GPU

Fangxin Liu; Hailong Yang; Haodong Deng; Hongyu Liu; Mengfei Rong; Qianwen Cao; Qingxiao Sun; Wenhao Dai; Xinyu Yang

arxiv: 2506.06095 · v4 · pith:JRGJDV2Nnew · submitted 2025-06-06 · 💻 cs.LG

Accelerating Sparse Transformer Inference on GPU

Wenhao Dai , Haodong Deng , Mengfei Rong , Xinyu Yang , Hongyu Liu , Fangxin Liu , Hailong Yang , Qianwen Cao

show 1 more author

Qingxiao Sun

This is my paper

Pith reviewed 2026-05-22 01:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords sparse transformerGPU kernel optimizationoperator fusionmulti-head attentioninference accelerationcompilation templatesanalytical modeling

0 comments

The pith

STOF accelerates sparse Transformer inference on GPUs by mapping attention to row-wise or blockwise kernels and searching fusion templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STOF as a framework to optimize inference for sparse Transformers on GPUs, where mask layers reduce computations but require careful handling for performance. It uses analytical modeling to decide between row-wise and blockwise kernel mappings for multi-head attention, each with tailored storage formats. For other operators it applies two-stage search over compilation templates to select fusion schemes that fit the scenario. This combination supports flexible masking while delivering measured speedups. Readers care because LLMs rely on Transformers and sparsity offers efficiency gains only if the GPU execution is tuned to the sparse pattern.

Core claim

STOF is a framework that incorporates optimizations for Sparse Transformer that enables flexible masking and Operator Fusion on GPU. For multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

What carries the argument

Analytical modeling to select row-wise versus blockwise kernels for MHA plus two-stage search over compilation templates for downstream operator fusion.

If this is right

MHA computation in sparse Transformers runs faster when the kernel layout matches the sparsity pattern through modeling.
End-to-end inference time decreases when fusion schemes are chosen by searching compilation templates rather than using static rules.
Flexible masking becomes practical on GPUs because the framework adapts storage and execution to the mask without manual rewriting.
Speedups of 1.6x in attention and 1.4x overall hold when the two-stage search converges on good templates for the target hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modeling and search approach could be applied to other sparse operators beyond attention to compound gains in full model inference.
Extending the analytical model to predict energy use in addition to latency would help deployment decisions for large-scale LLM serving.
Testing the framework on emerging GPU architectures would reveal whether the row-versus-block decision rules remain stable or need recalibration.

Load-bearing premise

Analytical modeling of GPU kernel performance for row-wise versus blockwise mappings, together with two-stage template search, will reliably identify optimal configurations across diverse sparse masks and hardware.

What would settle it

Running STOF on a new sparse mask pattern or different GPU model and finding that a hand-tuned or alternative kernel choice runs faster than the automatically selected configuration.

Figures

Figures reproduced from arXiv: 2506.06095 by Fangxin Liu, Hailong Yang, Haodong Deng, Hongyu Liu, Mengfei Rong, Qianwen Cao, Qingxiao Sun, Wenhao Dai, Xinyu Yang.

**Figure 2.** Figure 2: Kernel fusion for MHA computation. Early works focus on the manual fusion of dense attention without the mask layer. TurboTransformer [19] processes element-wise operations in embarrassingly parallel. ByteTransformer [65] implements a set of hand-written kernels. For short sequences, the intermediate matrix is completely held in shared memory (SMEM) and registers. For relatively long sequences, the grou… view at source ↗

**Figure 3.** Figure 3: Performance comparison of detached operators and fused operator under different configurations. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Performance comparison of fused operators using parameter settings from individual tuning and post-fusion tuning. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: The design overview of STOF [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Block-wise computation with sparse storage format. [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Bank conflict-free wmma warp scheduling. [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: The workflow of fusion scheme converter. [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗

**Figure 9.** Figure 9: The workflow of hierarchical search engine. [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 10.** Figure 10: The MHA performance of the methods normalized to that of PyTorch Native on NVIDIA RTX 4090 GPU. [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 11.** Figure 11: The MHA performance of the methods normalized to that of PyTorch Native on NVIDIA A100 GPU. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗

**Figure 12.** Figure 12: The end-to-end performance of the methods normalized to that of PyTorch Native on RTX 4090 and A100 GPUs. [PITH_FULL_IMAGE:figures/full_fig_p009_12.png] view at source ↗

**Figure 14.** Figure 14: Time breakdown of the STOF overhead normalized [PITH_FULL_IMAGE:figures/full_fig_p010_14.png] view at source ↗

**Figure 13.** Figure 13: The speedup of STOF with only MHA module or [PITH_FULL_IMAGE:figures/full_fig_p010_13.png] view at source ↗

read the original abstract

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer that enables flexible masking and Operator Fusion on GPU. For multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that compared to the stateof-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STOF offers practical speedups for sparse Transformer inference via kernel mapping and fusion search, but evidence is limited so far.

read the letter

The one or two things your colleague should know about this paper is that it introduces STOF, a framework for accelerating sparse Transformer inference on GPUs. It uses analytical modeling to map multi-head attention to either row-wise or blockwise kernels with appropriate storage, and applies a two-stage search to find good fusion configurations for other operators. They report maximum speedups of 1.6 times in MHA and 1.4 times end-to-end compared to state-of-the-art. What the paper does well is address the limitations of static operator fusion by making it more adaptable to various sparse mask patterns. This combination of kernel mapping and search-based fusion seems like a logical way to improve performance in diverse application scenarios for LLMs with sparsity. The work is grounded in real GPU constraints, which gives it practical value if the results are reproducible. On the soft spots, the abstract doesn't provide enough on the experimental setup, such as what sparsity densities were used, which datasets, or the specific baselines. This makes it difficult to evaluate the strength of the claims. There's also the question of whether the analytical performance model and search method work reliably outside the tested conditions, as the stress-test suggests it might be tuned to particular hardware and masks rather than generalizing broadly. This kind of paper is for researchers and engineers working on efficient inference for large language models, especially those implementing custom sparse attention on GPUs. A reader in that area could get value from the specific techniques described. I think it deserves a serious referee because the topic is timely for LLM deployment and the proposed framework has potential, even if more evidence is needed to confirm the benefits.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes STOF, a framework for accelerating sparse Transformer inference on GPUs. It enables flexible masking and operator fusion by mapping MHA computations to row-wise or blockwise kernels (with tailored storage formats) via analytical performance modeling, and by mapping downstream operator fusion to compilation templates whose optimal configurations are selected via two-stage search. Experiments are reported to yield maximum speedups of 1.6x in MHA and 1.4x end-to-end versus prior state-of-the-art work.

Significance. If the modeling and search reliably generalize, the work could deliver practical speedups for sparse Transformer deployments on GPUs. The combination of analytical modeling with lightweight search is a potentially efficient alternative to exhaustive autotuning or purely static fusion, provided the predictions hold across varied sparsity patterns and hardware.

major comments (2)

The central claim that analytical modeling of row-wise versus blockwise kernel performance, together with two-stage search, identifies configurations that generalize to diverse sparse masks and hardware (abstract and STOF design description) is load-bearing for the adaptability and speedup assertions. The manuscript provides no explicit validation of the modeling equations or search procedure on held-out mask families, alternate densities, or different GPU architectures; without such evidence the reported 1.6x/1.4x gains risk being instance-specific rather than framework-driven.
Experimental results section: the maximum speedups are stated without accompanying details on sparsity densities, mask generation methods, number of runs, error bars, or the precise set of baselines and datasets. This absence prevents assessment of whether the gains are robust or sensitive to post-hoc configuration choices.

minor comments (2)

Abstract: 'stateof-the-art' is missing a hyphen.
Notation for the unique storage formats associated with row-wise and blockwise mappings should be introduced explicitly when first used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We provide detailed responses to each major comment below and indicate the revisions we plan to make to address the concerns raised.

read point-by-point responses

Referee: The central claim that analytical modeling of row-wise versus blockwise kernel performance, together with two-stage search, identifies configurations that generalize to diverse sparse masks and hardware (abstract and STOF design description) is load-bearing for the adaptability and speedup assertions. The manuscript provides no explicit validation of the modeling equations or search procedure on held-out mask families, alternate densities, or different GPU architectures; without such evidence the reported 1.6x/1.4x gains risk being instance-specific rather than framework-driven.

Authors: We recognize that the manuscript does not present explicit validation on held-out data or different hardware. The modeling equations are derived from analytical performance models based on GPU memory hierarchy and arithmetic intensity, which are intended to be general. However, to fully substantiate the generalization claim, we will add experiments validating the model predictions on additional mask families and report results on a second GPU architecture in the revised manuscript. We have also clarified in the design section how the two-stage search adapts to new configurations. revision: yes
Referee: Experimental results section: the maximum speedups are stated without accompanying details on sparsity densities, mask generation methods, number of runs, error bars, or the precise set of baselines and datasets. This absence prevents assessment of whether the gains are robust or sensitive to post-hoc configuration choices.

Authors: We agree with the referee that additional experimental details are necessary. The revised manuscript now includes: sparsity densities used in experiments (e.g., 25%, 50%, 75% sparsity), mask generation methods (random masking and block-sparse patterns), number of runs (10 runs per configuration with mean and standard deviation reported), error bars in all figures, the full list of baselines with citations, and the specific datasets for end-to-end inference (e.g., language modeling tasks). These additions will allow readers to better evaluate the robustness of the reported speedups. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering framework with independent experimental validation

full rationale

The paper describes STOF as a GPU optimization framework that applies analytical modeling to select row-wise or blockwise kernels for MHA and uses two-stage search over compilation templates for operator fusion. No equations, fitted parameters presented as predictions, or self-citation chains are invoked to derive the reported speedups. The 1.6x MHA and 1.4x end-to-end gains are measured against external SOTA baselines on concrete hardware and masks, making the contribution self-contained rather than tautological. This is a standard engineering result with no load-bearing derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the framework rests on domain assumptions about GPU kernel behavior and sparsity patterns; no free parameters or invented entities are explicitly introduced in the provided text.

axioms (2)

domain assumption Analytical modeling can accurately predict and select between row-wise and blockwise kernel performance for sparse MHA on GPU hardware.
Invoked when mapping computations to kernels according to analytical modeling.
domain assumption Two-stage search over compilation templates will identify near-optimal fusion configurations for downstream operators.
Used to determine the optimal running configuration.

pith-pipeline@v0.9.0 · 5730 in / 1372 out tokens · 51710 ms · 2026-05-22T01:12:21.020797+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling... two-stage searching
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 7 internal anchors

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Solovey- chik, and Purushotham Kamath. 2024. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. In Conference on Machine Learning and Systems. MIT Press, 114–127

work page 2024
[3]

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. DeepSpeed Inference: Enabling efficient inference of Transformer models at unprecedented scale. In International Conference on High Performance Computing, Networking, Storage and Analysis . IEEE, 1–15

work page 2022
[4]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al . 2024. PyTorch 2: Faster machine learning through dynamic python bytecode trans- formation and graph compilation. In International Conference on Architectural Support for Programming Languages and Operating...

work page 2024
[5]

Zhenyu Bai, Pranav Dangi, Huize Li, and Tulika Mitra. 2024. SWAT: Scalable and efficient window attention-based Transformers acceleration on FPGAs. In Design Automation Conference. ACM, 1–6

work page 2024
[6]

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document Transformer. arXiv preprint arXiv:2004.05150 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[7]

Lefaudeux Benjamin, Massa Francisco, Liskovich Diana, Xiong Wenhan, Caggiano Vittorio, Naren Sean, Xu Min, Hu Jieru, Tintore Marta, Zhang Su- san, Labatut Patrick, Haziza Daniel, Wehrstedt Luca, Reizenstein Jeremy, and Sizov Grigory. 2022. xFormers: A modular and hackable Transformer modelling library. https://github.com/facebookresearch/xformers

work page 2022
[8]

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al . 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45

work page 2024
[9]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In USENIX Symposium on Operating Systems Design and Implementations . USENIX, 579–594

work page 2018
[10]

Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie. 2024. EVT: Accelerating deep learning training with Epilogue Visitor Tree. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 301–316

work page 2024
[11]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[12]

Younghyun Cho, James W Demmel, Jacob King, Xiaoye S Li, Yang Liu, and Hengrui Luo. 2023. Harnessing the crowd for autotuning high-performance com- puting applications. In International Parallel & Distributed Processing Symposium . IEEE, 635–645

work page 2023
[13]

Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAt- tention: Fast and memory-efficient exact attention with IO-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

work page 2022
[15]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Annual Conference of the North American chapter of the association for computa- tional linguistics: human language technologies . 4171–4186

work page 2019
[16]

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex Attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Jack Dongarra, Mark Gates, Jakub Kurzak, Piotr Luszczek, and Yaohung M Tsai

work page
[18]

Autotuning numerical dense linear algebra for batched computation with GPU hardware accelerators. Proc. IEEE 106, 11 (2018), 2040–2055

work page 2018
[19]

Hongxiang Fan, Thomas Chau, Stylianos I Venieris, Royson Lee, Alexandros Kouris, Wayne Luk, Nicholas D Lane, and Mohamed S Abdelfattah. 2022. Adapt- able butterfly accelerator for attention-based NNs via hardware and algorithm co-design. In IEEE/ACM International Symposium on Microarchitecture . IEEE, 599–615

work page 2022
[20]

Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: An efficient GPU serving system for Transformer models. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming . ACM, 389–402

work page 2021
[21]

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3, 1 (2021), 1–23

work page 2021
[22]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. 2020. 𝐴3: Accelerating attention mechanisms in neural networks with approximation. In High Performance Computer Architecture. IEEE, 328–341

work page 2020
[24]

Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W Lee. 2021. ELSA: Hardware-software co-design for efficient, light- weight self-attention mechanism in neural networks. In International Symposium on Computer Architecture. ACM/IEEE, 692–705

work page 2021
[25]

Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2023. FLAT: An optimized dataflow for mitigating attention bottlenecks. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 295–310

work page 2023
[26]

Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan

work page
[27]

In Inter- national Conference on Supercomputing

Accelerated auto-tuning of GPU kernels for tensor computations. In Inter- national Conference on Supercomputing . ACM, 549–561

work page
[28]

Mingzhen Li, Hailong Yang, Shanjun Zhang, Fengwei Yu, Ruihao Gong, Yi Liu, Zhongzhi Luan, and Depei Qian. 2023. Exploiting subgraph similarities for efficient auto-tuning of tensor programs. In International Conference on Parallel Processing. ACM, 786–796

work page 2023
[29]

Shutao Li, Weiwei Song, Leyuan Fang, Yushi Chen, Pedram Ghamisi, and Jon Atli Benediktsson. 2019. Deep learning for hyperspectral image classification: An overview. IEEE Transactions on Geoscience and Remote Sensing 57, 9 (2019), 6690–6709

work page 2019
[30]

Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of Transformers. AI open 3 (2022), 111–132

work page 2022
[31]

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshu- mali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning . ACM, 22137–22176

work page 2023
[32]

Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang. 2020. Hardware accelerator for multi-head attention and position-wise feed-forward in the Transformer. In International System-on-Chip Conference. IEEE, 84–89

work page 2020
[33]

Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenx- iang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. InUSENIX Symposium on Operating Systems Design and Implementation . USENIX, 881–897

work page 2020
[34]

Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. 2019. PaddlePaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing 1, 1 (2019), 105–115

work page 2019
[35]

Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. DNN- Fusion: Accelerating deep neural networks execution with advanced operator fusion. In International Conference on Programming Language Design and Imple- mentation. ACM, 883–898

work page 2021
[36]

Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu

work page
[37]

In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 90–106

work page
[38]

NVIDIA. 2022. https://github.com/NVIDIA/FasterTransformer

work page 2022
[39]

NVIDIA. 2022. https://github.com/NVIDIA/cutlass

work page 2022
[40]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Con- ference on Neural Information Processing Systems . MIT Press, 27730–27744

work page 2022
[41]

Philip Pfaffe, Tobias Grosser, and Martin Tillmann. 2019. Efficient hierarchical online-autotuning: A case study on polyhedral accelerator mapping. In Interna- tional Conference on Supercomputing . ACM, 354–366

work page 2019
[42]

Yubin Qin, Yang Wang, Dazheng Deng, Zhiren Zhao, Xiaolong Yang, Leibo Liu, Shaojun Wei, Yang Hu, and Shouyi Yin. 2023. FACT: FFN-attention co-optimized Transformer architecture with eager correlation prediction. In International Sym- posium on Computer Architecture. ACM/IEEE, 1–14

work page 2023
[43]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9

work page 2019
[44]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, ichael MMatena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. 11 Conference acronym ’XX, June 03–05, 2028, Woodstock, NY Dai et al

work page 2020
[45]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM Sigplan Notices 48, 6 (2013), 519–530

work page 2013
[46]

Thomas Randall, Jaehoon Koo, Brice Videau, Michael Kruse, Xingfu Wu, Paul Hovland, Mary Hall, Rong Ge, and Prasanna Balaprakash. 2023. Transfer-learning- based autotuning using Gaussian copula. In International Conference on Super- computing. ACM, 37–49

work page 2023
[47]

Francisco Romero, Mark Zhao, Neeraja J Yadwadkar, and Christos Kozyrakis

work page
[48]

In ACM Symposium on Cloud Computing

Llama: A heterogeneous & serverless framework for auto-tuning video analytics pipelines. In ACM Symposium on Cloud Computing . ACM, 1–17

work page
[49]

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Effi- cient content-based sparse attention with routing Transformers. Transactions of the Association for Computational Linguistics 9 (2021), 53–68

work page 2021
[50]

Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling deep learning memory access via tile-graph. In USENIX Symposium on Operating Systems Design and Implementations. USENIX, 701–718

work page 2023
[51]

Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research 69 (2020), 343–418

work page 2020
[52]

Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Xiaoyan Liu, Ming Dun, Zhongzhi Luan, and Depei Qian. 2021. csTuner: Scalable auto-tuning framework for complex stencil computation on GPUs. In IEEE International Conference on Cluster Computing. IEEE, 192–203

work page 2021
[53]

Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Zhongzhi Luan, and Depei Qian. 2024. Adaptive auto-tuning framework for global exploration of stencil optimization on GPUs. IEEE Transactions on Parallel and Distributed Systems 35, 1 (2024), 20–33

work page 2024
[54]

Qi Sun, Xinyun Zhang, Hao Geng, Yuxuan Zhao, Yang Bai, Haisheng Zheng, and Bei Yu. 2022. GTuner: Tuning DNN computations on GPU via graph attention network. In Asia and South Pacific Design Automation Conference . ACM/IEEE, 1045–1050

work page 2022
[55]

Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al

work page
[56]

The International journal of robotics research 37, 4-5 (2018), 405–420

The limits and potentials of deep learning for robotics. The International journal of robotics research 37, 4-5 (2018), 405–420

work page 2018
[57]

Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud. 2024. Seer: Predictive runtime kernel selection for irregular problems. In Code Generation and Optimization. IEEE, 133–142

work page 2024
[58]

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine 29, 8 (2023), 1930–1940

work page 2023
[59]

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: An intermediate language and compiler for tiled neural network computations. In ACM SIGPLAN International Workshop on Machine Learning and Programming Languages . ACM, 10–19

work page 2019
[60]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems . MIT Press, 6000–6010

work page 2017
[61]

Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, and Haifeng Wang. 2024. FlashMask: Efficient and rich mask extension of FlashAttention.arXiv preprint arXiv:2410.01359 (2024)

work page arXiv 2024
[62]

Haoran Wang, Haobo Xu, Ying Wang, and Yinhe Han. 2023. CTA: Hardware- software co-design for compressed token attention mechanism. In High Perfor- mance Computer Architecture. IEEE, 429–441

work page 2023
[63]

Hulin Wang, Donglin Yang, Yaqi Xia, Zheng Zhang, Qigang Wang, Jianping Fan, Xiaobo Zhou, and Dazhao Cheng. 2024. Raptor-T: A fused and memory-efficient sparse Transformer for long and variable-length sequences. IEEE Trans. Comput. 73, 7 (2024), 1852–1865

work page 2024
[64]

Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, and Lei Li. 2022. LightSeq2: Accelerated training for Transformer-based models on GPUs. In International Conference on High Perfor- mance Computing, Networking, Storage and Analysis . IEEE, 1–14

work page 2022
[65]

Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. 2022. Bolt: Bridging the gap between auto-tuners and hardware-native performance. Proceedings of Machine Learning and Systems 4 (2022), 204–216

work page 2022
[66]

Jiaming Xu, Shan Huang, Jinhao Li, Guyue Huang, Yuan Xie, Yu Wang, and Guo- hao Dai. 2024. Enabling efficient sparse multiplications on GPUs with heuristic adaptability. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP (2024), 1–1

work page 2024
[67]

Zhiying Xu, Jiafan Xu, Hongding Peng, Wei Wang, Xiaoliang Wang, Haoran Wan, Haipeng Dai, Yixu Xu, Hao Cheng, Kun Wang, et al. 2023. ALT: Breaking the wall between data layout and loop optimizations for deep learning compilation. In European Conference on Computer Systems . ACM, 199–214

work page 2023
[68]

Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, and Yingyan Lin. 2023. ViTCoD: Vision Transformer accel- eration via dedicated algorithm and accelerator co-design. In High Performance Computer Architecture. IEEE, 273–286

work page 2023
[69]

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. In Conference on Neural Information Processing Systems . MIT Press, 1–15

work page 2020
[70]

Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. 2023. ByteTransformer: A high-performance Trans- former boosted for variable-length inputs. In International Parallel & Distributed Processing Symposium. IEEE, 344–355

work page 2023
[71]

Zheng Zhang, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2024. MCFuser: High-performance and rapid fusion of memory-bound compute-intensive opera- tors. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

work page 2024
[72]

Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, et al. 2021. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In ACM SIGPLAN Conference on Programming Language Design and Implementation . ACM, 1233– 1248

work page 2021
[73]

Jieru Zhao, Pai Zeng, Guan Shen, Quan Chen, and Minyi Guo. 2024. Hard- ware–software co-design enabling static and dynamic sparse attention mecha- nisms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 9 (2024), 2783–2796

work page 2024
[74]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[75]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating high-performance tensor programs for deep learning. In USENIX Symposium on Operating Systems Design and Implementation . USENIX, 863–879

work page 2020
[76]

Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E Gonzalez, Ion Stoica, and Ameer Haj Ali. 2021. TenSet: A large-scale program performance dataset for learned tensor compilers. In Conference on Neural Information Pro- cessing Systems. MIT Press, 1–14

work page 2021
[77]

Size Zheng, Renze Chen, Yicheng Jin, Anjiang Wei, Bingyang Wu, Xiuhong Li, Shengen Yan, and Yun Liang. 2021. NeoFlow: A flexible framework for enabling efficient compilation for high performance DNN training. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2021), 3220–3232

work page 2021
[78]

Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. 2023. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. InHigh Performance Computer Architecture. ACM, 1113–1126

work page 2023
[79]

Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020. Flex- Tensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 859–873

work page 2020
[80]

Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, et al. 2022. AStitch: Enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In International Conference on Architectural Support for Programming Languages and...

work page 2022

Showing first 80 references.

[1] [1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Solovey- chik, and Purushotham Kamath. 2024. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. In Conference on Machine Learning and Systems. MIT Press, 114–127

work page 2024

[3] [3]

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. 2022. DeepSpeed Inference: Enabling efficient inference of Transformer models at unprecedented scale. In International Conference on High Performance Computing, Networking, Storage and Analysis . IEEE, 1–15

work page 2022

[4] [4]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al . 2024. PyTorch 2: Faster machine learning through dynamic python bytecode trans- formation and graph compilation. In International Conference on Architectural Support for Programming Languages and Operating...

work page 2024

[5] [5]

Zhenyu Bai, Pranav Dangi, Huize Li, and Tulika Mitra. 2024. SWAT: Scalable and efficient window attention-based Transformers acceleration on FPGAs. In Design Automation Conference. ACM, 1–6

work page 2024

[6] [6]

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document Transformer. arXiv preprint arXiv:2004.05150 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[7] [7]

Lefaudeux Benjamin, Massa Francisco, Liskovich Diana, Xiong Wenhan, Caggiano Vittorio, Naren Sean, Xu Min, Hu Jieru, Tintore Marta, Zhang Su- san, Labatut Patrick, Haziza Daniel, Wehrstedt Luca, Reizenstein Jeremy, and Sizov Grigory. 2022. xFormers: A modular and hackable Transformer modelling library. https://github.com/facebookresearch/xformers

work page 2022

[8] [8]

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al . 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45

work page 2024

[9] [9]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In USENIX Symposium on Operating Systems Design and Implementations . USENIX, 579–594

work page 2018

[10] [10]

Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie. 2024. EVT: Accelerating deep learning training with Epilogue Visitor Tree. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 301–316

work page 2024

[11] [11]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[12] [12]

Younghyun Cho, James W Demmel, Jacob King, Xiaoye S Li, Yang Liu, and Hengrui Luo. 2023. Harnessing the crowd for autotuning high-performance com- puting applications. In International Parallel & Distributed Processing Symposium . IEEE, 635–645

work page 2023

[13] [13]

Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAt- tention: Fast and memory-efficient exact attention with IO-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

work page 2022

[15] [15]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Annual Conference of the North American chapter of the association for computa- tional linguistics: human language technologies . 4171–4186

work page 2019

[16] [16]

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex Attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Jack Dongarra, Mark Gates, Jakub Kurzak, Piotr Luszczek, and Yaohung M Tsai

work page

[18] [18]

Autotuning numerical dense linear algebra for batched computation with GPU hardware accelerators. Proc. IEEE 106, 11 (2018), 2040–2055

work page 2018

[19] [19]

Hongxiang Fan, Thomas Chau, Stylianos I Venieris, Royson Lee, Alexandros Kouris, Wayne Luk, Nicholas D Lane, and Mohamed S Abdelfattah. 2022. Adapt- able butterfly accelerator for attention-based NNs via hardware and algorithm co-design. In IEEE/ACM International Symposium on Microarchitecture . IEEE, 599–615

work page 2022

[20] [20]

Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: An efficient GPU serving system for Transformer models. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming . ACM, 389–402

work page 2021

[21] [21]

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3, 1 (2021), 1–23

work page 2021

[22] [22]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. 2020. 𝐴3: Accelerating attention mechanisms in neural networks with approximation. In High Performance Computer Architecture. IEEE, 328–341

work page 2020

[24] [24]

Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W Lee. 2021. ELSA: Hardware-software co-design for efficient, light- weight self-attention mechanism in neural networks. In International Symposium on Computer Architecture. ACM/IEEE, 692–705

work page 2021

[25] [25]

Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2023. FLAT: An optimized dataflow for mitigating attention bottlenecks. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 295–310

work page 2023

[26] [26]

Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan

work page

[27] [27]

In Inter- national Conference on Supercomputing

Accelerated auto-tuning of GPU kernels for tensor computations. In Inter- national Conference on Supercomputing . ACM, 549–561

work page

[28] [28]

Mingzhen Li, Hailong Yang, Shanjun Zhang, Fengwei Yu, Ruihao Gong, Yi Liu, Zhongzhi Luan, and Depei Qian. 2023. Exploiting subgraph similarities for efficient auto-tuning of tensor programs. In International Conference on Parallel Processing. ACM, 786–796

work page 2023

[29] [29]

Shutao Li, Weiwei Song, Leyuan Fang, Yushi Chen, Pedram Ghamisi, and Jon Atli Benediktsson. 2019. Deep learning for hyperspectral image classification: An overview. IEEE Transactions on Geoscience and Remote Sensing 57, 9 (2019), 6690–6709

work page 2019

[30] [30]

Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of Transformers. AI open 3 (2022), 111–132

work page 2022

[31] [31]

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshu- mali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning . ACM, 22137–22176

work page 2023

[32] [32]

Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang. 2020. Hardware accelerator for multi-head attention and position-wise feed-forward in the Transformer. In International System-on-Chip Conference. IEEE, 84–89

work page 2020

[33] [33]

Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenx- iang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. InUSENIX Symposium on Operating Systems Design and Implementation . USENIX, 881–897

work page 2020

[34] [34]

Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. 2019. PaddlePaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing 1, 1 (2019), 105–115

work page 2019

[35] [35]

Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. DNN- Fusion: Accelerating deep neural networks execution with advanced operator fusion. In International Conference on Programming Language Design and Imple- mentation. ACM, 883–898

work page 2021

[36] [36]

Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu

work page

[37] [37]

In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 90–106

work page

[38] [38]

NVIDIA. 2022. https://github.com/NVIDIA/FasterTransformer

work page 2022

[39] [39]

NVIDIA. 2022. https://github.com/NVIDIA/cutlass

work page 2022

[40] [40]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Con- ference on Neural Information Processing Systems . MIT Press, 27730–27744

work page 2022

[41] [41]

Philip Pfaffe, Tobias Grosser, and Martin Tillmann. 2019. Efficient hierarchical online-autotuning: A case study on polyhedral accelerator mapping. In Interna- tional Conference on Supercomputing . ACM, 354–366

work page 2019

[42] [42]

Yubin Qin, Yang Wang, Dazheng Deng, Zhiren Zhao, Xiaolong Yang, Leibo Liu, Shaojun Wei, Yang Hu, and Shouyi Yin. 2023. FACT: FFN-attention co-optimized Transformer architecture with eager correlation prediction. In International Sym- posium on Computer Architecture. ACM/IEEE, 1–14

work page 2023

[43] [43]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9

work page 2019

[44] [44]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, ichael MMatena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. 11 Conference acronym ’XX, June 03–05, 2028, Woodstock, NY Dai et al

work page 2020

[45] [45]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM Sigplan Notices 48, 6 (2013), 519–530

work page 2013

[46] [46]

Thomas Randall, Jaehoon Koo, Brice Videau, Michael Kruse, Xingfu Wu, Paul Hovland, Mary Hall, Rong Ge, and Prasanna Balaprakash. 2023. Transfer-learning- based autotuning using Gaussian copula. In International Conference on Super- computing. ACM, 37–49

work page 2023

[47] [47]

Francisco Romero, Mark Zhao, Neeraja J Yadwadkar, and Christos Kozyrakis

work page

[48] [48]

In ACM Symposium on Cloud Computing

Llama: A heterogeneous & serverless framework for auto-tuning video analytics pipelines. In ACM Symposium on Cloud Computing . ACM, 1–17

work page

[49] [49]

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Effi- cient content-based sparse attention with routing Transformers. Transactions of the Association for Computational Linguistics 9 (2021), 53–68

work page 2021

[50] [50]

Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling deep learning memory access via tile-graph. In USENIX Symposium on Operating Systems Design and Implementations. USENIX, 701–718

work page 2023

[51] [51]

Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research 69 (2020), 343–418

work page 2020

[52] [52]

Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Xiaoyan Liu, Ming Dun, Zhongzhi Luan, and Depei Qian. 2021. csTuner: Scalable auto-tuning framework for complex stencil computation on GPUs. In IEEE International Conference on Cluster Computing. IEEE, 192–203

work page 2021

[53] [53]

Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Zhongzhi Luan, and Depei Qian. 2024. Adaptive auto-tuning framework for global exploration of stencil optimization on GPUs. IEEE Transactions on Parallel and Distributed Systems 35, 1 (2024), 20–33

work page 2024

[54] [54]

Qi Sun, Xinyun Zhang, Hao Geng, Yuxuan Zhao, Yang Bai, Haisheng Zheng, and Bei Yu. 2022. GTuner: Tuning DNN computations on GPU via graph attention network. In Asia and South Pacific Design Automation Conference . ACM/IEEE, 1045–1050

work page 2022

[55] [55]

Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al

work page

[56] [56]

The International journal of robotics research 37, 4-5 (2018), 405–420

The limits and potentials of deep learning for robotics. The International journal of robotics research 37, 4-5 (2018), 405–420

work page 2018

[57] [57]

Ryan Swann, Muhammad Osama, Karthik Sangaiah, and Jalal Mahmud. 2024. Seer: Predictive runtime kernel selection for irregular problems. In Code Generation and Optimization. IEEE, 133–142

work page 2024

[58] [58]

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine 29, 8 (2023), 1930–1940

work page 2023

[59] [59]

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: An intermediate language and compiler for tiled neural network computations. In ACM SIGPLAN International Workshop on Machine Learning and Programming Languages . ACM, 10–19

work page 2019

[60] [60]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems . MIT Press, 6000–6010

work page 2017

[61] [61]

Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, and Haifeng Wang. 2024. FlashMask: Efficient and rich mask extension of FlashAttention.arXiv preprint arXiv:2410.01359 (2024)

work page arXiv 2024

[62] [62]

Haoran Wang, Haobo Xu, Ying Wang, and Yinhe Han. 2023. CTA: Hardware- software co-design for compressed token attention mechanism. In High Perfor- mance Computer Architecture. IEEE, 429–441

work page 2023

[63] [63]

Hulin Wang, Donglin Yang, Yaqi Xia, Zheng Zhang, Qigang Wang, Jianping Fan, Xiaobo Zhou, and Dazhao Cheng. 2024. Raptor-T: A fused and memory-efficient sparse Transformer for long and variable-length sequences. IEEE Trans. Comput. 73, 7 (2024), 1852–1865

work page 2024

[64] [64]

Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, and Lei Li. 2022. LightSeq2: Accelerated training for Transformer-based models on GPUs. In International Conference on High Perfor- mance Computing, Networking, Storage and Analysis . IEEE, 1–14

work page 2022

[65] [65]

Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. 2022. Bolt: Bridging the gap between auto-tuners and hardware-native performance. Proceedings of Machine Learning and Systems 4 (2022), 204–216

work page 2022

[66] [66]

Jiaming Xu, Shan Huang, Jinhao Li, Guyue Huang, Yuan Xie, Yu Wang, and Guo- hao Dai. 2024. Enabling efficient sparse multiplications on GPUs with heuristic adaptability. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP (2024), 1–1

work page 2024

[67] [67]

Zhiying Xu, Jiafan Xu, Hongding Peng, Wei Wang, Xiaoliang Wang, Haoran Wan, Haipeng Dai, Yixu Xu, Hao Cheng, Kun Wang, et al. 2023. ALT: Breaking the wall between data layout and loop optimizations for deep learning compilation. In European Conference on Computer Systems . ACM, 199–214

work page 2023

[68] [68]

Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, and Yingyan Lin. 2023. ViTCoD: Vision Transformer accel- eration via dedicated algorithm and accelerator co-design. In High Performance Computer Architecture. IEEE, 273–286

work page 2023

[69] [69]

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. In Conference on Neural Information Processing Systems . MIT Press, 1–15

work page 2020

[70] [70]

Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. 2023. ByteTransformer: A high-performance Trans- former boosted for variable-length inputs. In International Parallel & Distributed Processing Symposium. IEEE, 344–355

work page 2023

[71] [71]

Zheng Zhang, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2024. MCFuser: High-performance and rapid fusion of memory-bound compute-intensive opera- tors. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

work page 2024

[72] [72]

Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, et al. 2021. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In ACM SIGPLAN Conference on Programming Language Design and Implementation . ACM, 1233– 1248

work page 2021

[73] [73]

Jieru Zhao, Pai Zeng, Guan Shen, Quan Chen, and Minyi Guo. 2024. Hard- ware–software co-design enabling static and dynamic sparse attention mecha- nisms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 9 (2024), 2783–2796

work page 2024

[74] [74]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[75] [75]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. 2020. Ansor: Generating high-performance tensor programs for deep learning. In USENIX Symposium on Operating Systems Design and Implementation . USENIX, 863–879

work page 2020

[76] [76]

Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E Gonzalez, Ion Stoica, and Ameer Haj Ali. 2021. TenSet: A large-scale program performance dataset for learned tensor compilers. In Conference on Neural Information Pro- cessing Systems. MIT Press, 1–14

work page 2021

[77] [77]

Size Zheng, Renze Chen, Yicheng Jin, Anjiang Wei, Bingyang Wu, Xiuhong Li, Shengen Yan, and Yun Liang. 2021. NeoFlow: A flexible framework for enabling efficient compilation for high performance DNN training. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2021), 3220–3232

work page 2021

[78] [78]

Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. 2023. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. InHigh Performance Computer Architecture. ACM, 1113–1126

work page 2023

[79] [79]

Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020. Flex- Tensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In International Conference on Architectural Support for Programming Languages and Operating Systems . ACM, 859–873

work page 2020

[80] [80]

Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, et al. 2022. AStitch: Enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In International Conference on Architectural Support for Programming Languages and...

work page 2022