
arxiv: 2407.08608 · v2 · pith:UQKLUT5Onew · submitted 2024-07-11 · 💻 cs.LG · cs.AI

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Pith reviewed 2026-05-20 19:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: FlashAttention, attention mechanism, GPU optimization, Hopper architecture, FP8 low precision, asynchrony, warp specialization, Transformer

The pith

FlashAttention-3 speeds up attention on H100 GPUs by 1.5-2.0× in FP16 by exploiting hardware asynchrony, and adds an FP8 path that reaches close to 1.2 PFLOPs/s.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces FlashAttention-3, which accelerates attention, the layer that bottlenecks Transformer models, on NVIDIA Hopper GPUs. The authors use warp specialization to overlap computation with data movement, interleave matrix multiplications and softmax steps at the block level, and add block quantization plus incoherent processing to make use of FP8 low precision. These changes raise FP16 utilization to as much as 75 percent (FlashAttention-2 reaches only 35 percent on the H100) and push FP8 throughput close to 1.2 PFLOPs/s, while cutting numerical error by a factor of 2.6 relative to a baseline FP8 attention. A reader would care because attention speed directly limits how large a model, or how long a context, can be trained or served on a given GPU.

Core claim

We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. FlashAttention-3 achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.

What carries the argument

Warp specialization to overlap Tensor Core computation with TMA data movement, block-level interleaving of matmul and softmax, and block quantization with incoherent processing to support FP8 arithmetic.
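The block-level interleaving above rests on evaluating softmax one key/value block at a time. A minimal sketch of that idea in plain Python follows, assuming the running-max and running-sum ("online softmax") rescaling that FlashAttention-style kernels are generally described as using. The function names, the single-query simplification, and the block size are illustrative choices, not the paper's kernel.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, K, V):
    """Standard softmax attention for one query vector q over all rows of K and V."""
    scores = [dot(q, k) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]

def attention_blockwise(q, K, V, block_size=2):
    """Same result, visiting K/V one block at a time with a running max and sum."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])                      # unnormalized output accumulator
    for start in range(0, len(K), block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = [dot(q, k) for k in K_blk]      # first matmul: q @ K_blk^T
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(probs)
        # second matmul: probs @ V_blk, added to the rescaled accumulator
        acc = [a * scale + sum(p * v[j] for p, v in zip(probs, V_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy inputs: one query, five keys/values.
q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.5, 0.5], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_reference(q, K, V)
blocked = attention_blockwise(q, K, V, block_size=2)
print(max(abs(a - b) for a, b in zip(full, blocked)))
```

Each block needs only the running max, running sum, and accumulator carried over from earlier blocks, so the score matmul for one block does not depend on the softmax of another. That independence is what a pipelined schedule can exploit.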

If this is right

  • Attention no longer limits throughput as severely for long-context or large-batch Transformer workloads on Hopper hardware.
  • FP8 attention can reach close to 1.2 PFLOPs/s while keeping numerical error 2.6× lower than a baseline FP8 attention.
  • The attention portion of training and inference runs 1.5 to 2 times faster on the same GPU; the end-to-end gain depends on how much of total runtime attention accounts for.
  • Higher compute utilization (up to 75 percent) becomes reachable without changing model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap and quantization ideas could be applied to other Transformer operations such as feed-forward layers.
  • Hardware vendors might expose similar asynchronous primitives on future chips, allowing these speedups to generalize beyond Hopper.
  • Incoherent block processing may extend to even lower precisions such as FP4 if hardware support appears.

Load-bearing premise

The asynchronous execution model of Tensor Cores and TMA on Hopper GPUs can be safely exploited through warp specialization and interleaving without synchronization bugs or incorrect attention outputs.
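The premise is about one group of workers loading data while another computes, with correctness hinging on the hand-off between them. As a rough CPU analogy only, the sketch below uses a Python producer thread and a bounded queue. It does not model Tensor Cores, TMA, or warps, and every name in it is hypothetical.

```python
import queue
import threading

def pipelined_sum_of_squares(blocks, stages=2):
    """Producer thread 'loads' blocks while the consumer computes on earlier ones."""
    buffer = queue.Queue(maxsize=stages)   # bounded buffer, like a small ring of stages
    done = object()                        # sentinel marking the end of the stream

    def producer():
        for block in blocks:
            buffer.put(list(block))        # stand-in for an asynchronous tile load
        buffer.put(done)

    loader = threading.Thread(target=producer)
    loader.start()
    total = 0.0
    while True:
        tile = buffer.get()                # waits until a tile is ready (the sync point)
        if tile is done:
            break
        total += sum(x * x for x in tile)  # stand-in for the matmul/softmax work
    loader.join()
    return total

print(pipelined_sum_of_squares([[1, 2], [3, 4], [5]]))
```

The consumer only ever touches a tile after `get()` returns, which is the property the premise asks of the real barriers. A missing or misplaced wait would let compute read a tile that has not finished loading.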

What would settle it

Run FlashAttention-3 on an H100 GPU, record measured TFLOPs/s in FP16 and FP8 modes, and compare the numerical error of the FP8 output against a standard FP8 attention implementation to check if the claimed 2.6× error reduction appears.
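The bookkeeping for such a check can be sketched as follows. The FLOP count uses the convention common in attention benchmarks (two matmuls, each 2 · seqlen² · head_dim FLOPs per head, halved for causal masking). RMSE against a higher-precision reference is assumed as the error metric. The timing in the usage lines is a made-up placeholder, not a measurement.

```python
import math

def attention_forward_flops(batch, heads, seqlen, head_dim, causal=False):
    """Forward-pass FLOPs: QK^T and PV, each 2 * seqlen^2 * head_dim per head."""
    flops = 4 * batch * heads * seqlen * seqlen * head_dim
    return flops // 2 if causal else flops

def tflops_per_second(flops, seconds):
    return flops / seconds / 1e12

def implied_peak_tflops(achieved_tflops, utilization):
    """Peak throughput implied by an achieved figure and a utilization fraction."""
    return achieved_tflops / utilization

def rmse(values, reference):
    return math.sqrt(sum((v - r) ** 2 for v, r in zip(values, reference))
                     / len(reference))

def error_reduction(baseline_out, candidate_out, reference_out):
    """How many times smaller the candidate's RMSE is than the baseline's."""
    return rmse(baseline_out, reference_out) / rmse(candidate_out, reference_out)

# Usage with a hypothetical problem size and a placeholder timing.
flops = attention_forward_flops(batch=4, heads=32, seqlen=4096, head_dim=128)
print(tflops_per_second(flops, seconds=1.5e-3))
# Peak implied by the quoted 740 TFLOPs/s at 75% utilization.
print(implied_peak_tflops(740.0, 0.75))
```

With these helpers, the 2.6× claim corresponds to `error_reduction` returning about 2.6 when the baseline FP8 output and the FlashAttention-3 FP8 output are both compared against the same reference.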

Original abstract

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper proposes FlashAttention-3, an attention algorithm optimized for Hopper GPUs. It uses three techniques: warp specialization to overlap computation and data movement by exploiting asynchrony between Tensor Cores and TMA, interleaving of block-wise matmul and softmax operations, and block FP8 quantization with incoherent processing. The authors report 1.5-2.0× speedups, with FP16 performance up to 740 TFLOPs/s (75% utilization) and FP8 close to 1.2 PFLOPs/s, and 2.6× lower numerical error than a baseline FP8 attention.

Significance. The results, if they hold, would be significant for improving the efficiency of Transformer models on cutting-edge hardware. By increasing GPU utilization for attention to 75% and demonstrating benefits of low-precision with reduced error, this work addresses a key bottleneck in scaling LLMs. Credit is due for the direct empirical validation on H100 hardware without reliance on any free parameters or circular reasoning.

major comments (1)
  1. §5 (Experimental Results): While concrete TFLOPs/s and error numbers are reported, the section does not provide error bars, detailed benchmark setup including sequence lengths tested, or rules for data exclusion, making it difficult to verify the claimed speedups and error reductions.
minor comments (2)
  1. Abstract: It would be clearer to report the utilization percentage for FP8 as well, for consistency with the FP16 case.
  2. Related Work: Ensure all prior FlashAttention papers are cited with their specific utilization numbers for context.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the single major comment below and will incorporate the suggested improvements into the revised manuscript.

Point-by-point responses
  1. Referee: §5 (Experimental Results): While concrete TFLOPs/s and error numbers are reported, the section does not provide error bars, detailed benchmark setup including sequence lengths tested, or rules for data exclusion, making it difficult to verify the claimed speedups and error reductions.

    Authors: We agree that additional details would strengthen reproducibility. In the revised Section 5, we will add error bars to all reported TFLOPs/s and numerical error figures, computed over at least five independent runs with different random seeds. We will expand the benchmark description to explicitly list the sequence lengths evaluated (512 to 131072 tokens), batch sizes, head dimensions, and the precise H100 GPU configuration (including CUDA version and PyTorch version). We will also state that no measurements were excluded; all collected data points are reported without selective omission. These changes address the verification concern directly.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes hardware-specific optimizations (warp specialization for asynchrony, interleaving of block matmul/softmax, and block FP8 quantization) for attention on H100 GPUs. All central claims (speedups of 1.5-2.0× reaching up to 740 TFLOPs/s in FP16 or close to 1.2 PFLOPs/s in FP8, plus 2.6× lower numerical error) are direct empirical measurements on hardware against explicit baselines. No equations, fitted parameters, or derivations are presented that could reduce to self-definition or self-citation. Prior FlashAttention citations supply background but are not invoked as uniqueness theorems or load-bearing justifications for the new results, which stand on external hardware benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions about linear-algebra correctness and hardware behavior rather than new free parameters or invented entities.

axioms (2)
  • domain assumption: Matrix multiplication and softmax operations can be interleaved while preserving mathematical equivalence when properly synchronized.
    Invoked by the interleaving technique described in the abstract.
  • domain assumption: Block quantization with incoherent processing preserves sufficient numerical fidelity for attention outputs.
    Required for the FP8 accuracy claim.
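The second assumption can be probed numerically. The sketch below makes several assumptions not stated on this page: `fake_fp8` is a rough e4m3-like rounding model rather than a bit-exact one, block quantization is taken to mean one scale per block of rows, and incoherent processing is modeled as a random-sign Hadamard transform applied to both query and key vectors. All names are hypothetical.

```python
import math
import random

FP8_MAX = 448.0  # largest finite value in the e4m3 format

def fake_fp8(x):
    """Rough e4m3-like rounding: 3 mantissa bits, smallest step 2**-9, clamp at 448."""
    x = max(-FP8_MAX, min(FP8_MAX, x))
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    step = math.ldexp(1.0, max(e - 4, -9))    # spacing of representable values near x
    return round(x / step) * step

def quantize_blocks(rows, block_rows):
    """Quantize with one scale per block of rows; return the dequantized values."""
    out = []
    for start in range(0, len(rows), block_rows):
        block = rows[start:start + block_rows]
        amax = max(abs(x) for row in block for x in row) or 1.0
        scale = FP8_MAX / amax
        out.extend([[fake_fp8(x * scale) / scale for x in row] for row in block])
    return out

def fwht(vec):
    """Walsh-Hadamard transform, normalized so that it is orthogonal."""
    out = list(vec)
    n = len(out)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]

def incoherent(vec, signs):
    """Apply random signs, then the Hadamard transform (an orthogonal map)."""
    return fwht([s * x for s, x in zip(signs, vec)])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One large-magnitude row and one small-magnitude row.
rows = [[1000.0, -500.0], [0.001, 0.0005]]
per_tensor = quantize_blocks(rows, block_rows=len(rows))  # single shared scale
per_block = quantize_blocks(rows, block_rows=1)           # one scale per row block
print(per_tensor[1], per_block[1])

# An outlier-heavy query: the transform spreads the outlier across all entries.
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]
q_demo = [100.0] + [0.0] * 7
k_demo = [rng.uniform(-1.0, 1.0) for _ in range(8)]
print(max(abs(x) for x in q_demo), max(abs(x) for x in incoherent(q_demo, signs)))
print(dot(q_demo, k_demo), dot(incoherent(q_demo, signs), incoherent(k_demo, signs)))
```

With a single shared scale the small row falls below the smallest representable step and is lost, while per-block scales keep it. The transform leaves the query-key dot product unchanged up to floating-point rounding while shrinking the largest entry, which is why it can reduce quantization error when outliers are present.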

pith-pipeline@v0.9.0 · 5756 in / 1367 out tokens · 65561 ms · 2026-05-20T19:39:48.078510+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  2. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  3. Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

    cs.PL 2026-04 unverdicted novelty 7.0

    Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.

  4. Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

    cs.AR 2026-04 unverdicted novelty 7.0

    Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on A...

  5. KernelBench: Can LLMs Write Efficient GPU Kernels?

    cs.LG 2025-02 accept novelty 7.0

    KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.

  6. A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

    cs.DC 2026-05 conditional novelty 6.0

    PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

  7. Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis

    cs.AR 2026-05 unverdicted novelty 6.0

    Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.

  8. FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

    cs.CV 2026-04 unverdicted novelty 6.0

    FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.

  9. Nucleus-Image: Sparse MoE for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  10. PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

    cs.PF 2026-01 unverdicted novelty 6.0

    PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.

  11. HunyuanVideo 1.5 Technical Report

    cs.CV 2025-11 unverdicted novelty 6.0

    HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.

  12. Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

    cs.LG 2025-11 unverdicted novelty 6.0

    Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.

  13. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  14. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

    cs.LG 2025-04 unverdicted novelty 6.0

    TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a fac...

  15. Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    cs.LG 2024-12 unverdicted novelty 6.0

    FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.

  16. Scaling Diffusion Language Models via Adaptation from Autoregressive Models

    cs.CL 2024-10 conditional novelty 6.0

    Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on langua...

  17. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

    cs.LG 2024-09 conditional novelty 6.0

    RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-...

  18. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  19. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

    cs.CL 2024-12 unverdicted novelty 5.0

    ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

  20. Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for th...

  21. Hierarchical vs. Flat Iteration in Shared-Weight Transformers

    cs.CL 2026-04 unverdicted novelty 4.0

    Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 21 Pith papers · 19 internal anchors

  1. [1]

    Performance, design, and autotuning of batched gemm for gpus

    Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra. Performance, design, and autotuning of batched gemm for gpus. pages 21–38, 06 2016. ISBN 978-3-319-41320-4. doi: 10.1007/978-3-319-41321-1_2

  2. [2]

    Introducing jamba: Ai21’s groundbreaking ssm-transformer model

    AI21. Introducing jamba: Ai21’s groundbreaking ssm-transformer model. AI21 blog, 2024

  3. [3]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  4. [4]

    CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

    Michael Bauer, Henry Cook, and Brucek Khailany. CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450307710. doi: 10.1145/2063384.2063400. URLhttps:...

  5. [5]

    xlstm: Extended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.

  6. [6]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  7. [7]

    Delivering 1 PFLOP/s of Performance with FP8 FlashAttention-2, 2024

    Ganesh Bikshandi and Jay Shah. Delivering 1 PFLOP/s of Performance with FP8 FlashAttention-2, 2024. URL https://research.colfax-intl.com/adding-fp8-to-flashattention/

  8. [8]

    Striped attention: Faster ring attention for causal transformers

    William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023

  9. [9]

    Quip: 2-bit quantization of large language models with guarantees

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36, 2024

  10. [10]

    Scatterbrain: Unifying sparse and low-rank attention

    Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  11. [11]

    Scaling vision transformers to gigapixel images via hierarchical self-supervised learning

    Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16144–16155, 2022

  12. [12]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  13. [13]

    Rethinking attention with performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In The International Conference on Learning Representations (ICLR), 2021

  14. [14]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations (ICLR), 2020

  15. [15]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, 2023. URL https://arxiv.org/abs/2307.08691

  16. [16]

    Transformers are SSMs: Generalized models and efficient algorithms with structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms with structured state space duality. In International Conference on Machine Learning (ICML), 2024

  17. [17]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  18. [18]

    Hungry hungry hippos: Towards language modeling with state space models

    Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. In The International Conference on Learning Representations (ICLR), 2023

  19. [19]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

  20. [20]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. CoRR abs/2208.07339, 2022

  21. [21]

    Is flash attention stable?

    Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, et al. Is flash attention stable? arXiv preprint arXiv:2405.02803, 2024

  22. [22]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. 2023

  23. [23]

    Conformer: Convolution-augmented transformer for speech recognition

    Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.

  24. [24]

    Longt5: Efficient text-to-text transformer for long sequences

    Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. Longt5: Efficient text-to-text transformer for long sequences. arXiv preprint arXiv:2112.07916, 2021

  25. [25]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

  26. [26]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024

  27. [27]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020

  28. [28]

    Reformer: The efficient transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In The International Conference on Machine Learning (ICML), 2020

  29. [29]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  30. [30]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023

  31. [31]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023

  32. [32]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

  33. [33]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

  34. [34]

    Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, 2024

    Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, 2024. URL https://arxiv.org/abs/2402.13499

  35. [35]

    Mega: Moving average equipped gated attention

    Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In The International Conference on Learning Representations (ICLR), 2023

  36. [36]

    Megalodon: Efficient llm pretraining and inference with unlimited context length

    Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient llm pretraining and inference with unlimited context length. arXiv preprint arXiv:2404.08801, 2024

  37. [37]

    FP8 Formats for Deep Learning

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

  38. [38]

    CUDA Programming Guide Version 12.4, 2024

    NVIDIA. CUDA Programming Guide Version 12.4, 2024. URL https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

  39. [39]

    Accelerating transformers with nvidia cudnn 9

    Nvidia. Accelerating transformers with nvidia cudnn 9. Nvidia blog, 2024. URL https://developer.nvidia.com/blog/accelerating-transformers-with-nvidia-cudnn-9/

  40. [40]

    Parallel Thread Execution ISA Version 8.4, 2024

    NVIDIA. Parallel Thread Execution ISA Version 8.4, 2024. URL https://docs.nvidia.com/cuda/pdf/ptx_isa_8.4.pdf

  41. [41]

    Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, and John D. Owens. Stream-k: Work-centric parallel decomposition for dense matrix-matrix multiplication on the gpu. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’23, pages 429–431, New York, NY, USA, 2023. Association for Comput...

  42. [42]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. RWKV: Reinventing RNNs for the Transformer era. arXiv preprint arXiv:2305.13048, 2023

  43. [43]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023

  44. [44]

    Random feature attention

    Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. Random feature attention. In The International Conference on Learning Representations (ICLR), 2021

  45. [45]

    Self-attention does not need O(n²) memory

    Markus N Rabe and Charles Staats. Self-attention does not need O(n²) memory. arXiv preprint arXiv:2112.05682, 2021

  46. [46]

    Tutorial: Matrix Transpose in CUTLASS, 2024

    Colfax Research. Tutorial: Matrix Transpose in CUTLASS, 2024. URL https://research.colfax-intl.com/tutorial-matrix-transpose-in-cutlass/

  47. [47]

    Efficient content-based sparse attention with routing Transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing Transformers. arXiv preprint arXiv:2003.05997, 2020

  48. [48]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  49. [49]

    Lean attention: Hardware-aware scalable attention mechanism for the decode-phase of transformers

    Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, and Saravan Rajmohan. Lean attention: Hardware-aware scalable attention mechanism for the decode-phase of transformers. 2024

  50. [50]

    Scrolls: Standardized comparison over long language sequences

    Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, et al. Scrolls: Standardized comparison over long language sequences. arXiv preprint arXiv:2201.03533, 2022

  51. [51]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  52. [52]

    ThunderKittens

    Benjamin Spector, Aaryan Singhal, Simran Arora, and Christopher Ré, 2024. URL https://github.com/HazyResearch/ThunderKittens

  53. [53]

    Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer

    Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management, pages 1441–1450, 2019

  54. [54]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  55. [55]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  56. [56]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020

  57. [57]

    CUTLASS

    Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUTLASS, January 2023. URL https://github.com/NVIDIA/cutlass.

  58. [58]

    QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024

  59. [59]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  60. [60]

    An Empirical Study of Mamba-based Language Models

    Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887, 2024

  61. [61]

    Nyströmformer: A Nyström-based algorithm for approximating self-attention

    Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, page 14138, 2021

  62. [62]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  63. [63]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020

  64. [64]

    Zyphra unveils Zamba: A compact 7B SSM hybrid model

    Zyphra. Zyphra unveils Zamba: A compact 7B SSM hybrid model. Zyphra blog, 2024.

    A Related Work

    Attention variants and distributed attention. Ever since attention became popular with the Transformer architecture [59], there has been a large body of work on approximating attention to scale it to longer sequences. These approximation methods can generally be...

  65. [65]

    Softmax is reordered to the very beginning, even before the first WGMMA

  66. [66]

    The first WGMMA is interleaved with softmax and the FP32 → FP16 datatype conversion of S. This indicates that WGMMA and non-WGMMA instructions are executed in parallel

  67. [67]

    exp2, row\_sum, O rescaling, and FP32 → FP16 conversions are interleaved together

  68. [68]

    The second WGMMA is not overlapped with other instructions, as expected. Overall, SASS shows that the 2-stage pipelining idea works as expected.

    B.3 3-Stage Pipelining Algorithm

    We experiment with a 3-stage pipelining algorithm to parallelize the first WGMMA from iteration 𝑗 + 2, softmax from iteration 𝑗 + 1, and the second WGMMA from iteration 𝑗. We des...