Recognition: 3 theorem links
· Lean TheoremFlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Pith reviewed 2026-05-12 16:16 UTC · model grok-4.3
The pith
FlashAttention reduces GPU memory traffic for exact attention computation using tiling, achieving faster Transformer training on long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. It requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. This yields 15% end-to-end speedup on BERT-large, 3x on GPT-2, 2.4x on long-range arena, and enables the first Transformers to succeed on Path-X and Path-256 tasks with sequences of 16K and 64K.
What carries the argument
Tiling that divides attention into blocks sized for SRAM and fuses softmax with matrix multiplies to avoid repeated HBM writes.
If this is right
- Transformers train 15% to 3x faster end-to-end on sequences from 512 to 4K.
- Models achieve better perplexity and classification accuracy from longer contexts.
- Block-sparse FlashAttention runs faster than prior approximate attention methods while remaining close to exact.
- Transformers reach above-chance performance on 16K and 64K sequence tasks for the first time.
Where Pith is reading between the lines
- Similar tiling ideas could speed up other memory-heavy layers such as convolutions or recurrent steps.
- Hardware with larger on-chip SRAM would amplify the benefits of this style of algorithm.
- The approach suggests attention can stay exact even as sequence lengths grow into tens of thousands.
Load-bearing premise
The tiling and kernel implementation incurs negligible overhead from synchronization or register pressure and generalizes across GPU architectures without requiring architecture-specific tuning beyond the stated SRAM sizes.
What would settle it
Compare HBM access counts or wall-clock runtime of FlashAttention versus standard attention on a long sequence; if the number of accesses or runtime shows no reduction, the IO savings claim would be false.
read the original abstract
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FlashAttention, an IO-aware exact attention algorithm that employs tiling to minimize reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It derives an IO complexity analysis showing fewer HBM accesses than standard attention and optimality for a range of SRAM sizes under a two-level memory model. The approach is extended to block-sparse attention, and experiments demonstrate wall-clock speedups (e.g., 3x on GPT-2 with seq. length 1K) along with improved long-context performance on tasks such as Path-X (seq. length 16K).
Significance. If the central claims hold, the work is significant for providing a hardware-aware optimization that achieves practical speedups and longer contexts without trading off exactness or model quality, unlike prior approximate attention methods. The first-principles IO complexity analysis (derived from standard memory-access counting) and the empirical results enabling new capabilities (e.g., better-than-chance Path-256 performance) represent a concrete advance in efficient Transformer training and inference.
major comments (1)
- [§3] §3 (IO complexity analysis): The optimality claim for a range of SRAM sizes and the assertion of fewer HBM accesses rest on an idealized two-level memory model; the manuscript does not report direct profiling of actual HBM traffic or an overhead breakdown (synchronization, register pressure, bank conflicts) in the CUDA kernel to confirm that the predicted reduction is realized when tile counts increase with sequence length.
minor comments (2)
- [§4] §4 (experiments): Speedup results are reported exclusively on A100 GPUs; adding results or analysis for other architectures (e.g., V100 or H100) with the same tiling parameters would strengthen the generalization claim.
- [Abstract] Abstract and §5: The block-sparse extension is presented as faster than existing approximate methods, but the manuscript could clarify the sparsity pattern selection criteria and any quality trade-offs more explicitly in the main text.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the recognition of its significance, and the recommendation for minor revision. We address the major comment point by point below.
read point-by-point responses
-
Referee: [§3] §3 (IO complexity analysis): The optimality claim for a range of SRAM sizes and the assertion of fewer HBM accesses rest on an idealized two-level memory model; the manuscript does not report direct profiling of actual HBM traffic or an overhead breakdown (synchronization, register pressure, bank conflicts) in the CUDA kernel to confirm that the predicted reduction is realized when tile counts increase with sequence length.
Authors: We thank the referee for this careful observation. The IO complexity analysis in Section 3 is derived under the standard two-level memory hierarchy model (HBM vs. on-chip SRAM), which is the conventional abstraction used in the analysis of cache-efficient and IO-aware algorithms. Under this model we prove both the asymptotic reduction in HBM accesses relative to standard attention and optimality for a range of SRAM sizes. We acknowledge that the manuscript does not include direct hardware-counter measurements of HBM traffic or a fine-grained breakdown of CUDA-level overheads (synchronization, register pressure, bank conflicts). Such profiling is technically challenging to obtain without substantial instrumentation overhead and is not reported in the current version. Instead, we validate the practical impact of the IO savings through wall-clock timings that improve with increasing sequence length (Figures 3–5 and Tables 1–2), an outcome that would be impossible if the predicted HBM reduction were not realized. The open-source kernel is publicly available for independent profiling. In the revised manuscript we will add a short paragraph in Section 3 clarifying the modeling assumptions and the empirical validation strategy. revision: yes
Circularity Check
No significant circularity: IO complexity follows from standard memory-access counting on the tiled algorithm
full rationale
The paper's core derivation (IO complexity of FlashAttention via tiling) is obtained by direct counting of HBM reads/writes under a standard two-level memory model, without any fitted parameters, self-referential definitions, or load-bearing self-citations. The optimality claim for a range of SRAM sizes is a direct consequence of the access formulas applied to the algorithm's structure. Empirical speedups and block-sparse extension are separate experimental results, not part of the derivation chain. No step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption GPU memory hierarchy consists of HBM (slow, large) and SRAM (fast, small) with known relative access costs
Lean theorems connected to this paper
-
Cost.FunctionalEquationJcost uniqueness unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
-
DimensionForcingD=3 forcing unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes.
-
HierarchyEmergencehierarchy_emergence_forces_phi unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
-
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
-
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Kerncap automatically extracts isolated, reproducible GPU kernels from large HIP and Triton applications on AMD GPUs by capturing HSA dispatches and producing self-contained reproducer projects that preserve virtual-a...
-
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.
-
Projection-Free Transformers via Gaussian Kernel Attention
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
-
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.
-
Transactional Attention: Semantic Sponsorship for KV-Cache Retention
Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
-
Remember to Forget: Gated Adaptive Positional Encoding
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Efficient Streaming Language Models with Attention Sinks
StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
-
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
-
Token Merging: Your ViT But Faster
Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
A Hybrid Method for Low-Resource Named Entity Recognition
The hybrid method with LLM-augmented data achieves F1 improvements of 7-24 points over baselines on five Vietnamese domain datasets.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
-
Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports
SFT followed by GRPO improves LLM accuracy and reasoning recall in disease classification from radiology reports on three radiologist-annotated datasets.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...
Reference graph
Works this paper leans on
-
[1]
The input/output complexity of sorting and related problems
Alok Aggarwal and S Vitter, Jeffrey. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988
work page 1988
-
[2]
Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention.arXiv preprint arXiv:2102.08602, 2021
-
[3]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
An updated set of basic linear algebra subprograms (blas)
L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software, 28(2):135–151, 2002
work page 2002
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
work page 1901
-
[6]
Neural legal judgment prediction in English
Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4317–4323, Florence, Italy, 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1424. URL https://www.aclweb.org/anthology/P19-1424
-
[7]
Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, and Prodromos Malakasiotis. Paragraph-level rationale extraction through regularization: A case study on european court of human rights cases. InProceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguist...
work page 2021
-
[8]
Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, and Ghislain Durif. Kernel operations on the gpu, with autodiff, without memory overflows.Journal of Machine Learning Research, 22(74):1–6, 2021. URL http://jmlr.org/papers/v22/20-275.html
work page 2021
-
[9]
Scatterbrain: Unifying sparse and low-rank attention
Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. InAdvances in Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[10]
Training Deep Nets with Sublinear Memory Cost
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[11]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[12]
Rethinking attention with performers
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. InInternational Conference on Learning Representations (ICLR), 2020
work page 2020
-
[13]
Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. Revisiting transformer-based models for long document classification.arXiv preprint arXiv:2204.06683, 2022
-
[14]
Transformer-XL: Attentive language models beyond a fixed-length context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019. 11
work page 2019
-
[15]
Learning fast algorithms for linear transforms using butterfly factorizations
Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms for linear transforms using butterfly factorizations. InInternational Conference on Machine Learning (ICML), 2019
work page 2019
-
[16]
Kaleidoscope: An efficient, learnable representation for all structured linear maps
Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, and Christopher Ré. Kaleidoscope: An efficient, learnable representation for all structured linear maps. In International Conference on Learning Representations (ICLR), 2020
work page 2020
-
[17]
Pixelated butterfly: Simple and efficient sparse training for neural network models
Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Ré. Pixelated butterfly: Simple and efficient sparse training for neural network models. InInternational Conference on Learning Representations (ICLR), 2022
work page 2022
-
[18]
Monarch: Expressive structured matrices for efficient and accurate training
Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, and Christopher Ré. Monarch: Expressive structured matrices for efficient and accurate training. In International Conference on Machine Learning (ICML), 2022
work page 2022
-
[19]
Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems, 33:6476–6489, 2020
work page 2020
-
[20]
A two-pronged progress in structured dense matrix vector multiplication
Christopher De Sa, Albert Gu, Rohan Puttagunta, Christopher Ré, and Atri Rudra. A two-pronged progress in structured dense matrix vector multiplication. InProceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1060–1079. SIAM, 2018
work page 2018
-
[21]
The working set model for program behavior.Communications of the ACM, 11(5): 323–333, 1968
Peter J Denning. The working set model for program behavior.Communications of the ACM, 11(5): 323–333, 1968
work page 1968
-
[22]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 2019
work page 2019
-
[23]
Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon.arXiv preprint arXiv:1705.07565, 2017
-
[24]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020
work page 2020
-
[25]
On a new class of structured matrices.Integral Equations and Operator Theory, 34(3):293–324, 1999
Y Eidelman and I Gohberg. On a new class of structured matrices.Integral Equations and Operator Theory, 34(3):293–324, 1999
work page 1999
-
[26]
Fast geometric learning with symbolic matrices
Jean Feydy, Joan Glaunès, Benjamin Charlier, and Michael Bronstein. Fast geometric learning with symbolic matrices. Advances in Neural Information Processing Systems, 33, 2020
work page 2020
-
[27]
Jörg Flum and Martin Grohe.Parameterized Complexity Theory. Springer, 2006
work page 2006
-
[28]
The lottery ticket hypothesis: Finding sparse, trainable neural networks
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInternational Conference on Learning Representations, 2018
work page 2018
-
[29]
Stabilizing the lottery ticket hypothesis.arXiv preprint arXiv:1903.01611, 2019
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis.arXiv preprint arXiv:1903.01611, 2019
-
[30]
Linear mode connectivity and the lottery ticket hypothesis
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. InInternational Conference on Machine Learning, pages 3259–3269. PMLR, 2020
work page 2020
-
[31]
It’s raw! audio generation with state-space models
Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It’s raw! audio generation with state-space models. In International Conference on Machine Learning (ICML), 2022
work page 2022
-
[32]
Aaron Gokaslan, Vanya Cohen, Pavlick Ellie, and Stefanie Tellex. Openwebtext corpus, 2019. 12
work page 2019
-
[33]
Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals.Data mining and knowledge discovery, 1(1):29–53, 1997
work page 1997
-
[34]
Andreas Griewank and Andrea Walther.Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008
work page 2008
-
[35]
Hippo: Recurrent memory with optimal polynomial projections
Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. InAdvances in neural information processing systems (NeurIPS), 2020
work page 2020
-
[36]
Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers.Advances in Neural Information Processing Systems, 34, 2021
work page 2021
-
[37]
Efficiently modeling long sequences with structured state spaces
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022
work page 2022
-
[38]
Learning both Weights and Connections for Efficient Neural Networks
Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks.arXiv preprint arXiv:1506.02626, 2015
work page Pith review arXiv 2015
-
[39]
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016
work page 2016
-
[40]
Memory hierarchy design.Computer Architecture: A Quantitative Approach, pages 390–525, 2003
John Hennessy and David Patterson. Memory hierarchy design.Computer Architecture: A Quantitative Approach, pages 390–525, 2003
work page 2003
-
[41]
Sara Hooker. The hardware lottery.arXiv preprint arXiv:2009.06489, 2020
-
[42]
Transformer quality in linear time.arXiv preprint arXiv:2202.10447, 2022
Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V Le. Transformer quality in linear time.arXiv preprint arXiv:2202.10447, 2022
-
[43]
Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Data movement is all you need: A case study on optimizing transformers.Proceedings of Machine Learning and Systems, 3: 711–732, 2021
work page 2021
-
[44]
Dissecting the Ampere GPU architecture via microbenchmarking
Zhe Jia and Peter Van Sandt. Dissecting the Ampere GPU architecture via microbenchmarking. GPU Technology Conference, 2021
work page 2021
- [45]
-
[46]
Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. Dissecting the graphcore IPU architecture via microbenchmarking.arXiv preprint arXiv:1912.03413, 2019
-
[47]
Mimic-iii, a freely accessible critical care database.Scientific data, 3(1):1–9, 2016
Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database.Scientific data, 3(1):1–9, 2016
work page 2016
-
[48]
In-datacenter performance analysis of a tensor processing unit
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. InProceedings of the 44th annual international symposium on computer architecture, pages 1–12, 2017
work page 2017
-
[49]
Displacement ranks of matrices and linear equations
Thomas Kailath, Sun-Yuan Kung, and Martin Morf. Displacement ranks of matrices and linear equations. Journal of Mathematical Analysis and Applications, 68(2):395–407, 1979
work page 1979
-
[50]
Transformers are RNNs: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning, pages 5156–5165. PMLR, 2020. 13
work page 2020
-
[51]
Reformer: The efficient transformer
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. InThe International Conference on Machine Learning (ICML), 2020
work page 2020
-
[52]
Albert: A lite BEDRT for self-supervised learning of language representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite BEDRT for self-supervised learning of language representations. InThe International Conference on Learning Representations (ICLR), 2020
work page 2020
-
[53]
The deep learning compiler: A comprehensive survey
Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. The deep learning compiler: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems, 32(3):708–727, 2020
work page 2020
-
[54]
Sub-linear memory: How to make performers slim.arXiv preprint arXiv:2012.11346, 2020
Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, and Adrian Weller. Sub-linear memory: How to make performers slim.arXiv preprint arXiv:2012.11346, 2020
-
[55]
Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
work page 2017
-
[56]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[57]
Luna: Linear unified nested attention.Advances in Neural Information Processing Systems, 34, 2021
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. Luna: Linear unified nested attention.Advances in Neural Information Processing Systems, 34, 2021
work page 2021
-
[58]
Mlperf training benchmark.Proceedings of Machine Learning and Systems, 2:336–349, 2020
Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, et al. Mlperf training benchmark.Proceedings of Machine Learning and Systems, 2:336–349, 2020
work page 2020
-
[59]
Scalability! but at whatfCOSTg? In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015
Frank McSherry, Michael Isard, and Derek G Murray. Scalability! but at whatfCOSTg? In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015
work page 2015
-
[60]
Online normalizer calculation for softmax,
Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax.arXiv preprint arXiv:1805.02867, 2018
-
[61]
Nvidia Tesla V100 GPU architecture, 2017
NVIDIA. Nvidia Tesla V100 GPU architecture, 2017
work page 2017
-
[62]
Nvidia A100 tensor core GPU architecture, 2020
NVIDIA. Nvidia A100 tensor core GPU architecture, 2020
work page 2020
-
[63]
Nvidia H100 tensor core GPU architecture, 2022
NVIDIA. Nvidia H100 tensor core GPU architecture, 2022
work page 2022
-
[64]
Random butterfly transformations with applications in computational linear algebra
D Stott Parker. Random butterfly transformations with applications in computational linear algebra. 1995
work page 1995
-
[65]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high- performance deep learning library.Advances in neural information processing systems, 32, 2019
work page 2019
-
[66]
Self-attention does not needo(n 2)memory.arXiv preprint arXiv:2112.05682,
Markus N Rabe and Charles Staats. Self-attention does not need 𝑂¹𝑛2º memory. arXiv preprint arXiv:2112.05682, 2021
-
[67]
Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019
work page 2019
-
[68]
Jack Rae and Ali Razavi. Do transformers need deep long-range memory? InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.acl-main.672
work page 2020
-
[69]
Compressive trans- formers for long-range sequence modelling
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive trans- formers for long-range sequence modelling. InThe International Conference on Learning Representations (ICLR), 2020. 14
work page 2020
-
[70]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.Acm Sigplan Notices, 48(6):519–530, 2013
work page 2013
-
[71]
Raghu Ramakrishnan, Johannes Gehrke, and Johannes Gehrke.Database management systems, volume 3. McGraw-Hill New York, 2003
work page 2003
-
[72]
Parallel stochastic gradient algorithms for large-scale matrix completion
Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013
work page 2013
-
[73]
Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, and Bo Dai. Combiner: Full attention transformer with sparse computation cost.Advances in Neural Information Processing Systems, 34, 2021
work page 2021
-
[74]
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers.Transactions of the Association for Computational Linguistics, 9: 53–68, 2021
work page 2021
-
[75]
XLA: Compiling machine learning for peak performance
Amit Sabne. XLA: Compiling machine learning for peak performance. 2020
work page 2020
-
[76]
Movement pruning: Adaptive sparsity by fine-tuning
Victor Sanh, Thomas Wolf, and Alexander M Rush. Movement pruning: Adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683, 2020
-
[77]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[78]
Structured transforms for small-footprint deep learning
Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, pages 3088–3096, 2015
work page 2015
-
[79]
Adaptive attention span in transformers
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span in transformers. InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2019
work page 2019
-
[80]
Long range arena: A benchmark for efficient transformers
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2020
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.