Approaching I/O-optimality for Approximate Attention
Pith reviewed 2026-05-25 04:54 UTC · model grok-4.3
The pith
Approximate attention can be computed with I/O cost that grows almost linearly with sequence length n in most regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present I/O-efficient algorithms for computing the attention matrix that achieve almost-linear dependence on n in most parameter regimes by building on the Alman-Song approximate attention framework, together with matching lower bounds that show the algorithms are close to I/O-optimal.
What carries the argument
I/O-efficient algorithms derived from the Alman and Song approximate attention framework that reorganize computation to keep the number of fast-slow memory transfers almost linear in n.
If this is right
- In the dominant parameter regimes the total I/O cost is almost linear in n rather than quadratic.
- Matching lower bounds exist for every combination of n, d, and M, confirming near-optimality.
- The algorithms inherit the same approximation error bounds as the underlying Alman-Song method.
- The reduction in data movement applies directly to the forward pass of attention in large language models.
Where Pith is reading between the lines
- Hardware with a fixed fast-memory size M could handle substantially longer contexts before I/O becomes the bottleneck.
- The same reorganization technique might apply to other dense matrix operations that currently suffer quadratic I/O.
- Practical speed-ups would still require an efficient implementation of the Alman-Song primitives inside the I/O scheduler.
Load-bearing premise
The approximation guarantees and computational primitives of the Alman and Song framework translate without extra error when the computation is scheduled under I/O constraints.
What would settle it
Measure the actual number of words transferred between fast and slow memory on an implementation of the new algorithms and check whether the count stays within a small constant factor of the predicted almost-linear bound while the output approximation error remains as claimed.
Figures
read the original abstract
We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/O's to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to develop I/O-efficient algorithms for computing the attention matrix A = softmax(QK^T / sqrt(d)) V in the external-memory model (fast memory M), achieving I/O cost that depends almost linearly on sequence length n in most parameter regimes. The algorithms adapt the approximate attention framework of Alman and Song; matching lower bounds are proved in each regime to establish near I/O-optimality, improving on the quadratic I/O of FlashAttention variants while respecting the trivial Omega(nd) lower bound.
Significance. If the claimed almost-linear I/O bounds are realized while preserving approximation guarantees, the result would be significant for scaling attention to long sequences under memory constraints. The matching lower bounds are a strength, as they provide evidence of tightness rather than loose upper bounds. However, the significance hinges on whether the Alman-Song primitives (sampling, sketches, matrix-vector products) can be scheduled without incurring hidden quadratic I/O passes.
major comments (2)
- [Abstract] The central claim that I/O cost is almost-linear in n rests on lifting Alman-Song primitives into the I/O model without Omega(n^2/M) costs, yet the manuscript provides no explicit I/O recurrence, blocking schedule, or pseudocode showing how random-access sampling or low-rank sketches are tiled when M << n. This is load-bearing for the almost-linear regime.
- It is unclear whether the approximation error bounds from Alman and Song translate directly once the primitives are realized with I/O-efficient blocking; any additional passes required for random access would invalidate the claimed savings even if RAM-model correctness holds.
minor comments (1)
- [Abstract] Notation for the I/O model parameters (M, B, etc.) and the precise definition of 'almost-linear' should be introduced earlier and used consistently.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying areas where the I/O scheduling details require clearer exposition. We address the two major comments below and will incorporate the requested clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [Abstract] The central claim that I/O cost is almost-linear in n rests on lifting Alman-Song primitives into the I/O model without Omega(n^2/M) costs, yet the manuscript provides no explicit I/O recurrence, blocking schedule, or pseudocode showing how random-access sampling or low-rank sketches are tiled when M << n. This is load-bearing for the almost-linear regime.
Authors: We agree that explicit I/O recurrences and schedules are essential to substantiate the almost-linear claim. The manuscript derives these recurrences in Sections 3–4 (e.g., blocked matrix sketching and sampling via a constant number of passes with cost O(nd log n / M) per primitive), but we acknowledge the presentation could be more self-contained. In the revision we will add a dedicated subsection with the full I/O recurrence for each primitive, a blocking schedule diagram, and pseudocode showing how random accesses are resolved via column-wise tiling and sorting passes that remain within the almost-linear bound. revision: yes
-
Referee: It is unclear whether the approximation error bounds from Alman and Song translate directly once the primitives are realized with I/O-efficient blocking; any additional passes required for random access would invalidate the claimed savings even if RAM-model correctness holds.
Authors: The error bounds translate directly because the I/O-efficient realizations compute identical arithmetic results to the RAM-model Alman–Song algorithms; only the order of memory accesses changes. We prove that the schedule uses a constant (parameter-regime-dependent) number of passes with no extra random-access overhead beyond the stated bound. In the revision we will insert a short lemma establishing equivalence of the computed matrices and therefore identical approximation guarantees. revision: yes
Circularity Check
No significant circularity; I/O bounds and lower bounds derived independently
full rationale
The paper constructs I/O-efficient algorithms by lifting the external Alman-Song approximate attention framework and separately proves matching lower bounds in each regime. No equations, fitted parameters, or self-citations reduce any claimed bound to a tautology by construction. The lower bounds are explicitly independent of the upper-bound construction, and the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard external-memory model with fast memory size M and block transfers
Reference graph
Works this paper leans on
-
[1]
Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation
Amol Aggarwal and Josh Alman. Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation. In37th Computational Complexity Conference (CCC), page 1, 2022
work page 2022
-
[2]
More asymmetry yields faster matrix multiplication
Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. InProceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039. SIAM, 2025
work page 2025
-
[3]
Josh Alman and Zhao Song. Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023
work page 2023
-
[4]
Improving the leading constant of matrix multiplication
Josh Alman and Hantao Yu. Improving the leading constant of matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1933–1971. SIAM, 2025
work page 2025
-
[5]
Toni Böhnlein, Pál András Papp, and Albert-Jan N. Yzelman. Red-blue pebbling with multiple processors: Time, communication and memory trade-offs. InInternational Colloquium on Structural Information and Communication Complexity (SIROCCO), pages 109–126. Springer, 2025
work page 2025
-
[6]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[7]
Rethinking attention with performers
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
-
[8]
Flashattention-2: Faster attention with better parallelism and work partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[9]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022
work page 2022
-
[10]
Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020
work page 2020
-
[11]
Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vas- silevska Williams. Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. In9th Innovations in Theoretical Computer Science Confer- ence (ITCS), pages 34–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018
work page 2018
-
[12]
Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers
Erik D Demaine and Quanquan C Liu. Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers. InProceedings of the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 195–204, 2018
work page 2018
-
[13]
Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022
Jeff Erickson, Ivor Van Der Hoog, and Tillmann Miltzow. Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022
work page 2022
-
[14]
Hyperattention: Long-context attention in near-linear time
Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. InInternational Conference on Learning Representations (ICLR), 2024
work page 2024
-
[15]
I/O complexity: The red-blue pebble game
Jia-Wei Hong and Hsiang-Tsung Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing (STOC), pages 326–333, 1981
work page 1981
-
[16]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning (ICML), pages 5156–5165. PMLR, 2020. 10
work page 2020
-
[17]
Reformer: The efficient transformer
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations (ICLR), 2020
work page 2020
-
[18]
Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
Grzegorz Kwasniewski, Marko Kabi´c, Maciej Besta, Joost VandeV ondele, Raffaele Solcà, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–22, 2019
work page 2019
-
[19]
The impact of partial computations on the red-blue pebble game
Pál András Papp, Aleksandros Sobczyk, and Albert-Jan N Yzelman. The impact of partial computations on the red-blue pebble game. InProceedings of the 37th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 328–338, 2025
work page 2025
-
[20]
On the hardness of red-blue pebble games
Pál András Papp and Roger Wattenhofer. On the hardness of red-blue pebble games. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 419–429, 2020
work page 2020
-
[21]
Barna Saha and Christopher Ye. I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024
work page 2024
-
[22]
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024
work page 2024
-
[23]
Aleksandros Sobczyk. I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026
work page 2026
-
[24]
Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017
work page 2017
-
[25]
Nyströmformer: A nyström-based algorithm for approximating self-attention
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021
work page 2021
-
[26]
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi- ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020. A Proof of Key Lemma 3.2 A.1 Upper bound proof The upper bound proof...
work page 2020
-
[27]
We also have M=O(g δ2)≤g δ′ 2 for some δ′ 2 > δ 2. Hence we only need to show 15 gδ′ 2 ≤4·g· e 2 (gδ′ 1) ; this indeed holds asymptotically for any constants δ′ 1, δ′ 2, since the left side is polynomial, while the right side is exponential ing. It remains to show the I/O strategy that proves the upper bound in Equation (3). This is very similar to the up...
-
[28]
The left side is then upper bounded by 1 for anyg≥1, and hence it is sufficient to haved≥5·g, which holds due to our assumption. We note that if we modify the restriction condition from g 2 to a smaller linear factor of g, the above proof can be applied similarly; this would allow us to loosen the condition d≥5·g to obtain a slightly smaller constant fact...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.