pith. sign in

arxiv: 2605.23751 · v1 · pith:SPHBHSTUnew · submitted 2026-05-22 · 💻 cs.LG

Approaching I/O-optimality for Approximate Attention

Pith reviewed 2026-05-25 04:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords attention mechanismI/O complexityapproximate attentionFlashAttentionlarge language modelsmemory hierarchyalgorithmic efficiencysoftmax
0
0 comments X

The pith

Approximate attention can be computed with I/O cost that grows almost linearly with sequence length n in most regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Attention in large language models requires moving the query, key, and value matrices between fast and slow memory, and current methods pay a cost quadratic in the sequence length n. A trivial lower bound shows that only linear dependence on n and d is necessary just to read the inputs and write the output. The paper adapts an existing approximate-attention framework to produce algorithms whose I/O cost depends almost linearly on n across most combinations of n, d, and fast-memory size M. It also supplies matching lower bounds for each regime, establishing that the new algorithms are close to I/O-optimal. If the claimed scaling holds in practice, attention becomes feasible for much longer sequences without a proportional rise in memory traffic.

Core claim

We present I/O-efficient algorithms for computing the attention matrix that achieve almost-linear dependence on n in most parameter regimes by building on the Alman-Song approximate attention framework, together with matching lower bounds that show the algorithms are close to I/O-optimal.

What carries the argument

I/O-efficient algorithms derived from the Alman and Song approximate attention framework that reorganize computation to keep the number of fast-slow memory transfers almost linear in n.

If this is right

  • In the dominant parameter regimes the total I/O cost is almost linear in n rather than quadratic.
  • Matching lower bounds exist for every combination of n, d, and M, confirming near-optimality.
  • The algorithms inherit the same approximation error bounds as the underlying Alman-Song method.
  • The reduction in data movement applies directly to the forward pass of attention in large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware with a fixed fast-memory size M could handle substantially longer contexts before I/O becomes the bottleneck.
  • The same reorganization technique might apply to other dense matrix operations that currently suffer quadratic I/O.
  • Practical speed-ups would still require an efficient implementation of the Alman-Song primitives inside the I/O scheduler.

Load-bearing premise

The approximation guarantees and computational primitives of the Alman and Song framework translate without extra error when the computation is scheduled under I/O constraints.

What would settle it

Measure the actual number of words transferred between fast and slow memory on an implementation of the new algorithms and check whether the count stays within a small constant factor of the predicted almost-linear bound while the output approximation error remains as claimed.

Figures

Figures reproduced from arXiv: 2605.23751 by Aleksandros Sobczyk, Anastasios Zouzias, P\'al Andr\'as Papp.

Figure 2
Figure 2. Figure 2: Illustration of matrix multiplica￾tion with tiles, for the example U1H. The output matrix is split to tiles of size t1 × t2, and each of these is aggregated in tiles of width t3 (respectively, height t3) in the cor￾responding row strip of U1 (column strip of H). Altogether, the output matrix is com￾puted via n t1 · d t2 · r t3 tiling steps. U1H; note that U1 ∈ R n×r and H ∈ R r×d . The lower and upper boun… view at source ↗
read the original abstract

We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/O's to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to develop I/O-efficient algorithms for computing the attention matrix A = softmax(QK^T / sqrt(d)) V in the external-memory model (fast memory M), achieving I/O cost that depends almost linearly on sequence length n in most parameter regimes. The algorithms adapt the approximate attention framework of Alman and Song; matching lower bounds are proved in each regime to establish near I/O-optimality, improving on the quadratic I/O of FlashAttention variants while respecting the trivial Omega(nd) lower bound.

Significance. If the claimed almost-linear I/O bounds are realized while preserving approximation guarantees, the result would be significant for scaling attention to long sequences under memory constraints. The matching lower bounds are a strength, as they provide evidence of tightness rather than loose upper bounds. However, the significance hinges on whether the Alman-Song primitives (sampling, sketches, matrix-vector products) can be scheduled without incurring hidden quadratic I/O passes.

major comments (2)
  1. [Abstract] The central claim that I/O cost is almost-linear in n rests on lifting Alman-Song primitives into the I/O model without Omega(n^2/M) costs, yet the manuscript provides no explicit I/O recurrence, blocking schedule, or pseudocode showing how random-access sampling or low-rank sketches are tiled when M << n. This is load-bearing for the almost-linear regime.
  2. It is unclear whether the approximation error bounds from Alman and Song translate directly once the primitives are realized with I/O-efficient blocking; any additional passes required for random access would invalidate the claimed savings even if RAM-model correctness holds.
minor comments (1)
  1. [Abstract] Notation for the I/O model parameters (M, B, etc.) and the precise definition of 'almost-linear' should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the I/O scheduling details require clearer exposition. We address the two major comments below and will incorporate the requested clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that I/O cost is almost-linear in n rests on lifting Alman-Song primitives into the I/O model without Omega(n^2/M) costs, yet the manuscript provides no explicit I/O recurrence, blocking schedule, or pseudocode showing how random-access sampling or low-rank sketches are tiled when M << n. This is load-bearing for the almost-linear regime.

    Authors: We agree that explicit I/O recurrences and schedules are essential to substantiate the almost-linear claim. The manuscript derives these recurrences in Sections 3–4 (e.g., blocked matrix sketching and sampling via a constant number of passes with cost O(nd log n / M) per primitive), but we acknowledge the presentation could be more self-contained. In the revision we will add a dedicated subsection with the full I/O recurrence for each primitive, a blocking schedule diagram, and pseudocode showing how random accesses are resolved via column-wise tiling and sorting passes that remain within the almost-linear bound. revision: yes

  2. Referee: It is unclear whether the approximation error bounds from Alman and Song translate directly once the primitives are realized with I/O-efficient blocking; any additional passes required for random access would invalidate the claimed savings even if RAM-model correctness holds.

    Authors: The error bounds translate directly because the I/O-efficient realizations compute identical arithmetic results to the RAM-model Alman–Song algorithms; only the order of memory accesses changes. We prove that the schedule uses a constant (parameter-regime-dependent) number of passes with no extra random-access overhead beyond the stated bound. In the revision we will insert a short lemma establishing equivalence of the computed matrices and therefore identical approximation guarantees. revision: yes

Circularity Check

0 steps flagged

No significant circularity; I/O bounds and lower bounds derived independently

full rationale

The paper constructs I/O-efficient algorithms by lifting the external Alman-Song approximate attention framework and separately proves matching lower bounds in each regime. No equations, fitted parameters, or self-citations reduce any claimed bound to a tautology by construction. The lower bounds are explicitly independent of the upper-bound construction, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; no explicit free parameters, ad-hoc constants, or new entities are introduced. The work inherits its approximation model from prior literature.

axioms (1)
  • domain assumption Standard external-memory model with fast memory size M and block transfers
    Implicit in all I/O-complexity statements about attention

pith-pipeline@v0.9.0 · 5716 in / 1194 out tokens · 19667 ms · 2026-05-25T04:54:26.844456+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation

    Amol Aggarwal and Josh Alman. Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation. In37th Computational Complexity Conference (CCC), page 1, 2022

  2. [2]

    More asymmetry yields faster matrix multiplication

    Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. InProceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039. SIAM, 2025

  3. [3]

    Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023

    Josh Alman and Zhao Song. Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023

  4. [4]

    Improving the leading constant of matrix multiplication

    Josh Alman and Hantao Yu. Improving the leading constant of matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1933–1971. SIAM, 2025

  5. [5]

    Toni Böhnlein, Pál András Papp, and Albert-Jan N. Yzelman. Red-blue pebbling with multiple processors: Time, communication and memory trade-offs. InInternational Colloquium on Structural Information and Communication Complexity (SIROCCO), pages 109–126. Springer, 2025

  6. [6]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

  7. [7]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. InInternational Conference on Learning Representations (ICLR), 2021

  8. [8]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  9. [9]

    Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022

  10. [10]

    Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020

    Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020

  11. [11]

    Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy

    Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vas- silevska Williams. Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. In9th Innovations in Theoretical Computer Science Confer- ence (ITCS), pages 34–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018

  12. [12]

    Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers

    Erik D Demaine and Quanquan C Liu. Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers. InProceedings of the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 195–204, 2018

  13. [13]

    Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022

    Jeff Erickson, Ivor Van Der Hoog, and Tillmann Miltzow. Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022

  14. [14]

    Hyperattention: Long-context attention in near-linear time

    Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. InInternational Conference on Learning Representations (ICLR), 2024

  15. [15]

    I/O complexity: The red-blue pebble game

    Jia-Wei Hong and Hsiang-Tsung Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing (STOC), pages 326–333, 1981

  16. [16]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning (ICML), pages 5156–5165. PMLR, 2020. 10

  17. [17]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations (ICLR), 2020

  18. [18]

    Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

    Grzegorz Kwasniewski, Marko Kabi´c, Maciej Besta, Joost VandeV ondele, Raffaele Solcà, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–22, 2019

  19. [19]

    The impact of partial computations on the red-blue pebble game

    Pál András Papp, Aleksandros Sobczyk, and Albert-Jan N Yzelman. The impact of partial computations on the red-blue pebble game. InProceedings of the 37th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 328–338, 2025

  20. [20]

    On the hardness of red-blue pebble games

    Pál András Papp and Roger Wattenhofer. On the hardness of red-blue pebble games. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 419–429, 2020

  21. [21]

    I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024

    Barna Saha and Christopher Ye. I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024

  22. [22]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024

  23. [23]

    I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026

    Aleksandros Sobczyk. I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026

  24. [24]

    Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017

  25. [25]

    Nyströmformer: A nyström-based algorithm for approximating self-attention

    Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021

  26. [26]

    Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi- ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020. A Proof of Key Lemma 3.2 A.1 Upper bound proof The upper bound proof...

  27. [27]

    We also have M=O(g δ2)≤g δ′ 2 for some δ′ 2 > δ 2. Hence we only need to show 15 gδ′ 2 ≤4·g· e 2 (gδ′ 1) ; this indeed holds asymptotically for any constants δ′ 1, δ′ 2, since the left side is polynomial, while the right side is exponential ing. It remains to show the I/O strategy that proves the upper bound in Equation (3). This is very similar to the up...

  28. [28]

    The left side is then upper bounded by 1 for anyg≥1, and hence it is sufficient to haved≥5·g, which holds due to our assumption. We note that if we modify the restriction condition from g 2 to a smaller linear factor of g, the above proof can be applied similarly; this would allow us to loosen the condition d≥5·g to obtain a slightly smaller constant fact...