Approaching I/O-optimality for Approximate Attention

Aleksandros Sobczyk; Anastasios Zouzias; P\'al Andr\'as Papp

arxiv: 2605.23751 · v1 · pith:SPHBHSTUnew · submitted 2026-05-22 · 💻 cs.LG

Approaching I/O-optimality for Approximate Attention

P\'al Andr\'as Papp , Aleksandros Sobczyk , Anastasios Zouzias This is my paper

Pith reviewed 2026-05-25 04:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords attention mechanismI/O complexityapproximate attentionFlashAttentionlarge language modelsmemory hierarchyalgorithmic efficiencysoftmax

0 comments

The pith

Approximate attention can be computed with I/O cost that grows almost linearly with sequence length n in most regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Attention in large language models requires moving the query, key, and value matrices between fast and slow memory, and current methods pay a cost quadratic in the sequence length n. A trivial lower bound shows that only linear dependence on n and d is necessary just to read the inputs and write the output. The paper adapts an existing approximate-attention framework to produce algorithms whose I/O cost depends almost linearly on n across most combinations of n, d, and fast-memory size M. It also supplies matching lower bounds for each regime, establishing that the new algorithms are close to I/O-optimal. If the claimed scaling holds in practice, attention becomes feasible for much longer sequences without a proportional rise in memory traffic.

Core claim

We present I/O-efficient algorithms for computing the attention matrix that achieve almost-linear dependence on n in most parameter regimes by building on the Alman-Song approximate attention framework, together with matching lower bounds that show the algorithms are close to I/O-optimal.

What carries the argument

I/O-efficient algorithms derived from the Alman and Song approximate attention framework that reorganize computation to keep the number of fast-slow memory transfers almost linear in n.

If this is right

In the dominant parameter regimes the total I/O cost is almost linear in n rather than quadratic.
Matching lower bounds exist for every combination of n, d, and M, confirming near-optimality.
The algorithms inherit the same approximation error bounds as the underlying Alman-Song method.
The reduction in data movement applies directly to the forward pass of attention in large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware with a fixed fast-memory size M could handle substantially longer contexts before I/O becomes the bottleneck.
The same reorganization technique might apply to other dense matrix operations that currently suffer quadratic I/O.
Practical speed-ups would still require an efficient implementation of the Alman-Song primitives inside the I/O scheduler.

Load-bearing premise

The approximation guarantees and computational primitives of the Alman and Song framework translate without extra error when the computation is scheduled under I/O constraints.

What would settle it

Measure the actual number of words transferred between fast and slow memory on an implementation of the new algorithms and check whether the count stays within a small constant factor of the predicted almost-linear bound while the output approximation error remains as claimed.

Figures

Figures reproduced from arXiv: 2605.23751 by Aleksandros Sobczyk, Anastasios Zouzias, P\'al Andr\'as Papp.

**Figure 2.** Figure 2: Illustration of matrix multiplication with tiles, for the example U1H. The output matrix is split to tiles of size t1 × t2, and each of these is aggregated in tiles of width t3 (respectively, height t3) in the corresponding row strip of U1 (column strip of H). Altogether, the output matrix is computed via n t1 · d t2 · r t3 tiling steps. U1H; note that U1 ∈ R n×r and H ∈ R r×d . The lower and upper boun… view at source ↗

read the original abstract

We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/O's to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They adapt Alman-Song to claim near-linear I/O for approximate attention plus matching lower bounds per regime, but the external-memory scheduling of the primitives is the part that still needs verification.

read the letter

The main takeaway is that this paper moves the Alman-Song approximate attention framework into the I/O model and gets an almost-linear dependence on n in most regimes, together with lower bounds that match in each regime. That combination is the actual new piece relative to FlashAttention work. The lower bounds are presented as independent, which strengthens the optimality claim and is worth having on record. The upper-bound construction inherits its approximation properties from Alman-Song, so the correctness side does not need to be re-proven from scratch. What is less clear is how the sampling, sketching, and matrix-vector steps are actually blocked and scheduled so that they avoid extra passes when M is much smaller than n. If any of those primitives force random or repeated accesses that cannot be tiled, the I/O cost could still hit quadratic in practice even if the RAM-model analysis stays intact. The abstract does not show the recurrences or pseudocode, so the translation step remains the soft spot. The rest of the paper looks like standard I/O analysis once that step is granted. This is aimed at people who care about memory-hierarchy costs in long-context training and inference. A reader who already knows the Alman-Song primitives and the external-memory model will get the most out of the regime-specific bounds. The work is coherent enough on its own terms to deserve a serious referee, even if the scheduling details end up needing revision.

Referee Report

2 major / 1 minor

Summary. The paper claims to develop I/O-efficient algorithms for computing the attention matrix A = softmax(QK^T / sqrt(d)) V in the external-memory model (fast memory M), achieving I/O cost that depends almost linearly on sequence length n in most parameter regimes. The algorithms adapt the approximate attention framework of Alman and Song; matching lower bounds are proved in each regime to establish near I/O-optimality, improving on the quadratic I/O of FlashAttention variants while respecting the trivial Omega(nd) lower bound.

Significance. If the claimed almost-linear I/O bounds are realized while preserving approximation guarantees, the result would be significant for scaling attention to long sequences under memory constraints. The matching lower bounds are a strength, as they provide evidence of tightness rather than loose upper bounds. However, the significance hinges on whether the Alman-Song primitives (sampling, sketches, matrix-vector products) can be scheduled without incurring hidden quadratic I/O passes.

major comments (2)

[Abstract] The central claim that I/O cost is almost-linear in n rests on lifting Alman-Song primitives into the I/O model without Omega(n^2/M) costs, yet the manuscript provides no explicit I/O recurrence, blocking schedule, or pseudocode showing how random-access sampling or low-rank sketches are tiled when M << n. This is load-bearing for the almost-linear regime.
It is unclear whether the approximation error bounds from Alman and Song translate directly once the primitives are realized with I/O-efficient blocking; any additional passes required for random access would invalidate the claimed savings even if RAM-model correctness holds.

minor comments (1)

[Abstract] Notation for the I/O model parameters (M, B, etc.) and the precise definition of 'almost-linear' should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the I/O scheduling details require clearer exposition. We address the two major comments below and will incorporate the requested clarifications in the revised manuscript.

read point-by-point responses

Referee: [Abstract] The central claim that I/O cost is almost-linear in n rests on lifting Alman-Song primitives into the I/O model without Omega(n^2/M) costs, yet the manuscript provides no explicit I/O recurrence, blocking schedule, or pseudocode showing how random-access sampling or low-rank sketches are tiled when M << n. This is load-bearing for the almost-linear regime.

Authors: We agree that explicit I/O recurrences and schedules are essential to substantiate the almost-linear claim. The manuscript derives these recurrences in Sections 3–4 (e.g., blocked matrix sketching and sampling via a constant number of passes with cost O(nd log n / M) per primitive), but we acknowledge the presentation could be more self-contained. In the revision we will add a dedicated subsection with the full I/O recurrence for each primitive, a blocking schedule diagram, and pseudocode showing how random accesses are resolved via column-wise tiling and sorting passes that remain within the almost-linear bound. revision: yes
Referee: It is unclear whether the approximation error bounds from Alman and Song translate directly once the primitives are realized with I/O-efficient blocking; any additional passes required for random access would invalidate the claimed savings even if RAM-model correctness holds.

Authors: The error bounds translate directly because the I/O-efficient realizations compute identical arithmetic results to the RAM-model Alman–Song algorithms; only the order of memory accesses changes. We prove that the schedule uses a constant (parameter-regime-dependent) number of passes with no extra random-access overhead beyond the stated bound. In the revision we will insert a short lemma establishing equivalence of the computed matrices and therefore identical approximation guarantees. revision: yes

Circularity Check

0 steps flagged

No significant circularity; I/O bounds and lower bounds derived independently

full rationale

The paper constructs I/O-efficient algorithms by lifting the external Alman-Song approximate attention framework and separately proves matching lower bounds in each regime. No equations, fitted parameters, or self-citations reduce any claimed bound to a tautology by construction. The lower bounds are explicitly independent of the upper-bound construction, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; no explicit free parameters, ad-hoc constants, or new entities are introduced. The work inherits its approximation model from prior literature.

axioms (1)

domain assumption Standard external-memory model with fast memory size M and block transfers
Implicit in all I/O-complexity statements about attention

pith-pipeline@v0.9.0 · 5716 in / 1194 out tokens · 19667 ms · 2026-05-25T04:54:26.844456+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1]

Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation

Amol Aggarwal and Josh Alman. Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation. In37th Computational Complexity Conference (CCC), page 1, 2022

work page 2022
[2]

More asymmetry yields faster matrix multiplication

Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. InProceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039. SIAM, 2025

work page 2025
[3]

Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023

Josh Alman and Zhao Song. Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023

work page 2023
[4]

Improving the leading constant of matrix multiplication

Josh Alman and Hantao Yu. Improving the leading constant of matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1933–1971. SIAM, 2025

work page 2025
[5]

Toni Böhnlein, Pál András Papp, and Albert-Jan N. Yzelman. Red-blue pebbling with multiple processors: Time, communication and memory trade-offs. InInternational Colloquium on Structural Information and Communication Complexity (SIROCCO), pages 109–126. Springer, 2025

work page 2025
[6]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[7]

Rethinking attention with performers

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[8]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

work page 2024
[9]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022

work page 2022
[10]

Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020

Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020

work page 2020
[11]

Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy

Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vas- silevska Williams. Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. In9th Innovations in Theoretical Computer Science Confer- ence (ITCS), pages 34–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018

work page 2018
[12]

Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers

Erik D Demaine and Quanquan C Liu. Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers. InProceedings of the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 195–204, 2018

work page 2018
[13]

Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022

Jeff Erickson, Ivor Van Der Hoog, and Tillmann Miltzow. Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022

work page 2022
[14]

Hyperattention: Long-context attention in near-linear time

Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[15]

I/O complexity: The red-blue pebble game

Jia-Wei Hong and Hsiang-Tsung Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing (STOC), pages 326–333, 1981

work page 1981
[16]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning (ICML), pages 5156–5165. PMLR, 2020. 10

work page 2020
[17]

Reformer: The efficient transformer

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations (ICLR), 2020

work page 2020
[18]

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Grzegorz Kwasniewski, Marko Kabi´c, Maciej Besta, Joost VandeV ondele, Raffaele Solcà, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–22, 2019

work page 2019
[19]

The impact of partial computations on the red-blue pebble game

Pál András Papp, Aleksandros Sobczyk, and Albert-Jan N Yzelman. The impact of partial computations on the red-blue pebble game. InProceedings of the 37th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 328–338, 2025

work page 2025
[20]

On the hardness of red-blue pebble games

Pál András Papp and Roger Wattenhofer. On the hardness of red-blue pebble games. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 419–429, 2020

work page 2020
[21]

I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024

Barna Saha and Christopher Ye. I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024

work page 2024
[22]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024

work page 2024
[23]

I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026

Aleksandros Sobczyk. I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026

work page 2026
[24]

Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017

work page 2017
[25]

Nyströmformer: A nyström-based algorithm for approximating self-attention

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021

work page 2021
[26]

Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi- ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020. A Proof of Key Lemma 3.2 A.1 Upper bound proof The upper bound proof...

work page 2020
[27]

We also have M=O(g δ2)≤g δ′ 2 for some δ′ 2 > δ 2. Hence we only need to show 15 gδ′ 2 ≤4·g· e 2 (gδ′ 1) ; this indeed holds asymptotically for any constants δ′ 1, δ′ 2, since the left side is polynomial, while the right side is exponential ing. It remains to show the I/O strategy that proves the upper bound in Equation (3). This is very similar to the up...

work page
[28]

The left side is then upper bounded by 1 for anyg≥1, and hence it is sufficient to haved≥5·g, which holds due to our assumption. We note that if we modify the restriction condition from g 2 to a smaller linear factor of g, the above proof can be applied similarly; this would allow us to loosen the condition d≥5·g to obtain a slightly smaller constant fact...

work page

[1] [1]

Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation

Amol Aggarwal and Josh Alman. Optimal-degree polynomial approximations for exponentials and gaussian kernel density estimation. In37th Computational Complexity Conference (CCC), page 1, 2022

work page 2022

[2] [2]

More asymmetry yields faster matrix multiplication

Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. InProceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039. SIAM, 2025

work page 2025

[3] [3]

Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023

Josh Alman and Zhao Song. Fast attention requires bounded entries.Advances in Neural Information Processing Systems (NeurIPS), 36:63117–63135, 2023

work page 2023

[4] [4]

Improving the leading constant of matrix multiplication

Josh Alman and Hantao Yu. Improving the leading constant of matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1933–1971. SIAM, 2025

work page 2025

[5] [5]

Toni Böhnlein, Pál András Papp, and Albert-Jan N. Yzelman. Red-blue pebbling with multiple processors: Time, communication and memory trade-offs. InInternational Colloquium on Structural Information and Communication Complexity (SIROCCO), pages 109–126. Springer, 2025

work page 2025

[6] [6]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[7] [7]

Rethinking attention with performers

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021

[8] [8]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

work page 2024

[9] [9]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems (NeurIPS), 35:16344–16359, 2022

work page 2022

[10] [10]

Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020

Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering.Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020

work page 2020

[11] [11]

Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy

Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vas- silevska Williams. Fine-grained i/o complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. In9th Innovations in Theoretical Computer Science Confer- ence (ITCS), pages 34–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018

work page 2018

[12] [12]

Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers

Erik D Demaine and Quanquan C Liu. Red-blue pebble game: Complexity of computing the trade-off between cache size and memory transfers. InProceedings of the 30th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 195–204, 2018

work page 2018

[13] [13]

Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022

Jeff Erickson, Ivor Van Der Hoog, and Tillmann Miltzow. Smoothing the gap between np and er.SIAM Journal on Computing, 53(6):FOCS20–102, 2022

work page 2022

[14] [14]

Hyperattention: Long-context attention in near-linear time

Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024

[15] [15]

I/O complexity: The red-blue pebble game

Jia-Wei Hong and Hsiang-Tsung Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing (STOC), pages 326–333, 1981

work page 1981

[16] [16]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning (ICML), pages 5156–5165. PMLR, 2020. 10

work page 2020

[17] [17]

Reformer: The efficient transformer

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations (ICLR), 2020

work page 2020

[18] [18]

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Grzegorz Kwasniewski, Marko Kabi´c, Maciej Besta, Joost VandeV ondele, Raffaele Solcà, and Torsten Hoefler. Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–22, 2019

work page 2019

[19] [19]

The impact of partial computations on the red-blue pebble game

Pál András Papp, Aleksandros Sobczyk, and Albert-Jan N Yzelman. The impact of partial computations on the red-blue pebble game. InProceedings of the 37th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 328–338, 2025

work page 2025

[20] [20]

On the hardness of red-blue pebble games

Pál András Papp and Roger Wattenhofer. On the hardness of red-blue pebble games. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 419–429, 2020

work page 2020

[21] [21]

I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024

Barna Saha and Christopher Ye. I/O complexity of attention, or how optimal is flashattention? InProceedings of the 41st International Conference on Machine Learning (ICML), pages 43024–43042, 2024

work page 2024

[22] [22]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems (NeurIPS), 37:68658–68685, 2024

work page 2024

[23] [23]

I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026

Aleksandros Sobczyk. I/o complexity and pebble games with partial computations.Information Processing Letters, page 106637, 2026

work page 2026

[24] [24]

Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems (NeurIPS), 30, 2017

work page 2017

[25] [25]

Nyströmformer: A nyström-based algorithm for approximating self-attention

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021

work page 2021

[26] [26]

Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi- ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences.Advances in neural information processing systems (NeurIPS), 33:17283– 17297, 2020. A Proof of Key Lemma 3.2 A.1 Upper bound proof The upper bound proof...

work page 2020

[27] [27]

We also have M=O(g δ2)≤g δ′ 2 for some δ′ 2 > δ 2. Hence we only need to show 15 gδ′ 2 ≤4·g· e 2 (gδ′ 1) ; this indeed holds asymptotically for any constants δ′ 1, δ′ 2, since the left side is polynomial, while the right side is exponential ing. It remains to show the I/O strategy that proves the upper bound in Equation (3). This is very similar to the up...

work page

[28] [28]

The left side is then upper bounded by 1 for anyg≥1, and hence it is sufficient to haved≥5·g, which holds due to our assumption. We note that if we modify the restriction condition from g 2 to a smaller linear factor of g, the above proof can be applied similarly; this would allow us to loosen the condition d≥5·g to obtain a slightly smaller constant fact...

work page