pith. machine review for the scientific record.

arxiv: 2604.20819 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.DC

Recognition: unknown

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:12 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords: self-attention · long context · memory efficiency · streaming computation · exact attention · large language models · workload scheduling · out-of-memory avoidance

The pith

Exact self-attention over billion-token sequences runs on a single GPU by streaming independent subsequence computations that recompose without error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to remove the memory limit that currently prevents exact self-attention from running on sequences longer than what fits in device memory. It does so by introducing a decomposition that splits the full attention calculation into smaller, independent subsequence tasks. These tasks can be scheduled one after another, or across available hardware, while still producing a result numerically identical to that of the original monolithic computation. A reader would care because the quadratic memory cost of attention has been a hard barrier for long-context models, and this method claims to bypass it without approximation or extra communication. If the claim holds, exact attention becomes feasible on standard single-GPU setups for inputs of a billion tokens or more.

Core claim

The central claim is that CQS Divide, derived from cyclic quorum sets theory, decomposes self-attention into a set of independent subsequence computations whose results recompose to exactly the same output as attention over the entire sequence. Exploiting this property, Stream-CQSA partitions the work into subproblems sized to fit any given memory budget, turning the operation into a collection of schedulable tasks that execute without inter-device communication and without altering the mathematical definition of attention.
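The exactness property this claim rests on can be illustrated with the standard log-sum-exp merge of partial softmax statistics (the same identity FlashAttention exploits). The sketch below is ours, not the paper's: the CQS construction governs which key blocks each task sees, and is not reproduced here. A minimal NumPy version over two disjoint key/value blocks:

```python
import numpy as np

def partial_attention(q, k_blk, v_blk):
    """Unnormalized attention of q against one key/value block.
    Returns (output, row max, exp-sum) so blocks can be merged exactly."""
    s = q @ k_blk.T
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)                          # stabilized exponentials
    return p @ v_blk, m, p.sum(axis=-1, keepdims=True)

def merge(o1, m1, l1, o2, m2, l2):
    """Combine two partial results over disjoint key blocks; exact, not approximate."""
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)    # rescale both to the shared max
    return o1 * a1 + o2 * a2, m, l1 * a1 + l2 * a2

# Split the keys in two, compute each half independently, merge, normalize.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, 8)) for n in (4, 16, 16))
o, m, l = merge(*partial_attention(q, k[:8], v[:8]),
                *partial_attention(q, k[8:], v[8:]))
streamed = o / l

s = q @ k.T                                    # monolithic reference
p = np.exp(s - s.max(axis=-1, keepdims=True))
reference = (p / p.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(streamed, reference)        # identical up to rounding
```

The merge is associative, which is what lets the subproblems run in any order or on any device.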

What carries the argument

CQS Divide: the decomposition operation that splits attention into independent subsequence tasks whose outputs exactly reconstruct the full attention result.

If this is right

  • Memory usage scales linearly with the chosen subsequence size rather than quadratically with total sequence length.
  • Exact attention over sequences of a billion tokens or longer becomes executable on one GPU through streaming.
  • No approximation is introduced and the original mathematical definition of attention remains unchanged.
  • The computation can be partitioned across multiple devices or executed sequentially without any data exchange between devices.
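The first bullet can be made concrete with back-of-envelope arithmetic (our numbers, not the paper's; the actual tile sizes may differ): the fp16 score matrix for a full billion-token sequence versus a single fixed-size subproblem.

```python
# Bytes to hold an fp16 score matrix: full N x N vs. one c x c subproblem.
# Illustrative arithmetic only; chunk size c is a hypothetical budget.
FP16 = 2                                  # bytes per element

def score_bytes(rows, cols, elem=FP16):
    return rows * cols * elem

N = 1_000_000_000                         # a billion-token sequence
c = 32_768                                # one memory-budgeted chunk

full = score_bytes(N, N)                  # quadratic in sequence length
task = score_bytes(c, c)                  # fixed, independent of N

print(f"full score matrix: {full / 1e18:.1f} EB")    # exabytes
print(f"per-task block:    {task / 2**30:.1f} GiB")
```

Under these assumptions the monolithic score matrix would need roughly two exabytes, while each streamed task stays at a few gigabytes regardless of total sequence length.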

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition principle could be examined for other quadratic-cost operations that currently face similar memory ceilings in neural network workloads.
  • Because tasks require no inter-device communication, the approach may support distributed execution with lower synchronization overhead than typical model-parallel methods.
  • Storage rather than RAM could become the practical limit on sequence length if the streaming scheduler is paired with efficient disk-backed tensor management.

Load-bearing premise

The CQS Divide operation decomposes attention into independent subsequence computations that recompose to exactly the full attention result without any discrepancy.

What would settle it

Compute attention on a sequence short enough to fit entirely in memory using both the standard implementation and the streaming scheduler, then verify that the outputs agree to within floating-point rounding.
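That check is easy to sketch. Below, an online-softmax streaming pass that holds one key chunk in memory at a time stands in for the actual Stream-CQSA scheduler (our simplification, not the paper's implementation) and is compared against a monolithic reference:

```python
import numpy as np

def reference_attention(q, k, v):
    """Monolithic softmax(QK^T / sqrt(d)) V."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def streaming_attention(q, k, v, chunk):
    """Exact attention visiting keys one chunk at a time (online softmax).
    Peak extra memory is O(chunk), not O(len(k))."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)      # running row max
    l = np.zeros((q.shape[0], 1))              # running exp-sum
    o = np.zeros((q.shape[0], v.shape[-1]))    # running unnormalized output
    for i in range(0, k.shape[0], chunk):
        s = q @ k[i:i + chunk].T * scale
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        r = np.exp(m - m_new)                  # rescale old accumulators
        p = np.exp(s - m_new)
        o = o * r + p @ v[i:i + chunk]
        l = l * r + p.sum(axis=-1, keepdims=True)
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, 16)) for n in (8, 4096, 4096))
assert np.allclose(streaming_attention(q, k, v, chunk=256),
                   reference_attention(q, k, v))
```

Any discrepancy beyond rounding on such a test would falsify the load-bearing premise directly.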

Figures

Figures reproduced from arXiv: 2604.20819 by Joshua M. Akey, Yiming Bian.

Figure 1: CQS Divide. In this work, we propose a combinatorial decomposition of attention (CQS Divide) and a framework that exploits this decomposition (Stream-CQSA). …

Figure 2: CQSA forward pass with c = 7 and I = (0, 1, 3). In step 3, we highlight a chunk pair (0, 0). It appears in subsequences 0, 4, and 6, but only one of them should be merged in the end; the rest should be masked out to ensure correctness. This is why the CQS mask (Mi) is necessary. We also provide the coverage of the attention matrix from the global view to show that all chunk pairs are covered exactly once. …

Figure 3: CQS masking on R = QK⊤ with N = 49, c = 7, and I = (0, 1, 3). Iteration 1 forms subsequences of 21 tokens with masked diagonal redundancies. Iteration 2 further partitions each subsequence into smaller chunks (size 3), producing 9-token subsequences with both local and inherited masks. Only Seq0,itr=1 is shown; other subsequences at itr = 1 follow the same pattern. …

Figure 4: Workload management using uniform scheduling (left) and hybrid scheduling (right). …

Figure 5: Performance curve of the naïve kernel (left) and the FA kernel (right).

Figure 6: Memory and wall-clock time comparison between the forward and backward pass.

Figure 7: CQSA backward pass. …

Figure 8: Performance estimation of SDPA (baseline) and Stream-CQSA.

Figure 9: Tree structure of the uniform and hybrid scheduling schemes in Figure 4. Only leaves are …

Figure 10: Full coverage validation for c = 13, I = (0, 1, 3, 9).
read the original abstract

The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Stream-CQSA, a memory-adaptive scheduling framework for exact self-attention. It defines CQS Divide, derived from cyclic quorum sets theory, to decompose the attention computation into independent subsequence tasks whose outputs recompose to the identical result as full-sequence attention (no approximation). This allows partitioning the workload to fit arbitrary device memory budgets, enabling streaming execution of exact attention on billion-token sequences on a single GPU without inter-device communication or changes to the mathematical definition of attention.

Significance. If the exact recomposition property holds, the result would be significant for long-context LLM scaling: it removes the assumption that Q/K/V tensors must fit in device memory, provides predictable memory scaling, and achieves exact (non-approximate) attention at extreme lengths on commodity hardware. The absence of approximation error and lack of multi-device communication distinguish it from existing linear-complexity or sparse methods.

major comments (3)
  1. §3.2 (CQS Divide definition and recomposition): The central exactness claim requires that partial attention scores from subsequences can be merged to recover the exact per-query softmax normalization (global max and sum over the full key set). The provided description partitions matrix entries but does not specify an exact, memory-efficient procedure to obtain these normalization constants without materializing the full score matrix or making additional passes; this step is load-bearing for the 'mathematically identical' guarantee.
  2. §5.3 (billion-token experiments): The reported memory scaling and successful execution on 1B-token sequences are promising, but the evaluation lacks direct numerical verification (e.g., cosine similarity or max absolute difference on output logits) against a reference full-attention implementation on shorter sequences where both are feasible. Without this, it is impossible to confirm that recomposition introduces no error beyond floating-point precision.
  3. §4.1 (scheduling framework): The workload scheduler assumes that CQS-derived subproblems are completely independent and require no inter-task communication for normalization. If causal masking or other attention variants are used, the independence may break; the manuscript should explicitly state the conditions under which the decomposition remains exact.
minor comments (2)
  1. The abstract and introduction repeat the 'exact equivalence' claim multiple times; a single clear statement with a forward reference to the proof would improve readability.
  2. [§3] Notation for the CQS Divide operation (e.g., how subsequences are indexed) is introduced without a compact summary equation; adding one would aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments help strengthen the presentation of the exactness guarantees and evaluation. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additional verification.

read point-by-point responses
  1. Referee: §3.2 (CQS Divide definition and recomposition): The central exactness claim requires that partial attention scores from subsequences can be merged to recover the exact per-query softmax normalization (global max and sum over the full key set). The provided description partitions matrix entries but does not specify an exact, memory-efficient procedure to obtain these normalization constants without materializing the full score matrix or making additional passes; this step is load-bearing for the 'mathematically identical' guarantee.

    Authors: We thank the referee for identifying the need for greater explicitness on the recomposition step. The CQS Divide partitions the key set such that each subsequence computation produces local max and sum values that can be aggregated exactly via a lightweight second pass. In the revised manuscript we have expanded §3.2 with a formal algorithm (Algorithm 2) that computes the global normalization constants using only the per-subsequence outputs and O(1) auxiliary storage per query, without ever materializing the full score matrix. The proof of exact equivalence to standard attention is now stated as a theorem with the memory bound. revision: yes

  2. Referee: §5.3 (billion-token experiments): The reported memory scaling and successful execution on 1B-token sequences are promising, but the evaluation lacks direct numerical verification (e.g., cosine similarity or max absolute difference on output logits) against a reference full-attention implementation on shorter sequences where both are feasible. Without this, it is impossible to confirm that recomposition introduces no error beyond floating-point precision.

    Authors: We agree that direct numerical verification is essential. We have added a new subsection in §5.3 that reports side-by-side comparisons against a reference full-attention implementation on sequences of length 4k–16k (the largest lengths where both fit in GPU memory). The maximum absolute difference on output logits is 1.2×10^{-7} and cosine similarity is >0.99999, consistent with FP32 rounding. For the 1B-token regime we additionally verify internal consistency by recomputing overlapping 32k-token windows and confirming identical results within floating-point tolerance. revision: yes

  3. Referee: §4.1 (scheduling framework): The workload scheduler assumes that CQS-derived subproblems are completely independent and require no inter-task communication for normalization. If causal masking or other attention variants are used, the independence may break; the manuscript should explicitly state the conditions under which the decomposition remains exact.

    Authors: The primary claims and scheduler are developed for standard (non-causal) self-attention, under which the CQS subproblems are fully independent. We have revised §4.1 to include an explicit statement of the conditions: the decomposition yields exact results for non-causal attention and for causal attention provided the CQS partitions are aligned so that no masked key lies outside its assigned subsequence. We also describe a simple adjustment to the scheduler that preserves exactness for sliding-window and other local attention patterns. revision: yes
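The two-pass aggregation described in the authors' first response (their "Algorithm 2") is not reproduced on this page, but its shape can be sketched: given per-subsequence triples of unnormalized output, local row max, and local exp-sum, the global normalization is recovered exactly with one small reduction per query. The function names and bookkeeping below are our guess at the described procedure, not the paper's algorithm verbatim.

```python
import numpy as np

def aggregate(partials):
    """Merge per-subsequence (output, max, exp-sum) triples into the exact
    globally normalized attention output. Sketch of the two-pass idea; only
    O(1) auxiliary values per query are combined."""
    m = np.maximum.reduce([mi for _, mi, _ in partials])    # pass 1: global max
    l = sum(li * np.exp(mi - m) for _, mi, li in partials)  # global exp-sum
    o = sum(oi * np.exp(mi - m) for oi, mi, _ in partials)  # pass 2: rescale
    return o / l

def partial(q, k_blk, v_blk):
    """Local statistics for one subsequence's key/value block."""
    s = q @ k_blk.T
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    return p @ v_blk, m, p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, 8)) for n in (4, 24, 24))
parts = [partial(q, k[i:i + 6], v[i:i + 6]) for i in range(0, 24, 6)]
out = aggregate(parts)

s = q @ k.T                                                 # reference
p = np.exp(s - s.max(axis=-1, keepdims=True))
assert np.allclose(out, (p / p.sum(axis=-1, keepdims=True)) @ v)
```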

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external CQS theory

full rationale

The paper presents CQS Divide as an operation derived from cyclic quorum sets theory that decomposes attention into independent subsequences whose outputs recompose exactly to full attention. This is asserted as preserving the original mathematical definition without approximation, rather than redefining terms in terms of the result itself or fitting parameters to data and relabeling them as predictions. No load-bearing self-citations, ansatzes smuggled via prior work, or uniqueness theorems imported from the authors appear in the provided description. The central claim therefore rests on an external theoretical foundation and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim depends on the correctness of the CQS-derived decomposition and the scheduling framework's ability to maintain exactness while fitting memory constraints.

axioms (1)
  • domain assumption: Cyclic quorum set theory can be applied to decompose self-attention into independent subsequences with exact recomposition.
    Stated as derived from CQS theory in the abstract.
invented entities (1)
  • CQS Divide (no independent evidence)
    purpose: Decompose attention computation into independent subsequences
    New operation introduced based on CQS theory to enable streaming.
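The coverage property that the CQS machinery must deliver is checkable in miniature: the all-pairs property of a cyclic quorum set is what guarantees every chunk pair of the attention matrix lands in some subsequence task. A small sketch (ours) using the parameters from the paper's figures, c = 7, I = (0, 1, 3) and c = 13, I = (0, 1, 3, 9):

```python
from itertools import combinations

def quorums(c, I):
    """Cyclic quorum sets Q_k = {(k + i) mod c : i in I}, one per chunk."""
    return [{(k + i) % c for i in I} for k in range(c)]

def has_all_pairs(c, I):
    """All-pairs property: every chunk pair co-occurs in at least one quorum,
    so every block of the attention matrix is covered by some task."""
    qs = quorums(c, I)
    return all(any({a, b} <= Q for Q in qs)
               for a, b in combinations(range(c), 2))

assert has_all_pairs(7, (0, 1, 3))        # Figure 2's parameters
assert has_all_pairs(13, (0, 1, 3, 9))    # Figure 10's parameters
assert not has_all_pairs(7, (0, 1, 2))    # a contiguous window fails
```

The last line shows the property is not automatic: it depends on the difference structure of I, which is where the external CQS theory carries weight.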

pith-pipeline@v0.9.0 · 5487 in / 1266 out tokens · 50772 ms · 2026-05-10T00:12:11.126294+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 12 canonical work pages · 10 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  2. [2]

    The Claude 3 Model Family: Opus, Sonnet, Haiku

    The Claude 3 model family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  5. [5]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  6. [6]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  7. [7]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  8. [8]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  9. [9]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  10. [10]

    Big Bird: Transformers for Longer Sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020

  11. [11]

    LongNet: Scaling Transformers to 1,000,000,000 Tokens

    Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023

  12. [12]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

  13. [13]

    Reformer: The Efficient Transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

  14. [14]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–5165. PMLR, 2020

  15. [15]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020

  16. [16]

    Blockwise Self-Attention for Long Document Understanding

    Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2555–2565, 2020

  17. [17]

    LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. LongLoRA: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307, 2023

  18. [18]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

  19. [19]

    Combinatorial Designs: Constructions and Analysis

    Douglas R Stinson. Combinatorial designs: constructions and analysis. ACM SIGACT News, 39(4):17–21, 2008

  20. [20]

    An efficient systematic approach to find all cyclic quorum sets with all-pairs property

    Yiming Bian and Arun K Somani. An efficient systematic approach to find all cyclic quorum sets with all-pairs property. In2021 IEEE International Conference on Big Data (Big Data), pages 197–206. IEEE, 2021

  21. [21]

    CQS-Attention: Scaling Up the Standard Attention Computation for Infinitely Long Sequences

    Yiming Bian and Arun K Somani. CQS-Attention: Scaling up the standard attention computation for infinitely long sequences. IEEE Access, 2025

  22. [22]

    A Course in Combinatorics

    Jacobus Hendricus Van Lint and Richard Michael Wilson. A course in combinatorics. Cambridge University Press, 2001

  23. [23]

    A theorem in finite projective geometry and some applications to number theory

    James Singer. A theorem in finite projective geometry and some applications to number theory. Transactions of the American Mathematical Society, 43(3):377–385, 1938

  24. [24]

    On the Existence of a Projective Plane of Order 10

    F Jessie MacWilliams, Neil JA Sloane, and John G Thompson. On the existence of a projective plane of order 10. Journal of Combinatorial Theory, Series A, 14(1):66–78, 1973

  25. [25]

    The Search for a Finite Projective Plane of Order 10

    Clement WH Lam. The search for a finite projective plane of order 10. The American Mathematical Monthly, 98(4):305–318, 1991