RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
Pith reviewed 2026-05-18 10:02 UTC · model grok-4.3
The pith
RACE Attention approximates softmax attention using sharpened angular similarity and soft LSH to achieve strictly linear time in sequence length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RACE Attention is a kernel-inspired method that approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH) after replacing the exponential kernel with a sharpened angular similarity, yielding a strictly linear-time attention layer that avoids construction of the full attention matrix.
What carries the argument
Repeated Arrays-of-Count Estimators (RACE) that combine Gaussian random projections with soft LSH to estimate attention scores using sharpened angular similarity.
If this is right
- RACE Attention matches or outperforms strong baselines on language modeling, masked language modeling, and text/image classification up to 64K sequence length.
- It reduces wall-clock time and memory usage compared to quadratic attention.
- It enables a single forward-backward pass on up to 12 million tokens on an NVIDIA GH200 GPU.
- It enables processing up to 75 million tokens on an Intel Xeon Gold 5220R CPU in a single pass.
Where Pith is reading between the lines
- This approach could allow training large language models on contexts far beyond current limits without specialized hardware.
- Integration with other linear attention methods might further improve efficiency or accuracy.
- Testing on multimodal tasks with even longer sequences could reveal additional benefits or limitations of the approximation.
Load-bearing premise
The sharpened angular similarity combined with Gaussian random projections and soft LSH produces attention outputs of sufficient quality to match or exceed softmax attention on downstream tasks without harmful approximation artifacts.
What would settle it
Demonstrating that on a controlled long-context task with sequence lengths exceeding 1 million tokens, the downstream performance of RACE Attention falls below that of softmax attention due to accumulated approximation errors.
Figures
read the original abstract
Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a single attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce Repeated Arrays-of-Count Estimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines up to 64K seqeuence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU in a single forward-backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware. We release our code at https://github.com/sahiljoshi515/RACE_Attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents RACE Attention, a new attention layer that achieves strictly linear time complexity in sequence length by replacing the softmax kernel with a sharpened angular similarity and approximating the attention computation using Gaussian random projections combined with soft Locality-Sensitive Hashing (LSH) in a Repeated Arrays-of-Count Estimators (RACE) framework. The authors report that this approach matches or exceeds the performance of standard attention on language modeling, masked LM, and classification tasks for sequences up to 64K tokens, while enabling single-layer forward-backward passes on up to 12M tokens on GPU and 75M on CPU.
Significance. Should the approximation errors prove to be well-controlled and not to degrade representation quality or training stability at extreme lengths, the result would be of high significance. It directly addresses the quadratic scaling barrier in transformer attention, potentially allowing models to train on contexts that are currently infeasible. The empirical scaling demonstrations and code release strengthen the practical contribution.
major comments (3)
- §3 (RACE Attention Mechanism): The description of the RACE estimator via Gaussian projections and soft LSH lacks any derivation or bound on the approximation error, bias, or variance as a function of sequence length L and the number of projections/hash functions. Without this, it is difficult to assess whether the estimator remains faithful to softmax attention at 10M+ token scales, which is load-bearing for the central claim of unaffected downstream training dynamics.
- §4 (Experiments): The controlled scaling study is limited to a single attention layer. It is unclear if approximation artifacts compound across the multiple layers of a full transformer stack, potentially affecting the validity of extrapolating these results to complete model training on outrageously large contexts.
- Abstract and §3: The method is described as 'theoretically grounded,' yet the text provides no formal analysis or proof of properties such as unbiasedness or convergence rates for the combined sharpened angular similarity and soft LSH estimator.
minor comments (3)
- Abstract: Typo: 'seqeuence length' should read 'sequence length'.
- Throughout: The number of free parameters (e.g., number of Gaussian projections and hash functions) should be explicitly listed and their sensitivity analyzed in the main text or appendix.
- Related Work: A more detailed comparison to prior LSH-based methods such as Reformer would help clarify the novelty of the sharpened angular similarity and repeated arrays-of-counts approach.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our work. We address the major comments in detail below and outline the revisions we plan to make to strengthen the manuscript.
read point-by-point responses
-
Referee: §3 (RACE Attention Mechanism): The description of the RACE estimator via Gaussian projections and soft LSH lacks any derivation or bound on the approximation error, bias, or variance as a function of sequence length L and the number of projections/hash functions. Without this, it is difficult to assess whether the estimator remains faithful to softmax attention at 10M+ token scales, which is load-bearing for the central claim of unaffected downstream training dynamics.
Authors: We thank the referee for highlighting this important aspect. While the RACE framework leverages well-studied Gaussian projections for estimating angular similarities and soft LSH for efficient bucketing, our manuscript indeed prioritizes the algorithmic presentation and empirical results over a full theoretical error analysis. We agree that providing bounds on bias and variance would better support the claims at extreme scales. In the revision, we will add a theoretical analysis subsection in §3 that derives the expected error using properties of random projections and discusses variance reduction with increasing hash functions. We will also include empirical plots of approximation error versus sequence length to demonstrate control at large L. revision: yes
-
Referee: §4 (Experiments): The controlled scaling study is limited to a single attention layer. It is unclear if approximation artifacts compound across the multiple layers of a full transformer stack, potentially affecting the validity of extrapolating these results to complete model training on outrageously large contexts.
Authors: The referee is correct that our scaling experiments isolate the attention mechanism in a single layer to clearly demonstrate the linear scaling and extreme context capabilities. We acknowledge the potential for error accumulation in deeper models. To address this, we will include additional experiments in the revised manuscript using a multi-layer transformer (e.g., 4-6 layers) on long-context tasks, showing that training dynamics remain stable and performance is maintained. We will also add a discussion on why the per-layer approximation errors do not compound destructively based on our observations. revision: yes
-
Referee: Abstract and §3: The method is described as 'theoretically grounded,' yet the text provides no formal analysis or proof of properties such as unbiasedness or convergence rates for the combined sharpened angular similarity and soft LSH estimator.
Authors: We used 'theoretically grounded' to indicate that the approach is built on established theoretical tools from random matrix theory and locality-sensitive hashing, which provide guarantees for individual components. However, we accept that this phrasing may imply more formal analysis than is currently present for the integrated estimator. In the revision, we will temper the language in the abstract and §3, replacing it with 'inspired by theoretical results on kernel approximation' or similar, and add references to supporting theory. We do not claim new proofs of unbiasedness or specific convergence rates in this work, as the focus is on the practical linear-time attention implementation. revision: partial
Circularity Check
No significant circularity detected in RACE Attention derivation
full rationale
The paper presents RACE Attention as a direct construction: it replaces the exponential kernel with sharpened angular similarity and approximates outputs using Gaussian random projections plus soft LSH, avoiding the full matrix. This is a new estimator built from standard random-projection and LSH primitives rather than any fitted parameter renamed as a prediction, self-definitional equation, or load-bearing self-citation chain. The scaling claims rest on the linear-time properties of the chosen approximations, which are independent of the target result and externally verifiable via the released code. No step reduces the claimed linear-time behavior or performance equivalence to an input by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of Gaussian projections and hash functions
axioms (1)
- domain assumption Sharpened angular similarity preserves the essential ranking and weighting properties needed for effective attention.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
sim(Qi,Kj) = (1 - cos⁻¹(QᵢᵀKⱼ / ‖Qi‖‖Kj‖) / π )^γ
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
SOCKET: SOft Collision Kernel EsTimator for Sparse Attention
SOCKET replaces hard LSH bucket matches with soft probabilistic collision aggregation to enable efficient, high-quality token selection for sparse attention, matching or exceeding prior methods with up to 1.5x through...
Reference graph
Works this paper leans on
-
[1]
On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks
A. Backurs, P. Indyk, and L. Schmidt. On the fine-grained complexity of empirical risk mini- mization: Kernel methods and neural networks.arXiv preprint arXiv:1704.02958, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[2]
Neural Machine Translation by Jointly Learning to Align and Translate
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[3]
Longformer: The Long-Document Transformer
I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
M. S. Charikar. Similarity estimation techniques from rounding algorithms. InProceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC’02), pages 380–388. ACM, 2002. doi: 10.1145/509907.509965
-
[5]
The Role of Downflows in Establishing Solar Near-Surface Shear
B. Chen and A. Shrivastava. Densified winner take all (wta) hashing for sparse datasets. InProceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018. arXiv:1810.00115
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, Ł. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. InInternational Conference on Learning Representations, 2021
work page 2021
-
[7]
K. M. Choromanski, M. Rowland, and A. Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems (NeurIPS) 30. Curran Associates, Inc., 2017. 12
work page 2017
-
[8]
B. Coleman and A. Shrivastava. Sub-linear RACE sketches for approximate kernel density es- timation on streaming data. InWWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020, pages 1739–1749. ACM / IW3C2, 2020. doi: 10.1145/3366423.3380244
-
[9]
B. Coleman, R. Baraniuk, and A. Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data. In H. Daumé III and A. Singh, editors,Proceedings of the 37th In- ternational Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 2089–2099. PMLR, 13–18 Jul 2020
work page 2089
-
[10]
Flashattention: Fastandmemory-efficientexact attention with io-awareness
T.Dao, D.Y.Fu, S.Ermon, A.Rudra, andC.Ré. Flashattention: Fastandmemory-efficientexact attention with io-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[11]
M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. InPro- ceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019. OpenReview.net
work page 2019
-
[12]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
-
[13]
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
R. Eldan and Y. Li. Tinystories: How small can language models be and still speak coherent english?arXiv preprint arXiv:2305.07759, 2023
work page internal anchor Pith review arXiv 2023
- [14]
-
[15]
A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autoregres- sive transformers with linear attention. InProceedings of the 37th International Conference on Machine Learning. PMLR, 2020
work page 2020
- [16]
-
[17]
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009
work page 2009
-
[18]
C. Luo and A. Shrivastava. Arrays of (locality-sensitive) count estimators (ace): Anomaly de- tection on the edge. InProceedings of the 2018 World Wide Web Conference, WWW ’18, pages 1439–1448, New York, NY, USA, 2018. ACM. doi: 10.1145/3178876.3186056
- [19]
-
[20]
A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vec- tors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142–150, 2011
work page 2011
-
[21]
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of english: The penn treebank.Computational Linguistics, 19(2):313–330, 1993
work page 1993
-
[22]
Regularizing and Optimizing LSTM Language Models
S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing lstm language models.arXiv preprint arXiv:1708.02182, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[23]
N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image trans- former. InProceedings of the 35th International Conference on Machine Learning (ICML), vol- 13 ume 80 ofProceedings of Machine Learning Research, pages 4052–4061, Stockholm, Sweden, 2018. PMLR
work page 2018
-
[24]
H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong. Random feature attention. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
- [25]
-
[26]
A. Rahimi and B. Recht. Random features for large-scale kernel machines. InAdvances in Neural Information Processing Systems 20 (NeurIPS 2007), pages 1177–1184, Vancouver, British Columbia, Canada, 2007. Curran Associates, Inc
work page 2007
-
[27]
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392, 2016
work page 2016
-
[28]
R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642, 2013
work page 2013
-
[29]
Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D.Metzler. Longrangearena: Abenchmarkforefficienttransformers. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
-
[30]
J. A. Tropp. An introduction to matrix concentration inequalities.Foundations and Trends®in Machine Learning, 8(1-2):1–230, 2015
work page 2015
-
[31]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, vol- ume 30, 2017
work page 2017
-
[32]
S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[33]
H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.arXiv preprint arXiv:1708.07747, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [34]
- [35]
- [36]
-
[37]
X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. InAdvances in Neural Information Processing Systems, volume 28, 2015. 14 Appendix Table 13:Experiment Setup and Hyperparameters Dataset Task Hyperparameters PTB Language ModelingN=128; layers=1; heads=2;d=128; batch=16; lr=6e −4;β’s=(0.9, 0.999);ϵ=1e−8; wd=0.1...
work page 2015
-
[38]
ˆS(ℓ)= Φ (ℓ) Q ( Φ (ℓ) K )⊤
-
[39]
Each row ofΦ (ℓ) Q andΦ (ℓ) K is a probability vector; hence∥Φ(ℓ) Q∥F,∥Φ(ℓ) K∥F≤ √ N
-
[40]
16 Proof.Eachϕ (ℓ)(x)is a softmax overR= 2 P corners, so entries are nonnegative and sum to1
Consequently∥ˆS(ℓ)∥F≤N. 16 Proof.Eachϕ (ℓ)(x)is a softmax overR= 2 P corners, so entries are nonnegative and sum to1. Item (1) is by definition of ˆS(ℓ) ij . For (2), every rowpsatisfies∥p∥2≤∥p∥1 = 1, hence∥Φ(ℓ) Q∥2 F = ∑ i∥ϕ(ℓ)(Qi)∥2 2≤N(and similarly forΦ(ℓ) K ). Item (3) follows from∥AB∥F≤∥A∥F∥B∥F. Having controlled the feature-induced matrix norms, we...
-
[41]
Proof.(1)By definition,X (ℓ)= ˆS(ℓ)−E[ˆS(ℓ)], henceE[X (ℓ)] =E[ ˆS(ℓ)]−E[ˆS(ℓ)] = 0
With v:= max L∑ ℓ=1 E [( 1 LX (ℓ) )( 1 LX (ℓ) )⊤] 2 , L∑ ℓ=1 E [( 1 LX (ℓ) )⊤( 1 LX (ℓ) )] 2 , we havev≤4N2/L. Proof.(1)By definition,X (ℓ)= ˆS(ℓ)−E[ˆS(ℓ)], henceE[X (ℓ)] =E[ ˆS(ℓ)]−E[ˆS(ℓ)] = 0. (2)By Lemma 3(3) we have∥ˆS(ℓ)∥2≤∥ˆS(ℓ)∥F≤N. By convexity of the spectral norm, ∥E[ˆS(ℓ)]∥2≤E [ ∥ˆS(ℓ)∥2 ] ≤N. Therefore, by the...
-
[42]
If∥E∥2≤smin/2, then∥ˆD−1∥2≤2/smin. 21 Proof.(1) Row-sum control.Since ˆS=S+ ∆and ˆD= diag( ˆS1), we rewrite E:= ˆD−D= diag ( ( ˆS−S)1 ) = diag(∆1). Hence each diagonal entry isEii = (∆1) i = ∑N j=1 ∆ ij, so ∥E∥2 = max i |Eii|= max i ⏐⏐(∆1) i ⏐⏐ ≤max i N∑ j=1 |∆ij|=∥∆∥∞. For the second inequality, by Cauchy–Schwarz on each rowi, N∑ j=1 |∆ij| ≤ √ N ( N∑ j=1...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.