RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

2025 · cs.LG · arXiv 2510.04008

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a single attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce Repeated Arrays-of-Count Estimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines up to 64K sequence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU in a single forward-backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware. We release our code at https://github.com/sahiljoshi515/RACE_Attention.
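The abstract's description (Gaussian random projections plus soft LSH, with no full attention matrix) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `soft_buckets` and `soft_lsh_attention`, the parameters `n_tables`, `n_planes`, and `beta`, and the choice of a softmax over hypercube corners as the soft bucket assignment are all assumptions made for illustration. It is non-causal and only shows how per-bucket aggregation keeps the cost linear in sequence length.

```python
import numpy as np

def soft_buckets(X, W, corners, beta):
    """Soft assignment of each row of X to 2**P buckets (assumed scheme)."""
    # Normalize rows so that projections depend only on direction (angular similarity).
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Z = (X / norms) @ W                      # (N, P) Gaussian random projections
    logits = beta * (Z @ corners.T)          # (N, R); larger beta = sharper assignment
    logits -= logits.max(axis=1, keepdims=True)
    E = np.exp(logits)
    return E / E.sum(axis=1, keepdims=True)  # each row sums to 1

def soft_lsh_attention(Q, K, V, n_tables=4, n_planes=3, beta=5.0, seed=0):
    """Approximate attention in O(N) time; never builds an N x N matrix."""
    rng = np.random.default_rng(seed)
    d = Q.shape[1]
    # Corners of the hypercube {-1, +1}^P, one per bucket: shape (R, P), R = 2**P.
    corners = np.array([[1.0 if (r >> p) & 1 else -1.0 for p in range(n_planes)]
                        for r in range(2 ** n_planes)])
    out = np.zeros((Q.shape[0], V.shape[1]))
    for _ in range(n_tables):
        W = rng.standard_normal((d, n_planes))   # shared by queries and keys
        phi_q = soft_buckets(Q, W, corners, beta)
        phi_k = soft_buckets(K, W, corners, beta)
        bucket_vals = phi_k.T @ V                # (R, d_v): value mass per bucket
        bucket_mass = phi_k.sum(axis=0)          # (R,): key mass per bucket
        num = phi_q @ bucket_vals                # (N, d_v)
        den = phi_q @ bucket_mass                # (N,), strictly positive
        out += num / den[:, None]
    return out / n_tables                        # average over independent tables

rng = np.random.default_rng(1)
Q = rng.standard_normal((64, 8))
K = rng.standard_normal((64, 8))
V = rng.standard_normal((64, 5))
out = soft_lsh_attention(Q, K, V)
print(out.shape)
```

Under these assumptions each output row is a convex combination of the value rows, as in Softmax Attention, but the work per table is proportional to N times the number of buckets rather than N squared.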

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

cs.LG · 2026-02-06 · unverdicted · novelty 5.0

SOCKET replaces hard LSH bucket matches with soft probabilistic collision aggregation to enable efficient, high-quality token selection for sparse attention, matching or exceeding prior methods with up to 1.5x throughput gains.

citing papers explorer

Showing 2 of 2 citing papers.

Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 56 · internal anchor
SOCKET: SOft Collision Kernel EsTimator for Sparse Attention cs.LG · 2026-02-06 · unverdicted · none · ref 25 · internal anchor
