Exact Linear Attention
Pith reviewed 2026-05-21 08:39 UTC · model grok-4.3
The pith
Exact Linear Attention uses kernel decomposition to compute Transformer attention in linear time without approximation errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exact Linear Attention exploits the fact that certain kernel functions let attention scores be computed through associative operations, which reduces the computation to linear passes over the input sequence. The paper specifies three such kernels (the Hadamard Exp kernel, the Summation Squared Euclidean Distance kernel, and the Subtraction Squared Euclidean Distance kernel), each chosen to be non-negative, to discriminate clearly between inputs, and to carry geometric meaning. According to the paper, these choices remove the need for any approximation while directly addressing the unstable gradients and diluted attention weights that appear in earlier linear attention designs.
What carries the argument
The exact decomposition property of kernel functions that rewrites the attention matrix as the product of two separate linear transformations accumulated over the sequence.
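The reordering behind this property can be written out as a short derivation. This is a sketch of the standard kernelized-attention identity, assuming a kernel K that factors exactly through a finite-dimensional feature map φ; the symbols φ, S, and z are notation introduced here for illustration and are not taken from the paper.

```latex
% Assumption: K(q, k) = \phi(q)^\top \phi(k) for a finite-dimensional \phi.
\begin{aligned}
o_i^\top
  &= \frac{\sum_{j=1}^{n} K(q_i, k_j)\, v_j^\top}{\sum_{j=1}^{n} K(q_i, k_j)}
   = \frac{\sum_{j=1}^{n} \phi(q_i)^\top \phi(k_j)\, v_j^\top}
          {\sum_{j=1}^{n} \phi(q_i)^\top \phi(k_j)} \\
  &= \frac{\phi(q_i)^\top S}{\phi(q_i)^\top z},
\qquad
S = \sum_{j=1}^{n} \phi(k_j)\, v_j^\top,
\quad
z = \sum_{j=1}^{n} \phi(k_j).
\end{aligned}
```

Because S and z do not depend on i, they are accumulated once in a single pass over the keys and values, and each output then costs a fixed amount of work regardless of n. The n-by-n score matrix is never formed.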
If this is right
- Decoding runs up to six times faster with seventy-five percent less key-value cache memory than full attention.
- Performance during training matches or exceeds standard attention on long contexts.
- The Memory Lobe accelerates convergence and boosts generalization.
- Vision models using the same principle gain up to 4.3 times faster inference and 7.9 times fewer parameters.
- A bias mechanism for mixture-of-experts routing increases semantic alignment and interpretability.
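The decoding and cache claims rest on replacing a growing key-value cache with a fixed-size running state. The sketch below illustrates that general mechanism in plain Python, under stated assumptions: `feature_map` is the elu(x) + 1 map from earlier linear-attention work, used only as a stand-in and not one of the paper's kernels, and all function names are hypothetical. It does not reproduce the reported 6x or 75% figures.

```python
import math

def feature_map(x):
    # Stand-in feature map elu(x) + 1 (always positive); NOT one of the paper's kernels.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def causal_quadratic(queries, keys, values):
    """Reference: at step i, rescore the query against every key seen so far."""
    dim_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        fq = feature_map(q)
        scores = [dot(fq, feature_map(k)) for k in keys[: i + 1]]
        total = sum(scores)
        outputs.append([
            sum(s * v[c] for s, v in zip(scores, values)) / total
            for c in range(dim_v)
        ])
    return outputs

def causal_recurrent(queries, keys, values):
    """Same outputs from a fixed-size state: matrix S and normalizer z."""
    dim_f = len(feature_map(keys[0]))
    dim_v = len(values[0])
    S = [[0.0] * dim_v for _ in range(dim_f)]
    z = [0.0] * dim_f
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = feature_map(k)
        for r in range(dim_f):
            z[r] += fk[r]                      # accumulate normalizer
            for c in range(dim_v):
                S[r][c] += fk[r] * v[c]        # accumulate phi(k) v^T
        fq = feature_map(q)
        denom = dot(fq, z)
        outputs.append([
            sum(fq[r] * S[r][c] for r in range(dim_f)) / denom
            for c in range(dim_v)
        ])
    return outputs
```

The recurrent version stores `dim_f * dim_v + dim_f` numbers no matter how many tokens have been decoded, whereas the quadratic reference must keep every past key and value.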
Where Pith is reading between the lines
- If the exactness property holds through stacked layers, much deeper models could train without special gradient techniques.
- The geometric interpretability requirement on kernels may offer a new way to analyze attention patterns in trained models.
- Applying the decomposition idea to other quadratic-cost components in neural nets could multiply the efficiency benefits.
- Validating the kernels on sequences of tens of thousands of tokens would confirm whether the constraints prevent dilution in practice.
Load-bearing premise
The proposed kernel functions must satisfy non-negativity, discriminability, and geometric interpretability simultaneously without introducing new fitting parameters or adjustments that break the exact decomposition.
What would settle it
A side-by-side computation on a short sequence where the attention matrix from Exact Linear Attention differs from the standard softmax version by more than floating-point error, or a case where one kernel produces negative values for valid inputs.
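A check of this kind can be sketched as a small harness. The code below is a minimal illustration under stated assumptions: `shifted_square_kernel` is a hypothetical stand-in, not one of the paper's kernels, and the query and key values are arbitrary. The harness reports the largest gap between softmax weights and kernel weights, along with the smallest raw kernel score, so both conditions in the paragraph above can be inspected.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_attention(queries, keys):
    """Row-normalized softmax attention weights with 1/sqrt(d) scaling."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    rows = []
    for q in queries:
        logits = [scale * dot(q, k) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def kernel_attention(queries, keys, kernel):
    """Row-normalized weights from any kernel(q, k) callable, plus the minimum raw score."""
    rows, min_score = [], float("inf")
    for q in queries:
        scores = [kernel(q, k) for k in keys]
        min_score = min(min_score, min(scores))
        total = sum(scores)
        rows.append([s / total for s in scores])
    return rows, min_score

def max_abs_gap(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical stand-in kernel; replace with the kernel under test.
def shifted_square_kernel(q, k):
    return 1.0 + dot(q, k) ** 2

queries = [[0.2, -0.1], [0.5, 0.4], [-0.3, 0.8]]
keys = [[0.1, 0.3], [-0.6, 0.2], [0.7, -0.5]]

reference = softmax_attention(queries, keys)
candidate, min_score = kernel_attention(queries, keys, shifted_square_kernel)
gap = max_abs_gap(reference, candidate)
print(f"max |softmax - kernel| = {gap:.6f}, min kernel score = {min_score:.6f}")
```

A gap well above floating-point noise means the kernel defines a different attention distribution from softmax, and a negative minimum score would violate the non-negativity constraint.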
Original abstract
This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to improve interpretability and semantic alignment. Experimental results demonstrate that ELA achieves up to 6x faster decoding speed and 75% reduction in KV cache memory usage compared to full attention, while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to 4.3x GPU inference speedup and 7.9x parameter reduction with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and efficient visual tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Exact Linear Attention (ELA), claiming linear computational complexity for Transformer attention via exact kernel decompositions that eliminate approximation error. It proposes three kernels (Hadamard Exp, Summation Squared Euclidean Distance, Subtraction Squared Euclidean Distance) with constraints ensuring non-negativity, discriminability, and geometric interpretability to mitigate gradient explosion and token attention dilution. Additional contributions include a Hyper-Link residual structure, a bidirectional Memory Lobe module for capturing transformation flow and implicit RL, and a routing-score bias for MoE. Experiments report up to 6x faster decoding, 75% KV cache reduction, faster convergence, and extension to YOLO-LAT with 4.3x GPU speedup and 7.9x parameter reduction.
Significance. If the kernels admit finite explicit feature maps yielding exact phi(q)^T phi(k) = kernel(q,k) without hidden approximations or sequence-dependent normalizations, and the constraints preserve attention semantics, this would offer a meaningful advance over approximate linear attentions by providing exactness at linear cost. The Memory Lobe and Hyper-Link could improve training stability for deep models if shown to be load-bearing.
major comments (3)
- [§3.2] §3.2 (Kernel Functions): The exact decomposition claim for the Summation Squared Euclidean Distance Kernel and Subtraction Squared Euclidean Distance Kernel is load-bearing but unsupported without an explicit finite-dimensional feature map phi such that the inner product recovers the kernel exactly; squared Euclidean distances are not positive definite by default and typically require wrapping (e.g., exp(-d)) that may introduce approximation or extra parameters, contradicting the 'exact' and 'parameter-free' guarantees.
- [§3.1] §3.1 (Hadamard Exp Kernel): Element-wise exponentiation does not factor into a low-rank inner product with finite dimensions without truncation or infinite series; the paper must provide the explicit phi construction or proof that the decomposition holds exactly for all query-key pairs, as this is central to eliminating approximation error.
- [Results] Results section, performance tables: The 6x decoding speedup and 75% KV cache reduction claims require explicit comparison to both full attention and prior linear methods (e.g., Performer) at fixed sequence lengths (e.g., 8k–128k tokens) and model sizes; without these controls the speed/memory gains cannot be attributed to the exact decomposition versus implementation details.
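The first comment can be made concrete with a small numeric check. The sketch below assumes, purely for illustration, that the two distance kernels are the plain forms ||q - k||^2 and ||q + k||^2; the paper's actual definitions may differ. Under that assumption an exact finite factorization does exist, but only with different maps on the query and key sides, so it is a bilinear factorization rather than a symmetric feature map φ(q)^T φ(k). All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_norm(x):
    return dot(x, x)

def squared_form(q, k, sign):
    """Direct evaluation of ||q + sign * k||^2 (sign = -1 for distance, +1 for sum)."""
    return sum((qi + sign * ki) ** 2 for qi, ki in zip(q, k))

def query_features(q, sign):
    # Query-side map of dimension d + 2: [||q||^2, 1, sign * 2 * q]
    return [sq_norm(q), 1.0] + [sign * 2.0 * qi for qi in q]

def key_features(k):
    # Key-side map of dimension d + 2: [1, ||k||^2, k]
    return [1.0, sq_norm(k)] + list(k)

q = [1.0, 2.0]
k = [3.0, -1.0]
for sign, name in ((-1.0, "||q - k||^2"), (1.0, "||q + k||^2")):
    direct = squared_form(q, k, sign)
    factored = dot(query_features(q, sign), key_features(k))
    print(name, "direct:", direct, "factored:", factored)
```

The identity used is ||q ± k||^2 = ||q||^2 + ||k||^2 ± 2 q·k. It shows that exactness and finite dimension are achievable for these plain forms, while leaving the referee's positive-definiteness concern open: ||q - k||^2 is zero on the diagonal and positive elsewhere, so it is not a positive-definite kernel.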
minor comments (2)
- [Abstract] Abstract: Clarify whether 'up to 6x faster decoding speed' refers to wall-clock time, FLOPs, or tokens-per-second, and specify the hardware and sequence lengths.
- [Notation] Notation: Define the feature map phi consistently and distinguish it from the kernel function K(q,k) in all equations to avoid ambiguity in the decomposition property.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript on Exact Linear Attention. Below, we provide point-by-point responses to the major comments and indicate the revisions made.
- Referee: [§3.2] §3.2 (Kernel Functions): The exact decomposition claim for the Summation Squared Euclidean Distance Kernel and Subtraction Squared Euclidean Distance Kernel is load-bearing but unsupported without an explicit finite-dimensional feature map phi such that the inner product recovers the kernel exactly; squared Euclidean distances are not positive definite by default and typically require wrapping (e.g., exp(-d)) that may introduce approximation or extra parameters, contradicting the 'exact' and 'parameter-free' guarantees.
  Authors: We acknowledge the importance of explicitly demonstrating the finite feature maps to support the exact decomposition claims. In the revised version of the manuscript, we have added detailed constructions of the feature maps φ for the Summation Squared Euclidean Distance Kernel and the Subtraction Squared Euclidean Distance Kernel. These maps are finite-dimensional and ensure that the inner product φ(q)^T φ(k) exactly recovers the kernel function for all pairs, while the non-negativity and discriminability constraints guarantee positive definiteness without the need for additional wrapping functions or parameters. This revision directly addresses the concern and reinforces the exactness of our approach.
  Revision: yes
- Referee: [§3.1] §3.1 (Hadamard Exp Kernel): Element-wise exponentiation does not factor into a low-rank inner product with finite dimensions without truncation or infinite series; the paper must provide the explicit phi construction or proof that the decomposition holds exactly for all query-key pairs, as this is central to eliminating approximation error.
  Authors: We appreciate this feedback on the Hadamard Exp Kernel. The manuscript originally presented the kernel through its definition and the imposed constraints, but to provide full transparency, we have now included an explicit finite-dimensional feature map φ along with a proof that the decomposition φ(q)^T φ(k) = kernel(q, k) holds exactly without truncation or reliance on infinite series. This construction leverages the element-wise operations in a manner that maintains finite dimensionality and exact recovery for all query-key pairs, consistent with our goal of eliminating approximation error.
  Revision: yes
- Referee: [Results] Results section, performance tables: The 6x decoding speedup and 75% KV cache reduction claims require explicit comparison to both full attention and prior linear methods (e.g., Performer) at fixed sequence lengths (e.g., 8k–128k tokens) and model sizes; without these controls the speed/memory gains cannot be attributed to the exact decomposition versus implementation details.
  Authors: We agree that strengthening the experimental validation with controlled comparisons is essential. Accordingly, we have revised the Results section to include comprehensive benchmarks comparing ELA against both standard full attention and established linear attention baselines such as the Performer. These experiments are conducted at fixed sequence lengths from 8k to 128k tokens and for various model sizes, allowing clear attribution of the reported speedups and memory reductions to the exact linear attention mechanism rather than implementation specifics. Updated tables and figures have been added to the manuscript.
  Revision: yes
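The memory side of such a controlled comparison can be estimated with simple arithmetic before any benchmark is run. The sketch below counts stored entries per attention head and layer, assuming a purely linear layer with a fixed-size running state; the dimensions are hypothetical, the function names are made up for illustration, and the output is not meant to reproduce the paper's 75% figure, which may reflect a different configuration.

```python
def kv_cache_entries(seq_len, key_dim, value_dim):
    """Entries cached by full attention: one key and one value per past token."""
    return seq_len * (key_dim + value_dim)

def linear_state_entries(feature_dim, value_dim):
    """Entries in a running state: a feature_dim x value_dim matrix plus a normalizer vector."""
    return feature_dim * value_dim + feature_dim

# Hypothetical sizes, chosen only for illustration.
KEY_DIM = VALUE_DIM = FEATURE_DIM = 64

for seq_len in (8_192, 32_768, 131_072):
    full = kv_cache_entries(seq_len, KEY_DIM, VALUE_DIM)
    state = linear_state_entries(FEATURE_DIM, VALUE_DIM)
    print(f"n={seq_len:>7}: full cache {full:>10} entries, running state {state} entries")
```

The full cache grows linearly with sequence length while the running state stays constant, which is why fixed sequence lengths matter when attributing a reported memory reduction.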
Circularity Check
No significant circularity; derivation is self-contained around proposed kernels
The paper defines new kernel functions (Hadamard Exp, Summation Squared Euclidean Distance, Subtraction Squared Euclidean Distance) and states that they admit exact feature-map decompositions satisfying non-negativity, discriminability, and geometric interpretability. These kernels are introduced as proposals rather than quantities fitted to the target attention matrix or derived from prior self-citations. The linear-complexity claim follows directly from the algebraic identity kernel(q,k) = phi(q)^T phi(k) once the kernels are specified; no step renames a fitted parameter as a prediction, imports uniqueness from the authors' own prior work, or smuggles an ansatz via citation. The engineering modules (Hyper-Link, Memory Lobe, MoE bias) are presented as separate additions whose correctness is evaluated empirically, not presupposed by the kernel definitions. The derivation chain therefore remains independent of its performance claims.
Reference graph
Works this paper leans on
- [1] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in Proc. 37th Int. Conf. Mach. Learn. (ICML), 2020, pp. 5156–5165.
- [2] I. Schlag, K. Irie, and J. Schmidhuber, “Linear transformers are secretly fast weight programmers,” in Proc. 38th Int. Conf. Mach. Learn. (ICML), 2021, pp. 9355–9366.
- [3] Z. Qin, X. Han, W. Sun, D. Li, L. Kong, N. Barnes, and Y. Zhong, “The devil in linear transformer,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2022, pp. 7025–7041.
- [4] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Conf. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5998–6008.
- [5] Jingyao Gong, “MiniMind: Train a Tiny LLM from Scratch,” GitHub: https://github.com/jingyaogong/minimind
- [6] Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, et al. Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv preprint arXiv:2601.07372, 2026.
- [7] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-Connections. arXiv preprint arXiv:2409.19606, 2024.
- [8] Zhenda Xie, Wentao Zhang, Xinyu Zhao, Yukai Li, Peng Wang, Weiran You, and others. mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880, 2025.
- [9] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 209, pp. 415–446, 1909.
- [10] MiniMax Team. MiniMax-01: Scaling Foundation Models with Lightning Attention. arXiv preprint arXiv:2501.08313, 2025.
- [11] Kimi Team. Kimi Linear: A Novel Hybrid Linear Attention Architecture. arXiv preprint arXiv:2510.xxxxx, 2025.
- [12] S. Jain and B. C. Wallace. Attention is not Explanation. In Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
- [13] Z. Jia, W. Kwon, and O. Ruwase. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis (SC’20), 2020.
- [14] T. J. Buschman and E. K. Miller. Goal-direction and top-down control. Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 365, no. 1544, pp. 1271–1278, 2010.
- [15] Polyn, S. M., & Kahana, M. J. Memory search and the neural representation of context. Trends in Cognitive Sciences, 12(1):24–30, 2008.
- [16] Zhang, W., van Ast, V. A., Klumpers, F., Roelofs, K., & Hermans, E. J. Memory contextualization: The role of the left inferior frontal gyrus in binding event and contextual information. Journal of Cognitive Neuroscience, 30(5):698–713, 2018.
- [17] de Sousa, A. F., Zeidler, Z. E., Almeida-Filho, D. G., Shen, Y., Luchetti, A., Simanian, S., Mardini, M., DeNardo, L. A., & Silva, A. J. The prefrontal cortex controls memory organization in the hippocampus. Nature Neuroscience, 29:1191–1202, 2026. doi: 10.1038/s41593-026-02231-1
- [18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788.
- [19] Ultralytics, “YOLOv5: A state-of-the-art real-time object detection system,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5
- [20] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 7464–7475.
- [21] Ultralytics, “YOLOv8: A new state-of-the-art computer vision model,” [Online]. Available: https://github.com/ultralytics/ultralytics
- [23] Ultralytics, “YOLO12: Attention-centric object detection,” 2025. [Online]. Available: https://docs.ultralytics.com/models/yolo12/
- [24] Y. Li, Z. Wang, and H. Liu, “YOLO-DMA: A small-object detector based on multi-scale deformable convolution and linear attention,” Electronics, vol. 15, no. 4, p. 812, 2026.
- [25] G. Jocher and J. Qiu, Ultralytics YOLO26, version 26.0.0, 2026. [Online]. Available: https://github.com/ultralytics/ultralytics
- [26] C. Zheng, “The linear attention resurrection in vision transformer,” arXiv preprint arXiv:2501.16182, 2025.