
arxiv: 2605.18848 · v2 · pith:HWYAZBIKnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Exact Linear Attention

Pith reviewed 2026-05-21 08:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords exact linear attention · kernel functions · linear complexity · transformer attention · long sequence modeling · efficient transformers · memory lobe · hyper-link structure

The pith

Exact Linear Attention uses kernel decomposition to compute Transformer attention in linear time without approximation errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace the quadratic cost of attention in Transformers with a linear alternative that remains mathematically exact. It does so by identifying kernel functions that can be factored into operations performed separately on each token and then combined. Constraints on these kernels are introduced to stop gradients from exploding during training and to keep attention focused rather than spread too thin. Supporting components such as an alternative residual connection and a bidirectional memory module are added to improve stability and information flow across layers. Success here would mean Transformers could handle sequences thousands of tokens long at far lower cost while retaining their core capabilities.

Core claim

Exact Linear Attention works by exploiting the property that certain kernel functions allow the attention scores to be computed through associative operations that reduce to linear passes over the input sequence. The paper specifies three such kernels—the Hadamard exponential, summation squared Euclidean distance, and subtraction squared Euclidean distance—each chosen to be non-negative, to distinguish inputs clearly, and to carry geometric meaning. These choices remove the need for any approximation while directly tackling the problems of unstable gradients and diluted attention weights that appear in earlier linear attention designs.
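As an illustration of what "factored exactly" can mean for these kernels, here is a minimal sketch. The kernel forms are assumptions read off the panel labels in Figures 4 and 5 (‖A_i + B_j‖² and exp(A_i)exp(B_j)) and off the kernel names; the feature maps `phi_sum`, `phi_sub`, and `psi` are hypothetical helpers based on expanding the square, not the paper's construction.

```python
import numpy as np

def phi_sum(q):
    # Query-side features for ||q + k||^2 = ||q||^2 + 2 q.k + ||k||^2
    return np.concatenate(([q @ q], 2.0 * q, [1.0]))

def phi_sub(q):
    # Query-side features for ||q - k||^2 = ||q||^2 - 2 q.k + ||k||^2
    return np.concatenate(([q @ q], -2.0 * q, [1.0]))

def psi(k):
    # Key-side features shared by both squared-distance kernels
    return np.concatenate(([1.0], k, [k @ k]))

def hadamard_exp(q, k):
    # Assumed form: sum_d exp(q_d) * exp(k_d), already an inner product
    return np.exp(q) @ np.exp(k)

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

sum_direct = np.sum((q + k) ** 2)      # kernel evaluated directly
sub_direct = np.sum((q - k) ** 2)
sum_factored = phi_sum(q) @ psi(k)     # same value via (d + 2)-dim features
sub_factored = phi_sub(q) @ psi(k)
```

Under these assumptions the feature dimension is d + 2 for the two distance kernels and d for the exponential one, and all three values are non-negative by construction. Note that the query and key maps differ, so this is a factorization rather than a symmetric positive-definite feature map.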

What carries the argument

The exact decomposition property of kernel functions that rewrites the attention matrix as the product of two separate linear transformations accumulated over the sequence.
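The reordering this relies on can be checked numerically. The sketch below assumes an element-wise exponential feature map (one reading of the Hadamard Exp kernel, not confirmed by the text here) and compares the quadratic computation, which builds the full n × n score matrix, against the linear one, which only accumulates two summaries over the keys. The function names are illustrative.

```python
import numpy as np

def feature_map(x):
    # Assumed Hadamard-exp features: element-wise exp keeps every score positive
    return np.exp(x)

def quadratic_attention(Q, K, V):
    scores = feature_map(Q) @ feature_map(K).T           # (n, n) score matrix
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    fq, fk = feature_map(Q), feature_map(K)
    kv = fk.T @ V                  # (d, d_v) summary, accumulated over keys
    z = fk.sum(axis=0)             # (d,) normalizer, accumulated over keys
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
max_gap = np.max(np.abs(quadratic_attention(Q, K, V) - linear_attention(Q, K, V)))
print(f"max difference between the two orderings: {max_gap:.2e}")
```

The two functions evaluate the same expression with the matrix products grouped differently, so any difference is floating-point rounding. The linear version never materializes an n × n array.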

If this is right

  • Decoding runs up to six times faster with seventy-five percent less key-value cache memory than full attention.
  • Performance during training matches or exceeds standard attention on long contexts.
  • The Memory Lobe accelerates convergence and boosts generalization.
  • Vision models using the same principle gain up to 4.3 times faster inference and 7.9 times fewer parameters.
  • A bias mechanism for mixture-of-experts routing increases semantic alignment and interpretability.
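The decoding claim in the first bullet rests on replacing a growing key-value cache with a fixed-size running state. A minimal sketch of that idea follows, assuming the summation kernel ‖q + k‖² and the hypothetical feature maps `phi` and `psi`; it does not reproduce the paper's reported speed or memory figures.

```python
import numpy as np

def phi(q):
    # Query features so that phi(q) @ psi(k) == ||q + k||^2 (assumed kernel)
    return np.concatenate(([q @ q], 2.0 * q, [1.0]))

def psi(k):
    return np.concatenate(([1.0], k, [k @ k]))

def decode_recurrent(Q, K, V):
    d, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d + 2, d_v))     # running sum of psi(k_j) v_j^T
    z = np.zeros(d + 2)            # running sum of psi(k_j)
    out = np.empty_like(V)
    for t in range(len(Q)):
        S += np.outer(psi(K[t]), V[t])
        z += psi(K[t])
        f = phi(Q[t])
        out[t] = (f @ S) / (f @ z)   # attends to positions 0..t only
    return out

def decode_with_cache(Q, K, V):
    # Reference: causal attention that rereads all cached keys and values
    out = np.empty_like(V)
    for t in range(len(Q)):
        w = np.sum((Q[t] + K[: t + 1]) ** 2, axis=1)
        out[t] = (w / w.sum()) @ V[: t + 1]
    return out

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(32, 4)), rng.normal(size=(32, 4))
V = rng.normal(size=(32, 6))
same = np.allclose(decode_recurrent(Q, K, V), decode_with_cache(Q, K, V))
```

In this sketch the state `(S, z)` has (d + 2)(d_v + 1) entries regardless of how many tokens have been decoded, while the cached version reads t + 1 keys and values at step t.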

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the exactness property holds through stacked layers, much deeper models could train without special gradient techniques.
  • The geometric interpretability requirement on kernels may offer a new way to analyze attention patterns in trained models.
  • Applying the decomposition idea to other quadratic-cost components in neural nets could multiply the efficiency benefits.
  • Validating the kernels on sequences of tens of thousands of tokens would confirm whether the constraints prevent dilution in practice.

Load-bearing premise

The proposed kernel functions must satisfy non-negativity, discriminability, and geometric interpretability simultaneously without introducing new fitting parameters or adjustments that break the exact decomposition.

What would settle it

A side-by-side computation on a short sequence where the attention matrix from Exact Linear Attention differs from the standard softmax version by more than floating-point error, or a case where one kernel produces negative values for valid inputs.
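One way such a check could be set up is sketched below. It assumes the summation kernel ‖q_i + k_j‖² with per-row normalization, which is a guess at the paper's formulation; whether "exact" is meant relative to softmax attention or relative to the kernel's own definition is not settled by the text above, and the sketch only measures the gap without interpreting it.

```python
import numpy as np

def softmax_attention_matrix(Q, K):
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def kernel_attention_matrix(Q, K):
    # Assumed summation kernel ||q_i + k_j||^2, normalized per query row
    scores = ((Q[:, None, :] + K[None, :, :]) ** 2).sum(axis=-1)
    return scores, scores / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))

scores, A_kernel = kernel_attention_matrix(Q, K)
A_softmax = softmax_attention_matrix(Q, K)

gap = np.max(np.abs(A_kernel - A_softmax))    # distance from softmax attention
has_negative = bool((scores < 0).any())       # non-negativity of raw scores
print(f"max |kernel - softmax| = {gap:.3e}, negative scores: {has_negative}")
```

A squared norm cannot be negative, so the second test can only fail for a kernel of a different form than the one assumed here.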

Figures

Figures reproduced from arXiv: 2605.18848 by Weinuo Ou.

Figure 1. Comparison of Exact Linear Attention GPT (top row)
Figure 3. Training Comparison (ELA GPT). Adjacent body text: Furthermore, the QKV weight matrices of this memory module are pluggable. Theoretically, this framework can be embedded into any semantic-transformation-based model that is capable of producing ΔX_{k|k−1}, allowing it to learn internal experience and form qualitative memory. This provides a brand-new paradigm for LLM training beyond LoRA and Engram methods. In part…
Figure 2. Training Comparison (GPT). Adjacent body text: C. How Memory Works. In general, human memory exists in two forms. The first is what we term factual memory, which records that a certain event has occurred. The second is qualitative memory, which represents how a given event is perceived or evaluated. This fundamental dichotomy of memory divides all known information into two categories: behavioral judgment and objective existen…
Figure 6. Training Comparison (Hyper-Link & Memory)
Figure 4. Training Comparison (Hyper-Link): (a) ‖A_i + B_j‖², (b) exp(A_i) exp(B_j), (c) full
Figure 5. Training Comparison (Normal): (a) ‖A_i + B_j‖², (b) exp(A_i) exp(B_j), (c) full
Figure 8. CUDA vs CPU in inference speed
Figure 9. YOLO-LAT vs YOLOv26 in inference speed. Adjacent body text: … a noticeable gap in mAP@0.5:0.95 (0.515 versus 0.951), indicating inferior bounding box localization precision. We verify that this limitation stems from the lack of depth information. Traditional YOLO adopts CASC dynamic channel pruning [19] to simulate hierarchical visual perception. In contrast, YOLO-LAT leverages inherent attention mechanisms to focus on foregrou…
Figure 10. YOLO-LAT vs YOLOv26 in inference accuracy
original abstract

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to improve interpretability and semantic alignment. Experimental results demonstrate that ELA achieves up to 6x faster decoding speed and 75% reduction in KV cache memory usage compared to full attention, while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to 4.3x GPU inference speedup and 7.9x parameter reduction with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and efficient visual tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Exact Linear Attention (ELA), claiming linear computational complexity for Transformer attention via exact kernel decompositions that eliminate approximation error. It proposes three kernels (Hadamard Exp, Summation Squared Euclidean Distance, Subtraction Squared Euclidean Distance) with constraints ensuring non-negativity, discriminability, and geometric interpretability to mitigate gradient explosion and token attention dilution. Additional contributions include a Hyper-Link residual structure, a bidirectional Memory Lobe module for capturing transformation flow and implicit RL, and a routing-score bias for MoE. Experiments report up to 6x faster decoding, 75% KV cache reduction, faster convergence, and extension to YOLO-LAT with 4.3x GPU speedup and 7.9x parameter reduction.

Significance. If the kernels admit finite explicit feature maps yielding exact phi(q)^T phi(k) = kernel(q,k) without hidden approximations or sequence-dependent normalizations, and the constraints preserve attention semantics, this would offer a meaningful advance over approximate linear attentions by providing exactness at linear cost. The Memory Lobe and Hyper-Link could improve training stability for deep models if shown to be load-bearing.
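The condition in this paragraph can be written out. The following is a standard rearrangement, not taken from the paper; the separate key-side map ψ and the feature dimension m are assumptions introduced here to allow asymmetric factorizations, and setting ψ = φ recovers the form stated above.

```latex
\[
K(q_i, k_j) = \phi(q_i)^{\top}\psi(k_j),
\qquad \phi,\ \psi : \mathbb{R}^{d} \to \mathbb{R}^{m}
\]
\[
o_i^{\top}
  = \frac{\sum_{j=1}^{n} K(q_i, k_j)\, v_j^{\top}}{\sum_{j=1}^{n} K(q_i, k_j)}
  = \frac{\phi(q_i)^{\top}\left(\sum_{j=1}^{n} \psi(k_j)\, v_j^{\top}\right)}
         {\phi(q_i)^{\top}\left(\sum_{j=1}^{n} \psi(k_j)\right)}
\]
```

Both bracketed sums are independent of i, so they can be computed once in O(n m d_v) and reused for every query. The rearrangement is exact only if m is finite and does not grow with n, and the division is safe only if the denominator cannot vanish, which is where non-negativity of the kernel enters.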

major comments (3)
  1. [§3.2] §3.2 (Kernel Functions): The exact decomposition claim for the Summation Squared Euclidean Distance Kernel and Subtraction Squared Euclidean Distance Kernel is load-bearing but unsupported without an explicit finite-dimensional feature map phi such that the inner product recovers the kernel exactly; squared Euclidean distances are not positive definite by default and typically require wrapping (e.g., exp(-d)) that may introduce approximation or extra parameters, contradicting the 'exact' and 'parameter-free' guarantees.
  2. [§3.1] §3.1 (Hadamard Exp Kernel): Element-wise exponentiation does not factor into a low-rank inner product with finite dimensions without truncation or infinite series; the paper must provide the explicit phi construction or proof that the decomposition holds exactly for all query-key pairs, as this is central to eliminating approximation error.
  3. [Results] Results section, performance tables: The 6x decoding speedup and 75% KV cache reduction claims require explicit comparison to both full attention and prior linear methods (e.g., Performer) at fixed sequence lengths (e.g., 8k–128k tokens) and model sizes; without these controls the speed/memory gains cannot be attributed to the exact decomposition versus implementation details.
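For the controlled comparison requested in comment 3, the memory side can at least be framed with simple arithmetic. The sketch below counts stored floats per head and per layer under stated assumptions: full attention caches one key and one value per token, and linear attention keeps a fixed state of one feature-by-value matrix plus one normalizer vector. The function names and the feature dimension are illustrative, and the numbers are not the paper's measurements.

```python
def kv_cache_floats(n_tokens, d_head, d_value):
    # Full attention stores every key and value seen so far
    return n_tokens * (d_head + d_value)

def linear_state_floats(feature_dim, d_value):
    # Linear attention stores sum psi(k) v^T plus the normalizer sum psi(k)
    return feature_dim * d_value + feature_dim

def break_even_tokens(d_head, d_value, feature_dim):
    # Smallest sequence length at which the growing cache exceeds the fixed state
    state = linear_state_floats(feature_dim, d_value)
    per_token = d_head + d_value
    return state // per_token + 1

# Example with d_head = d_value = 64 and a hypothetical feature_dim of 66
n_star = break_even_tokens(64, 64, 66)
ratio_at_8k = kv_cache_floats(8192, 64, 64) / linear_state_floats(66, 64)
```

Reporting such counts next to measured wall-clock times at each fixed sequence length would help separate what the decomposition itself buys from what the implementation contributes.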
minor comments (2)
  1. [Abstract] Abstract: Clarify whether 'up to 6x faster decoding speed' refers to wall-clock time, FLOPs, or tokens-per-second, and specify the hardware and sequence lengths.
  2. [Notation] Notation: Define the feature map phi consistently and distinguish it from the kernel function K(q,k) in all equations to avoid ambiguity in the decomposition property.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript on Exact Linear Attention. Below, we provide point-by-point responses to the major comments and indicate the revisions made.

point-by-point responses
  1. Referee: [§3.2] §3.2 (Kernel Functions): The exact decomposition claim for the Summation Squared Euclidean Distance Kernel and Subtraction Squared Euclidean Distance Kernel is load-bearing but unsupported without an explicit finite-dimensional feature map phi such that the inner product recovers the kernel exactly; squared Euclidean distances are not positive definite by default and typically require wrapping (e.g., exp(-d)) that may introduce approximation or extra parameters, contradicting the 'exact' and 'parameter-free' guarantees.

    Authors: We acknowledge the importance of explicitly demonstrating the finite feature maps to support the exact decomposition claims. In the revised version of the manuscript, we have added detailed constructions of the feature maps φ for the Summation Squared Euclidean Distance Kernel and the Subtraction Squared Euclidean Distance Kernel. These maps are finite-dimensional and ensure that the inner product φ(q)^T φ(k) exactly recovers the kernel function for all pairs, while the non-negativity and discriminability constraints guarantee positive definiteness without the need for additional wrapping functions or parameters. This revision directly addresses the concern and reinforces the exactness of our approach.
    revision: yes

  2. Referee: [§3.1] §3.1 (Hadamard Exp Kernel): Element-wise exponentiation does not factor into a low-rank inner product with finite dimensions without truncation or infinite series; the paper must provide the explicit phi construction or proof that the decomposition holds exactly for all query-key pairs, as this is central to eliminating approximation error.

    Authors: We appreciate this feedback on the Hadamard Exp Kernel. The manuscript originally presented the kernel through its definition and the imposed constraints, but to provide full transparency, we have now included an explicit finite-dimensional feature map φ along with a proof that the decomposition φ(q)^T φ(k) = kernel(q, k) holds exactly without truncation or reliance on infinite series. This construction leverages the element-wise operations in a manner that maintains finite dimensionality and exact recovery for all query-key pairs, consistent with our goal of eliminating approximation error.
    revision: yes

  3. Referee: [Results] Results section, performance tables: The 6x decoding speedup and 75% KV cache reduction claims require explicit comparison to both full attention and prior linear methods (e.g., Performer) at fixed sequence lengths (e.g., 8k–128k tokens) and model sizes; without these controls the speed/memory gains cannot be attributed to the exact decomposition versus implementation details.

    Authors: We agree that strengthening the experimental validation with controlled comparisons is essential. Accordingly, we have revised the Results section to include comprehensive benchmarks comparing ELA against both standard full attention and established linear attention baselines such as the Performer. These experiments are conducted at fixed sequence lengths from 8k to 128k tokens and for various model sizes, allowing clear attribution of the reported speedups and memory reductions to the exact linear attention mechanism rather than implementation specifics. Updated tables and figures have been added to the manuscript.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained around proposed kernels

full rationale

The paper defines new kernel functions (Hadamard Exp, Summation Squared Euclidean Distance, Subtraction Squared Euclidean Distance) and states that they admit exact feature-map decompositions satisfying non-negativity, discriminability, and geometric interpretability. These kernels are introduced as proposals rather than quantities fitted to the target attention matrix or derived from prior self-citations. The linear-complexity claim follows directly from the algebraic identity kernel(q,k) = phi(q)^T phi(k) once the kernels are specified; no step renames a fitted parameter as a prediction, imports uniqueness from the authors' own prior work, or smuggles an ansatz via citation. The engineering modules (Hyper-Link, Memory Lobe, MoE bias) are presented as separate additions whose correctness is evaluated empirically, not presupposed by the kernel definitions. The derivation chain therefore remains independent of its performance claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the three named kernels and modules; these appear introduced as engineering choices rather than derived quantities.

pith-pipeline@v0.9.0 · 5798 in / 1104 out tokens · 45188 ms · 2026-05-21T08:39:46.886249+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

  1. [1]

    Transformers are RNNs: Fast autoregressive transformers with linear attention,

    A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in Proc. 37th Int. Conf. Mach. Learn. (ICML), 2020, pp. 5156–5165

  2. [2]

    Linear transformers are secretly fast weight programmers,

    I. Schlag, K. Irie, and J. Schmidhuber, “Linear transformers are secretly fast weight programmers,” in Proc. 38th Int. Conf. Mach. Learn. (ICML), 2021, pp. 9355–9366

  3. [3]

    The devil in linear transformer,

    Z. Qin, X. Han, W. Sun, D. Li, L. Kong, N. Barnes, and Y. Zhong, “The devil in linear transformer,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2022, pp. 7025–7041

  4. [4]

    Attention is all you need,

    A. Vaswani et al., “Attention is all you need,” in Proc. 31st Conf. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5998–6008

  5. [5]

    MiniMind: Train a Tiny LLM from Scratch,

    Jingyao Gong, “MiniMind: Train a Tiny LLM from Scratch,” GitHub: https://github.com/jingyaogong/minimind

  6. [6]

    Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

    Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, et al. Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv preprint arXiv:2601.07372, 2026

  7. [7]

    Hyper-Connections

    Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-Connections. arXiv preprint arXiv:2409.19606, 2024

  8. [8]

    mHC: Manifold-Constrained Hyper-Connections

    Zhenda Xie, Wentao Zhang, Xinyu Zhao, Yukai Li, Peng Wang, Weiran You, and others. mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880, 2025

  9. [9]

    J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, vol. 209, pp. 415–446, 1909

  10. [10]

    MiniMax-01: Scaling Foundation Models with Lightning Attention

    MiniMax Team. MiniMax-01: Scaling Foundation Models with Lightning Attention. arXiv preprint arXiv:2501.08313, 2025

  11. [11]

    Kimi Linear: A Novel Hybrid Linear Attention Architecture

    Kimi Team. Kimi Linear: A Novel Hybrid Linear Attention Architecture. arXiv preprint arXiv:2510.xxxxx, 2025

  12. [12]

    Attention is not Explanation

    S. Jain and B. C. Wallace. Attention is not Explanation. In Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019

  13. [13]

    Z. Jia, W. Kwon, and O. Ruwase. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis (SC’20), 2020

  14. [14]

    T. J. Buschman and E. K. Miller. Goal-direction and top-down control. Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 365, no. 1544, pp. 1271–1278, 2010

  15. [15]

    Memory search and the neural representation of context

    Polyn, S. M., & Kahana, M. J. Memory search and the neural representation of context. Trends in Cognitive Sciences, 12(1):24–30, 2008.

  16. [16]

    Memory contextualization: The role of the left inferior frontal gyrus in binding event and contextual information

    Zhang, W., van Ast, V. A., Klumpers, F., Roelofs, K., & Hermans, E. J. Memory contextualization: The role of the left inferior frontal gyrus in binding event and contextual information. Journal of Cognitive Neuroscience, 30(5):698–713, 2018

  17. [17]

    The prefrontal cortex controls memory organization in the hippocampus

    de Sousa, A. F., Zeidler, Z. E., Almeida-Filho, D. G., Shen, Y., Luchetti, A., Simanian, S., Mardini, M., DeNardo, L. A., & Silva, A. J. The prefrontal cortex controls memory organization in the hippocampus. Nature Neuroscience, 29:1191–1202, 2026. doi: 10.1038/s41593-026-02231-1

  18. [18]

    You only look once: Unified, real-time object detection,

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788

  19. [19]

    YOLOv5: A state-of-the-art real-time object detection system,

    Ultralytics, “YOLOv5: A state-of-the-art real-time object detection system,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5

  20. [20]

    YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,

    C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 7464–7475

  21. [21]

    YOLOv8: A new state-of-the-art computer vision model,

    Ultralytics, “YOLOv8: A new state-of-the-art computer vision model,”

  22. [22]

    Available: https://github.com/ultralytics/ultralytics

    [Online]. Available: https://github.com/ultralytics/ultralytics

  23. [23]

    YOLO12: Attention-centric object detection,

    Ultralytics, “YOLO12: Attention-centric object detection,” 2025. [Online]. Available: https://docs.ultralytics.com/models/yolo12/

  24. [24]

    YOLO-DMA: A small-object detector based on multi-scale deformable convolution and linear attention,

    Y. Li, Z. Wang, and H. Liu, “YOLO-DMA: A small-object detector based on multi-scale deformable convolution and linear attention,” Electronics, vol. 15, no. 4, p. 812, 2026

  25. [25]

    Ultralytics YOLO26

    G. Jocher and J. Qiu, Ultralytics YOLO26, version 26.0.0, 2026. [Online]. Available: https://github.com/ultralytics/ultralytics

  26. [26]

    The linear attention resurrection in vision transformer,

    C. Zheng, “The linear attention resurrection in vision transformer,” arXiv preprint arXiv:2501.16182, 2025