LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel

Changwei Wang; Muyang Zhang; Rongtao Xu; Sen Lian; Tianlong Tan; Weiliang Meng; Xiaopeng Zhang; Zhe Feng

arxiv: 2604.20368 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI

Zhe Feng, Sen Lian, Changwei Wang, Muyang Zhang, Tianlong Tan, Rongtao Xu, Weiliang Meng, Xiaopeng Zhang

Pith reviewed 2026-05-10 00:25 UTC · model grok-4.3

keywords: linear attention · Laplacian kernel · vision transformer · Nyström approximation · Newton-Schulz iteration · attention mechanism · ImageNet · efficient transformer

The pith

A Laplacian kernel replaces softmax in attention to achieve linear complexity while retaining expressiveness in vision transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision transformers face quadratic costs in attention that limit high-resolution use. LaplacianFormer replaces softmax with a Laplacian kernel, motivated by observations that Gaussian kernels suppress mid-range token interactions and by supporting theory. The design adds a provably injective feature map to preserve token details under approximation, then applies Nyström kernel approximation solved via Newton-Schulz iteration with custom CUDA kernels for speed. If the approach holds, transformers can scale to larger images with competitive ImageNet accuracy and better efficiency than prior linear attention methods.
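
To make the mid-range argument concrete, the following minimal Python sketch compares how quickly the two kernels decay with distance. It assumes the common forms exp(-‖q-k‖₂²/(2σ²)) for the Gaussian kernel and exp(-‖q-k‖₁/λ) for the Laplacian kernel. The bandwidth `sigma`, the scale `lam`, the function names, and the one-dimensional toy tokens are illustrative assumptions, not values or code taken from the paper.

```python
import numpy as np

def gaussian_kernel(q, k, sigma=1.0):
    """exp(-||q - k||_2^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    d2 = np.sum((q - k) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def laplacian_kernel(q, k, lam=1.0):
    """exp(-||q - k||_1 / lam); lam is an assumed scale."""
    d1 = np.sum(np.abs(q - k), axis=-1)
    return np.exp(-d1 / lam)

# One-dimensional toy tokens at increasing distance from a query at 0.
query = np.zeros((1, 1))
keys = np.array([[0.5], [3.0], [6.0]])

gauss = gaussian_kernel(query, keys)
lap = laplacian_kernel(query, keys)
for dist, g, l in zip(keys[:, 0], gauss, lap):
    print(f"distance={dist:.1f}  gaussian={g:.2e}  laplacian={l:.2e}")
```

Under these assumed settings the Gaussian weight is larger at the shortest distance, while the Laplacian weight is larger at the two longer distances. This is the qualitative behavior the motivation relies on; it says nothing about how the kernels behave on real query-key distributions.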

Core claim

LaplacianFormer employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis, together with a provably injective feature map, Nyström approximation, and Newton-Schulz solver, achieving strong performance-efficiency trade-offs on ImageNet while improving attention expressiveness.

What carries the argument

Laplacian kernel paired with a provably injective feature map, Nyström approximation, and Newton-Schulz solver for linear attention computation.

If this is right

Attention computation scales linearly with token count, supporting higher-resolution inputs without quadratic blowup.
Mid-range token dependencies receive stronger weighting than under Gaussian kernels.
The injective feature map prevents loss of fine-grained token information during low-rank approximation.
Newton-Schulz iteration plus custom CUDA kernels deliver high-throughput forward and backward passes suitable for edge hardware.
Overall model accuracy on ImageNet remains competitive while efficiency improves over softmax baselines.
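
The linear scaling in the first point follows from never materializing the n × n attention matrix. The sketch below shows one way this can work with a Nyström factorization through m landmark tokens. The helper names, the random choice of landmarks, the ridge `eps`, and the use of `np.linalg.solve` in place of the paper's Newton-Schulz solver are all assumptions made for illustration, and the paper's injective feature map is not modeled.

```python
import numpy as np

def laplacian_gram(a, b, lam=1.0):
    """Pairwise exp(-||a_i - b_j||_1 / lam) between rows of a and rows of b."""
    d1 = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)
    return np.exp(-d1 / lam)

def nystrom_attention(q, k, v, landmarks, lam=1.0, eps=1e-6):
    """Normalized kernel attention through m landmarks, without an n x n matrix."""
    k_ql = laplacian_gram(q, landmarks, lam)            # (n, m)
    k_lk = laplacian_gram(landmarks, k, lam)            # (m, n)
    w = laplacian_gram(landmarks, landmarks, lam)       # (m, m)
    w = w + eps * np.eye(w.shape[0])                    # small ridge for stability
    # Append a column of ones so one solve yields both numerator and row sums.
    v_ext = np.hstack([v, np.ones((v.shape[0], 1))])    # (n, d + 1)
    sol = np.linalg.solve(w, k_lk @ v_ext)              # (m, d + 1)
    out = k_ql @ sol                                    # (n, d + 1)
    return out[:, :-1] / out[:, -1:]

rng = np.random.default_rng(0)
n, m, d = 64, 16, 4
q, k = rng.random((n, d)), rng.random((n, d))
v = rng.standard_normal((n, d))
landmarks = k[rng.choice(n, size=m, replace=False)]     # assumed landmark choice
approx = nystrom_attention(q, k, v, landmarks, lam=4.0)
print(approx.shape)  # (64, 4)
```

Every product involves at most an n × m or m × m matrix, so for fixed m the cost grows linearly in the token count n.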

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same kernel-plus-solver pattern could be tested in non-vision transformers where mid-range dependencies matter.
Newton-Schulz iteration might accelerate other kernel-matrix operations inside deep networks beyond attention.
Hybrid models could combine Laplacian attention layers with standard softmax layers for tasks needing both long-range and local focus.
The efficiency gains suggest practical deployment on resource-limited devices that current quadratic transformers cannot reach.

Load-bearing premise

That the Laplacian kernel, when paired with the proposed feature map and approximations, genuinely improves mid-range token interactions and overall expressiveness compared with Gaussian kernels, and that the claimed theoretical grounding and injectivity hold in the actual model implementation.

What would settle it

A side-by-side ImageNet experiment in which an equivalently approximated Gaussian-kernel linear attention model matches or exceeds LaplacianFormer accuracy at the same throughput would falsify the claim of superior expressiveness.

Figures

Figures reproduced from arXiv: 2604.20368 by Changwei Wang, Muyang Zhang, Rongtao Xu, Sen Lian, Tianlong Tan, Weiliang Meng, Xiaopeng Zhang, Zhe Feng.

**Figure 1.** Distributions of ℓ₁ and ℓ₂² Q-K distances in DeiT, PVT, and Swin Transformers. Theoretically, the Gaussian kernel presumes that query-key similarity should decay rapidly with increasing ℓ₂² distance. However, this assumption may not reflect the actual distribution of query-key interactions in vision Transformers. To investigate this issue, we analyze the empirical distribution of query-key distances in…

**Figure 2.** (a) Top-1 accuracy (%) over training epochs on ImageNet. The left plot shows results…

**Figure 3.** Accuracy and Memory Comparison. (a) Top-1 accuracy vs. FLOPs on ImageNet-1k Deng…

**Figure 4.** Comparison between Softmax Self-Attention (left) and Linear Self-Attention (right). The…

**Figure 5.** Execution time breakdown of custom CUDA kernels. Comparison of forward and backward execution time for Newton-Schulz iteration (left) and Laplacian kernel (right) across different matrix sizes (batch = 1, 2 heads, 32 channels). CUDA execution times (< 0.05 ms) are shown as 0.0 due to timing resolution limits. … definite, we apply a small diagonal perturbation W ← W + ϵI, with ϵ > 0, preserving the structure w…

**Figure 6.** Convergence behavior of Newton-Schulz and conjugate gradient methods under varying…

**Figure 7.** Visualization of attention maps under different Laplacian kernel scales λ. From left to right: λ = 0.5, 1, 2, 4, 8. … 6 CONCLUSIONS AND FUTURE WORK. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel to construct injective and normalized attention, enabling fine-grained token discrimination with linear complexity. To ensure scalability, we adopt the Nyström approximation and ac…

Original abstract

The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton-Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.
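
The Newton-Schulz step mentioned in the abstract can be illustrated with the textbook iteration X ← X(2I − AX), which approaches A⁻¹ using only matrix multiplications. The sketch below is a generic NumPy version under stated assumptions, not the authors' CUDA solver: the starting guess Aᵀ/(‖A‖₁‖A‖∞), the iteration count, the toy landmark matrix, and the value of `eps` are all chosen here for illustration. The diagonal shift W ← W + ϵI mirrors the perturbation mentioned in the text captured alongside Figure 5.

```python
import numpy as np

def newton_schulz_inverse(a, num_iters=30):
    """Approximate inv(a) with X <- X (2I - a X), using only matmuls."""
    eye = np.eye(a.shape[0])
    # Classical safe start: guarantees convergence for any nonsingular a.
    x = a.T / (np.linalg.norm(a, 1) * np.linalg.norm(a, np.inf))
    for _ in range(num_iters):
        x = x @ (2.0 * eye - a @ x)
    return x

rng = np.random.default_rng(0)
m, d, lam, eps = 8, 4, 1.0, 1e-3          # toy sizes and scales (assumed)
landmarks = rng.random((m, d))
dist = np.abs(landmarks[:, None, :] - landmarks[None, :, :]).sum(axis=-1)
w = np.exp(-dist / lam) + eps * np.eye(m)  # W <- W + eps * I
w_inv = newton_schulz_inverse(w)
print(np.max(np.abs(w @ w_inv - np.eye(m))))  # residual of the approximate inverse
```

The iteration converges quadratically once the residual ‖I − AX‖ drops below one, so the number of steps needed in practice depends on how well conditioned W is. A fixed count of 30 is deliberately generous for this small example.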

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LaplacianFormer swaps Gaussian for Laplacian kernels in linear attention and adds an injective feature map, but the Nyström and Newton-Schulz steps likely undermine the claimed theoretical benefits.

LaplacianFormer replaces the usual Gaussian kernel in linear attention with a Laplacian one, motivated by the idea that it better preserves mid-range token interactions in vision transformers. The authors add a provably injective feature map to offset degradation from low-rank approximations, then apply Nyström approximation to the kernel matrix and Newton-Schulz iteration to avoid explicit inversion. Custom CUDA kernels are included for speed on edge hardware, with ImageNet results claimed to show solid efficiency trade-offs.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LaplacianFormer, a Transformer variant for vision tasks that replaces softmax attention with a Laplacian kernel to achieve linear complexity. It motivates the choice via empirical observations and theoretical analysis, introduces a provably injective feature map to retain fine-grained token information under low-rank approximations, adopts Nyström approximation of the kernel matrix solved via Newton-Schulz iteration (with custom CUDA kernels), and reports strong performance-efficiency trade-offs on ImageNet while claiming improved attention expressiveness over Gaussian-kernel baselines.

Significance. If the claimed theoretical properties and experimental gains hold, the work could provide a more principled linear-attention alternative that better preserves mid-range token interactions than existing Gaussian-kernel methods, with potential benefits for high-resolution vision Transformers and edge deployment.

major comments (2)

[Sections describing the feature map, Nyström approximation, and Newton-Schulz solver (likely §3)] The central claim requires that the provably injective feature map retains its properties (and thus mid-range expressiveness) after Nyström low-rank approximation plus Newton-Schulz iteration. The paper introduces the injective map specifically to counteract degradation from low-rank approximations, yet neither step is shown to commute with or preserve the injectivity property in the actual attention output. An explicit check (e.g., distance-dependent attention weight preservation on toy token sets before/after approximation) is needed.
[Abstract and experimental results section] The abstract asserts theoretical analysis, a provable property, and experimental gains on ImageNet, but the provided text supplies no derivations, proofs, quantitative results, baselines, or error bars. Without these, the claims of improved expressiveness and strong trade-offs cannot be verified.
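
The explicit check requested in the first major comment could take roughly the following shape. This is a hypothetical diagnostic sketched in NumPy, not an experiment from the paper: it assumes queries equal keys, random toy tokens, randomly sampled landmarks, a direct solve instead of Newton-Schulz, and arbitrary distance bins, and it omits the paper's feature map entirely. It forms the full n × n matrices, which is acceptable only because it is a small offline test.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, lam = 128, 32, 2, 1.0               # toy sizes and scale (assumed)
x = rng.random((n, d)) * 4.0                  # toy tokens; queries = keys here

# Exact Laplacian kernel weights from pairwise l1 distances.
dist = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)
exact = np.exp(-dist / lam)

# Nystrom approximation through m randomly chosen landmark tokens.
idx = rng.choice(n, size=m, replace=False)
w = exact[np.ix_(idx, idx)] + 1e-6 * np.eye(m)
approx = exact[:, idx] @ np.linalg.solve(w, exact[idx, :])

# Mean kernel weight per distance bin, before and after approximation.
edges = np.array([0.0, 1.0, 2.0, 4.0, np.inf])
bins = np.digitize(dist, edges) - 1
for b in range(len(edges) - 1):
    mask = bins == b
    print(f"l1 distance in [{edges[b]}, {edges[b + 1]}): "
          f"exact={exact[mask].mean():.4f}  approx={approx[mask].mean():.4f}")
```

Comparing the two columns bin by bin shows whether the approximation distorts short-, mid-, or long-range weights. Running the same script with a Gaussian kernel in place of `exact` would give the side-by-side comparison the referee asks for.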

minor comments (1)

Ensure all theoretical claims (injectivity proof, motivation for Laplacian over Gaussian) are accompanied by clear derivations or proof sketches in the main text or appendix, with explicit statements of assumptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and committing to targeted revisions that strengthen the presentation of our theoretical and empirical contributions without altering the core claims.

Referee: [Sections describing the feature map, Nyström approximation, and Newton-Schulz solver (likely §3)] The central claim requires that the provably injective feature map retains its properties (and thus mid-range expressiveness) after Nyström low-rank approximation plus Newton-Schulz iteration. The paper introduces the injective map specifically to counteract degradation from low-rank approximations, yet neither step is shown to commute with or preserve the injectivity property in the actual attention output. An explicit check (e.g., distance-dependent attention weight preservation on toy token sets before/after approximation) is needed.

Authors: We agree that an explicit verification of property preservation under the combined approximations is valuable for rigor. The injectivity proof holds for the exact Laplacian kernel, and our design of the feature map was intended to mitigate low-rank effects, but we did not include a direct before/after comparison on toy data. In the revision, we will add a new subsection (likely in §3.3) with a controlled toy experiment on synthetic token sets that measures distance-dependent attention weight preservation before and after Nyström + Newton-Schulz, confirming that mid-range interactions remain better retained than in Gaussian baselines. revision: yes
Referee: [Abstract and experimental results section] The abstract asserts theoretical analysis, a provable property, and experimental gains on ImageNet, but the provided text supplies no derivations, proofs, quantitative results, baselines, or error bars. Without these, the claims of improved expressiveness and strong trade-offs cannot be verified.

Authors: The full manuscript (Sections 3 and 4 plus appendix) contains the complete theoretical derivations, injectivity proof, Nyström/Newton-Schulz analysis, ImageNet results with multiple baselines (including Gaussian linear attention variants), quantitative metrics, and error bars from repeated runs. The abstract is intentionally concise; however, we will revise it to more explicitly reference the key theoretical guarantees and performance trade-offs while ensuring the main text highlights the supporting evidence. We will also add a brief summary paragraph at the end of the introduction that cross-references the proofs and tables. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

The paper introduces a Laplacian kernel as an alternative to softmax, motivated by empirical and theoretical considerations, along with a new provably injective feature map, Nyström approximation, and Newton-Schulz solver. These are presented as novel components rather than re-derivations of prior results. No equations, predictions, or claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the central claims rest on independent theoretical grounding and standard approximation techniques applied to the proposed kernel. The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the Laplacian kernel is a principled and superior replacement for softmax or Gaussian kernels in attention, plus the existence of a provably injective feature map that preserves information. No explicit free parameters are described in the abstract.

axioms (1)

domain assumption: Laplacian kernel provides better mid-range token interactions than Gaussian kernels without oversuppression
Stated as motivated by empirical observations and theoretical analysis in the abstract.

invented entities (1)

Provably injective feature map for the Laplacian kernel (no independent evidence)
purpose: Retains fine-grained token information under low-rank approximations to avoid expressiveness degradation
Introduced to solve a stated limitation of low-rank kernel approximations; no independent evidence outside the paper is given.



work page 2024

[14] [14]

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , year=

RBSFormer: Enhanced Transformer Network for Raw Image Super-Resolution , author=. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , year=

work page 2024

[15] [15]

International Conference on Algorithmic Learning Theory , year=

On The Computational Complexity of Self-Attention , author=. International Conference on Algorithmic Learning Theory , year=

work page

[16] [16]

Neural Information Processing Systems , year=

Attention is All you Need , author=. Neural Information Processing Systems , year=

work page

[17] [17]

9th International Conference on Learning Representations,

Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby , title =. 9th International Conference on Learning Representations,

work page

[18] [18]

2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , author=. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

work page 2021

[19] [19]

Computational Visual Media , year=

PVT v2: Improved baselines with Pyramid Vision Transformer , author=. Computational Visual Media , year=

work page

[20] [20]

International Conference on Machine Learning , year=

Training data-efficient image transformers & distillation through attention , author=. International Conference on Machine Learning , year=

work page

[21] [21]

2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

Going deeper with Image Transformers , author=. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

work page 2021

[22] [22]

2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , author=. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

work page 2021

[23] [23]

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Swin Transformer V2: Scaling Up Capacity and Resolution , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2022

[24] [24]

European Conference on Computer Vision , year=

DeiT III: Revenge of the ViT , author=. European Conference on Computer Vision , year=

work page

[25] [25]

9th International Conference on Learning Representations,

Xizhou Zhu and Weijie Su and Lewei Lu and Bin Li and Xiaogang Wang and Jifeng Dai , title =. 9th International Conference on Learning Representations,

work page

[26] [26]

Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel Ni and Heung-Yeung Shum , booktitle=

work page

[27] [27]

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , author=. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2021

[28] [28]

Neural Information Processing Systems , year=

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers , author=. Neural Information Processing Systems , year=

work page

[29] [29]

Neural Information Processing Systems , year=

Per-Pixel Classification is Not All You Need for Semantic Segmentation , author=. Neural Information Processing Systems , year=

work page

[30] [30]

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Masked-attention Mask Transformer for Universal Image Segmentation , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2022

[31] [31]

Proceedings of the AAAI conference on artificial intelligence , volume=

Head-free lightweight semantic segmentation with linear transformer , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page

[32] [32]

Advances in Neural Information Processing Systems , volume=

Soft: Softmax-free transformer with linear complexity , author=. Advances in Neural Information Processing Systems , volume=

work page

[33] [33]

Neural Information Processing Systems , year=

QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion , author=. Neural Information Processing Systems , year=

work page

[34] [34]

Proxyformer: Nystr

Sangho Lee and Hayun Lee and Dongkun Shin , booktitle=. Proxyformer: Nystr

work page

[35] [35]

European Conference on Computer Vision , pages=

Agent attention: On the integration of softmax and linear attention , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024

[36] [36]

NeurIPS , year=

Bridging the Divide: Reconsidering Softmax and Linear Attention , author=. NeurIPS , year=

work page

[37] [37]

Christopher K. I. Williams and Matthias W. Seeger , booktitle=. Using the Nystr

work page

[38] [38]

Antoine Chatalic and Nicolas Schreuder and Alessandro Rudi and Lorenzo Rosasco , booktitle=. Nystr

work page

[39] [39]

1997 , publisher=

Iterative Methods for Solving Linear Systems , author=. 1997 , publisher=

work page 1997

[40] [40]

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

MobileOne: An Improved One millisecond Mobile Backbone , author=. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2023

[41] [41]

2009 IEEE Conference on Computer Vision and Pattern Recognition , year=

ImageNet: A large-scale hierarchical image database , author=. 2009 IEEE Conference on Computer Vision and Pattern Recognition , year=

work page 2009

[42] [42]

International Conference on Learning Representations , year=

Long Range Arena : A Benchmark for Efficient Transformers , author=. International Conference on Learning Representations , year=

work page

[43] [43]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...

work page doi:10.18653/v1/n19-1423 2019

[44] [44]

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

MetaFormer is Actually What You Need for Vision , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2022

[45] [45]

International Conference on Learning Representations , year=

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer , author=. International Conference on Learning Representations , year=

work page

[46] [46]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

work page

[47] [47]

ECCV Workshops , year=

Hydra Attention: Efficient Attention with Many Heads , author=. ECCV Workshops , year=

work page

[48] [48]

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2022

[49] [49]

ArXiv , year=

Linformer: Self-Attention with Linear Complexity , author=. ArXiv , year=

work page

[50] [50]

Longformer: The Long-Document Transformer

Longformer: The Long-Document Transformer , author=. arXiv:2004.05150 , year=

work page internal anchor Pith review arXiv 2004

[51] [51]

International Conference on Learning Representations , year=

Rethinking Attention with Performers , author=. International Conference on Learning Representations , year=

work page

[52] [52]

Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Moo Fung and Yin Li and Vikas Singh , journal=. Nystr. 2021 , volume=

work page 2021

[53] [53]

International Conference on Learning Representations , year=

Reformer: The Efficient Transformer , author=. International Conference on Learning Representations , year=

work page

[54] [54]

2025 , booktitle=

Breaking the Low-Rank Dilemma of Linear Attention , author=. 2025 , booktitle=

work page 2025

[55] [55]

Automatic differentiation in PyTorch , author=

work page

[56] [56]

Proceedings of the IEEE/CVF International Conference on Computer Vision , year =

Pavan Kumar Anasosalu Vasu and James Gabriel and Jeff Zhu and Oncel Tuzel and Anurag Ranjan , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision , year =

work page

[57] [57]

2024 , issn =

RoFormer: Enhanced transformer with Rotary Position Embedding , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.neucom.2023.127063 , author =

work page doi:10.1016/j.neucom.2023.127063 2024

[58] [58]

2023 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

FLatten Transformer: Vision Transformer using Focused Linear Attention , author=. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) , year=

work page 2023

[59] [59]

International Conference on Machine Learning , year=

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization , author=. International Conference on Machine Learning , year=

work page

[60] [60]

International Journal of Computer Vision , volume =

Jiachen Lu and Junge Zhang and Xiatian Zhu and Jianfeng Feng and Tao Xiang and Li Zhang , title =. International Journal of Computer Vision , volume =. 2024 , month = aug, doi =

work page 2024

[61] [61]

2017 IEEE International Conference on Computer Vision (ICCV) , year=

Mask R-CNN , author=. 2017 IEEE International Conference on Computer Vision (ICCV) , year=

work page 2017

[62] [62]

2017 IEEE International Conference on Computer Vision (ICCV) , year=

Focal Loss for Dense Object Detection , author=. 2017 IEEE International Conference on Computer Vision (ICCV) , year=

work page 2017

[63] [63]

Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr

Yifan Chen and Qi Zeng and Heng Ji and Yun Yang , booktitle=. Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr

work page

[64] [64]

ArXiv , year=

Revisiting Kernel Attention with Correlated Gaussian Process Representation , author=. ArXiv , year=

work page

[65] [65]

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=

Gaussian Kernelized Self-Attention for Long Sequence Data and its Application to CTC-Based Speech Recognition , author=. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=

work page 2021

[66] [66]

NeurIPS , year=

Demystify Mamba in Vision: A Linear Attention Perspective , author=. NeurIPS , year=

work page

[67] [67]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Vision Transformer with Super Token Sampling , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

work page

[68] [68]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Lei Zhu and Xinjiang Wang and Zhanghan Ke and Wayne Zhang and Rynson Lau , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

work page

[69] [69]

International Conference on Learning Representations , year=

MogaNet: Multi-order Gated Aggregation Network , author=. International Conference on Learning Representations , year=

work page

[70] [70]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Neighborhood Attention Transformer , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2023 , pages =

work page 2023

[71] [71]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Deep Long-Tailed Learning: A Survey , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page