LaplacianFormer:Rethinking Linear Attention with Laplacian Kernel
Pith reviewed 2026-05-10 00:25 UTC · model grok-4.3
The pith
A Laplacian kernel replaces softmax in attention to achieve linear complexity while retaining expressiveness in vision transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LaplacianFormer employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis, together with a provably injective feature map, Nyström approximation, and Newton-Schulz solver, achieving strong performance-efficiency trade-offs on ImageNet while improving attention expressiveness.
What carries the argument
Laplacian kernel paired with a provably injective feature map, Nyström approximation, and Newton-Schulz solver for linear attention computation.
If this is right
- Attention computation scales linearly with token count, supporting higher-resolution inputs without quadratic blowup.
- Mid-range token dependencies receive stronger weighting than under Gaussian kernels.
- The injective feature map prevents loss of fine-grained token information during low-rank approximation.
- Newton-Schulz iteration plus custom CUDA kernels deliver high-throughput forward and backward passes suitable for edge hardware.
- Overall model accuracy on ImageNet remains competitive while efficiency improves over softmax baselines.
Where Pith is reading between the lines
- The same kernel-plus-solver pattern could be tested in non-vision transformers where mid-range dependencies matter.
- Newton-Schulz iteration might accelerate other kernel-matrix operations inside deep networks beyond attention.
- Hybrid models could combine Laplacian attention layers with standard softmax layers for tasks needing both long-range and local focus.
- The efficiency gains suggest practical deployment on resource-limited devices that current quadratic transformers cannot reach.
Load-bearing premise
That the Laplacian kernel, when paired with the proposed feature map and approximations, genuinely improves mid-range token interactions and overall expressiveness compared with Gaussian kernels, and that the claimed theoretical grounding and injectivity hold in the actual model implementation.
What would settle it
A side-by-side ImageNet experiment in which an equivalently approximated Gaussian-kernel linear attention model matches or exceeds LaplacianFormer accuracy at the same throughput would falsify the claim of superior expressiveness.
Figures
read the original abstract
The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nystr\"om approximation of the kernel matrix and solve the resulting system using Newton--Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LaplacianFormer, a Transformer variant for vision tasks that replaces softmax attention with a Laplacian kernel to achieve linear complexity. It motivates the choice via empirical observations and theoretical analysis, introduces a provably injective feature map to retain fine-grained token information under low-rank approximations, adopts Nyström approximation of the kernel matrix solved via Newton-Schulz iteration (with custom CUDA kernels), and reports strong performance-efficiency trade-offs on ImageNet while claiming improved attention expressiveness over Gaussian-kernel baselines.
Significance. If the claimed theoretical properties and experimental gains hold, the work could provide a more principled linear-attention alternative that better preserves mid-range token interactions than existing Gaussian-kernel methods, with potential benefits for high-resolution vision Transformers and edge deployment.
major comments (2)
- [Sections describing the feature map, Nyström approximation, and Newton-Schulz solver (likely §3)] The central claim requires that the provably injective feature map retains its properties (and thus mid-range expressiveness) after Nyström low-rank approximation plus Newton-Schulz iteration. The paper introduces the injective map specifically to counteract degradation from low-rank approximations, yet neither step is shown to commute with or preserve the injectivity property in the actual attention output. An explicit check (e.g., distance-dependent attention weight preservation on toy token sets before/after approximation) is needed.
- [Abstract and experimental results section] The abstract asserts theoretical analysis, a provable property, and experimental gains on ImageNet, but the provided text supplies no derivations, proofs, quantitative results, baselines, or error bars. Without these, the claims of improved expressiveness and strong trade-offs cannot be verified.
minor comments (1)
- Ensure all theoretical claims (injectivity proof, motivation for Laplacian over Gaussian) are accompanied by clear derivations or proof sketches in the main text or appendix, with explicit statements of assumptions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, providing clarifications and committing to targeted revisions that strengthen the presentation of our theoretical and empirical contributions without altering the core claims.
read point-by-point responses
-
Referee: [Sections describing the feature map, Nyström approximation, and Newton-Schulz solver (likely §3)] The central claim requires that the provably injective feature map retains its properties (and thus mid-range expressiveness) after Nyström low-rank approximation plus Newton-Schulz iteration. The paper introduces the injective map specifically to counteract degradation from low-rank approximations, yet neither step is shown to commute with or preserve the injectivity property in the actual attention output. An explicit check (e.g., distance-dependent attention weight preservation on toy token sets before/after approximation) is needed.
Authors: We agree that an explicit verification of property preservation under the combined approximations is valuable for rigor. The injectivity proof holds for the exact Laplacian kernel, and our design of the feature map was intended to mitigate low-rank effects, but we did not include a direct before/after comparison on toy data. In the revision, we will add a new subsection (likely in §3.3) with a controlled toy experiment on synthetic token sets that measures distance-dependent attention weight preservation before and after Nyström + Newton-Schulz, confirming that mid-range interactions remain better retained than in Gaussian baselines. revision: yes
-
Referee: [Abstract and experimental results section] The abstract asserts theoretical analysis, a provable property, and experimental gains on ImageNet, but the provided text supplies no derivations, proofs, quantitative results, baselines, or error bars. Without these, the claims of improved expressiveness and strong trade-offs cannot be verified.
Authors: The full manuscript (Sections 3 and 4 plus appendix) contains the complete theoretical derivations, injectivity proof, Nyström/Newton-Schulz analysis, ImageNet results with multiple baselines (including Gaussian linear attention variants), quantitative metrics, and error bars from repeated runs. The abstract is intentionally concise; however, we will revise it to more explicitly reference the key theoretical guarantees and performance trade-offs while ensuring the main text highlights the supporting evidence. We will also add a brief summary paragraph at the end of the introduction that cross-references the proofs and tables. revision: partial
Circularity Check
No significant circularity in the derivation chain.
full rationale
The paper introduces a Laplacian kernel as an alternative to softmax, motivated by empirical and theoretical considerations, along with a new provably injective feature map, Nyström approximation, and Newton-Schulz solver. These are presented as novel components rather than re-derivations of prior results. No equations, predictions, or claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the central claims rest on independent theoretical grounding and standard approximation techniques applied to the proposed kernel. The derivation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Laplacian kernel provides better mid-range token interactions than Gaussian kernels without oversuppression
invented entities (1)
-
Provably injective feature map for the Laplacian kernel
no independent evidence
Reference graph
Works this paper leans on
-
[1]
The Eleventh International Conference on Learning Representations , year=
HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer , author=. The Eleventh International Conference on Learning Representations , year=
-
[2]
The Twelfth International Conference on Learning Representations , year=
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry , author=. The Twelfth International Conference on Learning Representations , year=
-
[3]
The Tenth International Conference on Learning Representations , year=
cosFormer: Rethinking Softmax In Attention , author=. The Tenth International Conference on Learning Representations , year=
-
[4]
The Thirteenth International Conference on Learning Representations , year=
PolaFormer: Polarity-aware Linear Attention for Vision Transformers , author=. The Thirteenth International Conference on Learning Representations , year=
-
[5]
The Thirteenth International Conference on Learning Representations , year=
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures , author=. The Thirteenth International Conference on Learning Representations , year=
-
[6]
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Learning Correlation Structures for Vision Transformers , author=. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2024
-
[7]
International Conference on Machine Learning , year=
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention , author=. International Conference on Machine Learning , year=
-
[8]
International Conference on Machine Learning , year=
Linear Complexity Randomized Self-attention Mechanism , author=. International Conference on Machine Learning , year=
-
[9]
Smith and Lingpeng Kong , title =
Hao Peng and Nikolaos Pappas and Dani Yogatama and Roy Schwartz and Noah A. Smith and Lingpeng Kong , title =. 9th International Conference on Learning Representations,
-
[10]
Hassani, Ali and Hwu, Wen-mei and Shi, Humphrey , booktitle =. Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level , volume =
-
[11]
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
ScanFormer: Referring Expression Comprehension by Iteratively Scanning , author=. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2024
-
[12]
ICLR 2024 Workshop on Reliable and Responsible Foundation Models , year=
ProTransformer: Robustify Transformers via Plug-and-Play Paradigm , author=. ICLR 2024 Workshop on Reliable and Responsible Foundation Models , year=
work page 2024
-
[13]
Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation
Yu, Hyunwoo and Cho, Yubin and Kang, Beoungwoo and Moon, Seunghun and Kong, Kyeongbo and Kang, Suk-Ju. Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation. Computer Vision -- ECCV 2024. 2025
work page 2024
-
[14]
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , year=
RBSFormer: Enhanced Transformer Network for Raw Image Super-Resolution , author=. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , year=
work page 2024
-
[15]
International Conference on Algorithmic Learning Theory , year=
On The Computational Complexity of Self-Attention , author=. International Conference on Algorithmic Learning Theory , year=
-
[16]
Neural Information Processing Systems , year=
Attention is All you Need , author=. Neural Information Processing Systems , year=
-
[17]
9th International Conference on Learning Representations,
Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby , title =. 9th International Conference on Learning Representations,
-
[18]
2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , author=. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
work page 2021
-
[19]
Computational Visual Media , year=
PVT v2: Improved baselines with Pyramid Vision Transformer , author=. Computational Visual Media , year=
-
[20]
International Conference on Machine Learning , year=
Training data-efficient image transformers & distillation through attention , author=. International Conference on Machine Learning , year=
-
[21]
2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
Going deeper with Image Transformers , author=. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
work page 2021
-
[22]
2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , author=. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
work page 2021
-
[23]
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Swin Transformer V2: Scaling Up Capacity and Resolution , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2022
-
[24]
European Conference on Computer Vision , year=
DeiT III: Revenge of the ViT , author=. European Conference on Computer Vision , year=
-
[25]
9th International Conference on Learning Representations,
Xizhou Zhu and Weijie Su and Lewei Lu and Bin Li and Xiaogang Wang and Jifeng Dai , title =. 9th International Conference on Learning Representations,
-
[26]
Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel Ni and Heung-Yeung Shum , booktitle=
-
[27]
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , author=. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2021
-
[28]
Neural Information Processing Systems , year=
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers , author=. Neural Information Processing Systems , year=
-
[29]
Neural Information Processing Systems , year=
Per-Pixel Classification is Not All You Need for Semantic Segmentation , author=. Neural Information Processing Systems , year=
-
[30]
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Masked-attention Mask Transformer for Universal Image Segmentation , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2022
-
[31]
Proceedings of the AAAI conference on artificial intelligence , volume=
Head-free lightweight semantic segmentation with linear transformer , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
-
[32]
Advances in Neural Information Processing Systems , volume=
Soft: Softmax-free transformer with linear complexity , author=. Advances in Neural Information Processing Systems , volume=
-
[33]
Neural Information Processing Systems , year=
QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion , author=. Neural Information Processing Systems , year=
-
[34]
Sangho Lee and Hayun Lee and Dongkun Shin , booktitle=. Proxyformer: Nystr
-
[35]
European Conference on Computer Vision , pages=
Agent attention: On the integration of softmax and linear attention , author=. European Conference on Computer Vision , pages=. 2024 , organization=
work page 2024
-
[36]
Bridging the Divide: Reconsidering Softmax and Linear Attention , author=. NeurIPS , year=
-
[37]
Christopher K. I. Williams and Matthias W. Seeger , booktitle=. Using the Nystr
-
[38]
Antoine Chatalic and Nicolas Schreuder and Alessandro Rudi and Lorenzo Rosasco , booktitle=. Nystr
-
[39]
Iterative Methods for Solving Linear Systems , author=. 1997 , publisher=
work page 1997
-
[40]
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
MobileOne: An Improved One millisecond Mobile Backbone , author=. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2023
-
[41]
2009 IEEE Conference on Computer Vision and Pattern Recognition , year=
ImageNet: A large-scale hierarchical image database , author=. 2009 IEEE Conference on Computer Vision and Pattern Recognition , year=
work page 2009
-
[42]
International Conference on Learning Representations , year=
Long Range Arena : A Benchmark for Efficient Transformers , author=. International Conference on Learning Representations , year=
-
[43]
BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...
-
[44]
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
MetaFormer is Actually What You Need for Vision , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2022
-
[45]
International Conference on Learning Representations , year=
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer , author=. International Conference on Learning Representations , year=
-
[46]
International Conference on Learning Representations , year=
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=
-
[47]
Hydra Attention: Efficient Attention with Many Heads , author=. ECCV Workshops , year=
-
[48]
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , author=. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2022
- [49]
-
[50]
Longformer: The Long-Document Transformer
Longformer: The Long-Document Transformer , author=. arXiv:2004.05150 , year=
work page internal anchor Pith review arXiv 2004
-
[51]
International Conference on Learning Representations , year=
Rethinking Attention with Performers , author=. International Conference on Learning Representations , year=
-
[52]
Yunyang Xiong and Zhanpeng Zeng and Rudrasis Chakraborty and Mingxing Tan and Glenn Moo Fung and Yin Li and Vikas Singh , journal=. Nystr. 2021 , volume=
work page 2021
-
[53]
International Conference on Learning Representations , year=
Reformer: The Efficient Transformer , author=. International Conference on Learning Representations , year=
-
[54]
Breaking the Low-Rank Dilemma of Linear Attention , author=. 2025 , booktitle=
work page 2025
-
[55]
Automatic differentiation in PyTorch , author=
-
[56]
Proceedings of the IEEE/CVF International Conference on Computer Vision , year =
Pavan Kumar Anasosalu Vasu and James Gabriel and Jeff Zhu and Oncel Tuzel and Anurag Ranjan , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision , year =
-
[57]
RoFormer: Enhanced transformer with Rotary Position Embedding , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.neucom.2023.127063 , author =
-
[58]
2023 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
FLatten Transformer: Vision Transformer using Focused Linear Attention , author=. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) , year=
work page 2023
-
[59]
International Conference on Machine Learning , year=
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization , author=. International Conference on Machine Learning , year=
-
[60]
International Journal of Computer Vision , volume =
Jiachen Lu and Junge Zhang and Xiatian Zhu and Jianfeng Feng and Tao Xiang and Li Zhang , title =. International Journal of Computer Vision , volume =. 2024 , month = aug, doi =
work page 2024
-
[61]
2017 IEEE International Conference on Computer Vision (ICCV) , year=
Mask R-CNN , author=. 2017 IEEE International Conference on Computer Vision (ICCV) , year=
work page 2017
-
[62]
2017 IEEE International Conference on Computer Vision (ICCV) , year=
Focal Loss for Dense Object Detection , author=. 2017 IEEE International Conference on Computer Vision (ICCV) , year=
work page 2017
-
[63]
Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr
Yifan Chen and Qi Zeng and Heng Ji and Yun Yang , booktitle=. Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr
-
[64]
Revisiting Kernel Attention with Correlated Gaussian Process Representation , author=. ArXiv , year=
-
[65]
Gaussian Kernelized Self-Attention for Long Sequence Data and its Application to CTC-Based Speech Recognition , author=. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=
work page 2021
-
[66]
Demystify Mamba in Vision: A Linear Attention Perspective , author=. NeurIPS , year=
-
[67]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =
Vision Transformer with Super Token Sampling , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =
-
[68]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =
Lei Zhu and Xinjiang Wang and Zhanghan Ke and Wayne Zhang and Rynson Lau , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =
-
[69]
International Conference on Learning Representations , year=
MogaNet: Multi-order Gated Aggregation Network , author=. International Conference on Learning Representations , year=
-
[70]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Neighborhood Attention Transformer , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2023 , pages =
work page 2023
-
[71]
IEEE Transactions on Pattern Analysis and Machine Intelligence , year=
Deep Long-Tailed Learning: A Survey , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.