
arxiv: 2503.23947 · v2 · submitted 2025-03-31 · 💻 cs.CV

Spectral-Adaptive Modulation Networks for Visual Perception

Pith reviewed 2026-05-22 22:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords spectral analysis · graph spectral theory · convolutional networks · self-attention · vision backbone · image classification · object detection · semantic segmentation

The pith

Graph spectral analysis shows window size modulates node connectivity to control spectral filtering in convolution and attention, enabling a SPAM mixer for improved vision backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies graph spectral analysis to place 2D convolution and self-attention inside one theoretical framework and compare their frequency responses. It identifies node connectivity, which changes with window size, as the factor that shapes how each operation passes or suppresses different frequencies. This accounts for observed differences such as convolution favoring high-pass behavior and larger kernels encouraging shape bias. From the finding the authors construct the spectral-adaptive modulation mixer, which combines multi-scale convolutional kernels with a spectral re-scaling step. The resulting SPANetV2 backbone then records higher accuracy than prior models on standard vision benchmarks.

Core claim

Graph spectral analysis in a unified framework demonstrates that node connectivity modulated by window size is the dominant factor shaping the spectral functions of both 2D convolution and self-attention. This relation explains prior empirical observations and directly motivates the spectral-adaptive modulation (SPAM) mixer, which applies multi-scale convolutional kernels together with a spectral re-scaling mechanism to adaptively refine frequency components of visual features. SPANetV2, built from repeated SPAM blocks, produces stronger results than existing state-of-the-art backbones on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
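The link between window size, connectivity, and spectral behavior can be illustrated with a toy model. The sketch below is not the paper's simulation: it assumes a 1D ring of tokens, uniform (mean) aggregation weights, and a hypothetical summary statistic `high_band_gain`. On such a circulant graph the graph Fourier modes coincide with the DFT modes, so the response of neighborhood averaging has a closed form.

```python
import math

def ring_window_response(n, radius):
    """Frequency response of uniform neighbourhood averaging on an n-node ring.

    Assumption: each node is connected to itself and to the `radius` nearest
    nodes on either side, with equal weights. The graph is circulant, so the
    response at frequency index m is a normalized sum of cosines.
    """
    width = 2 * radius + 1
    if width > n:
        raise ValueError("window wider than the ring")
    return [
        sum(math.cos(2 * math.pi * m * j / n) for j in range(-radius, radius + 1)) / width
        for m in range(n // 2 + 1)
    ]

def high_band_gain(response):
    """Mean absolute response over the upper half of the frequency axis."""
    upper = response[len(response) // 2:]
    return sum(abs(h) for h in upper) / len(upper)

if __name__ == "__main__":
    n = 16  # same side length as the 16 x 16 patch in the Figure 1 caption
    for radius in (1, 2, 3, 7):
        gain = high_band_gain(ring_window_response(n, radius))
        print(f"radius={radius} high-band gain={gain:.3f}")
```

In this toy setting, denser connectivity (a larger radius) lowers the high-band gain, which is the qualitative trend the core claim attributes to window size. Learned or random weights, 2D grids, and attention matrices may behave differently.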

What carries the argument

The spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components.
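A rough sketch of the multi-scale part of such a mixer is given below, following the Figure 2 caption (a head split along the feature dimension, then depthwise convolution). It rests on several assumptions: that each head receives its own kernel size, that zero padding is used, and that the function names are hypothetical. The spectral re-scaling step is omitted because its definition is not stated here.

```python
def head_split(channels, num_heads):
    """Evenly partition a list of channel maps into `num_heads` groups."""
    if len(channels) % num_heads:
        raise ValueError("channel count must be divisible by num_heads")
    per_head = len(channels) // num_heads
    return [channels[h * per_head:(h + 1) * per_head] for h in range(num_heads)]

def depthwise_conv2d(channel, kernel):
    """Zero-padded 'same' cross-correlation of one H x W map with an odd k x k kernel."""
    h, w, k = len(channel), len(channel[0]), len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - half, x + j - half
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * channel[yy][xx]
            out[y][x] = acc
    return out

def multi_scale_depthwise(channels, kernels_per_head):
    """Apply per-head kernels, one kernel per channel (assumed pairing).

    kernels_per_head[h][c] is the kernel for channel c of head h; heads may
    use different kernel sizes. Spectral re-scaling (SRF) is not modelled.
    """
    heads = head_split(channels, len(kernels_per_head))
    out = []
    for head, kernels in zip(heads, kernels_per_head):
        if len(head) != len(kernels):
            raise ValueError("need exactly one kernel per channel in each head")
        out.extend(depthwise_conv2d(ch, kr) for ch, kr in zip(head, kernels))
    return out
```

Calling `multi_scale_depthwise(channels, [[k3, k3], [k5, k5]])` on four channel maps would filter the first two with 3 × 3 kernels and the last two with 5 × 5 kernels; the kernel sizes here are placeholders, not the paper's configuration.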

If this is right

  • SPANetV2 records higher top-1 accuracy on ImageNet-1K classification than prior vision backbones.
  • The same backbone improves mean average precision on COCO object detection.
  • SPANetV2 raises mean intersection-over-union on ADE20K semantic segmentation.
  • The SPAM mixer unifies spectral treatment of convolution and attention under one set of design rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same window-size analysis could be used to set attention window sizes in other transformer variants according to target frequency profiles.
  • SPAM-style re-scaling might be inserted into existing efficient-attention or convolution blocks without full architecture redesign.
  • The graph formulation suggests that spectral properties of mixers could be tuned for tasks that require specific frequency emphasis, such as edge detection or texture classification.

Load-bearing premise

Graph spectral analysis inside a single framework correctly captures the frequency responses of convolution and self-attention so that node connectivity modulated by window size can be used to design an improved mixer.

What would settle it

Running SPANetV2 on the reported ImageNet-1K, COCO, and ADE20K benchmarks and finding that its accuracy does not exceed the strongest published baselines, or measuring the empirical frequency response of the SPAM mixer and finding it deviates from the predicted spectral functions.
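One way the second check could be approached is sketched below: take a convolution kernel, place it on a periodic 16 × 16 grid, and read off the magnitude of its 2D DFT. This is an assumption-laden stand-in, since the plain DFT only matches a graph Fourier basis for shift-invariant (circulant) structure, and the helper names `kernel_frequency_response` and `radial_profile` are hypothetical.

```python
import cmath
import math

def kernel_frequency_response(kernel, size=16):
    """Magnitude of the 2D DFT of `kernel` on a periodic size x size grid.

    `kernel` is a square list of lists with odd side length; its centre tap
    sits at the origin, so a centred delta gives a flat response of 1.
    """
    k = len(kernel)
    half = k // 2
    mags = [[0.0] * size for _ in range(size)]
    for u in range(size):
        for v in range(size):
            acc = 0j
            for i in range(k):
                for j in range(k):
                    dy, dx = i - half, j - half
                    phase = -2 * math.pi * (u * dy + v * dx) / size
                    acc += kernel[i][j] * cmath.exp(1j * phase)
            mags[u][v] = abs(acc)
    return mags

def radial_profile(mags):
    """Average magnitude per rounded radial frequency (0 is DC)."""
    size = len(mags)
    sums, counts = {}, {}
    for u in range(size):
        for v in range(size):
            # Fold indices so that frequencies above size/2 wrap around.
            fu, fv = min(u, size - u), min(v, size - v)
            r = round(math.hypot(fu, fv))
            sums[r] = sums.get(r, 0.0) + mags[u][v]
            counts[r] = counts.get(r, 0) + 1
    return [sums[r] / counts[r] for r in sorted(sums)]
```

Feeding trained depthwise kernels through `radial_profile(kernel_frequency_response(kernel))` would give a curve to compare against the predicted spectral functions; how closely the two should agree is a judgment the paper's own analysis would have to supply.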

Figures

Figures reproduced from arXiv: 2503.23947 by Dong Hwan Kim, Guhnoo Yun, Jeongho Lee, Juhan Yoo, Kijung Kim, Paul Hongsuck Seo.

Figure 1: Simulation examples of frequency response. (a)-(c) show responses of 2D Euclidean convolutions with increasing kernel sizes, and (d) shows responses of self-attention. All responses are obtained with random weights. The input patch size is set to 16 × 16, inspired by ViT [5]. As the convolution kernel size increases, the cut-off frequency shifts closer to one, making it behave more like a low-pass filter, …
Figure 2: Overview of the SPAM mixer. The Head Split layer evenly partitions the input along feature dimensions based on the number of heads. DWConv denotes depthwise convolution, while SRF re-scales the spectral components of DWConv’s output. All linear layers preserve input dimensions, except Exp, which doubles the feature dimensions, and Proj, which halves them. … aggregates visual features in a spectral-adaptive m…
Figure 4: Evaluation of models on texture and shape bias. All models are pretrained on ImageNet-1K classification using the same augmentations as the MetaFormer baseline [45]. … them across three vision tasks, addressing both texture and shape bias. Experimental results demonstrate that SPANetV2 outperforms state-of-the-art models based on convolution, self-attention, and FFT in image classification, object detection …
Original abstract

Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript employs graph spectral analysis within a unified framework to theoretically simulate and compare the frequency responses of 2D convolution and self-attention. It identifies node connectivity modulated by window size as a key factor shaping spectral behavior. Leveraging this, the authors introduce the spectral-adaptive modulation (SPAM) mixer, which uses multi-scale convolutional kernels and a spectral re-scaling mechanism. This forms the basis for SPANetV2, a vision backbone that is reported to outperform state-of-the-art models on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.

Significance. If the graph spectral analysis is rigorous and independent, and the reported performance gains are robust and reproducible, the work could supply a principled basis for spectral-adaptive vision architectures. The corroboration of prior empirical observations with a theoretical simulation and the multi-task experimental validation are strengths. No machine-checked proofs or parameter-free derivations are claimed, but the falsifiable performance predictions on standard benchmarks add value.

major comments (2)
  1. [Graph spectral analysis (unified framework)] The abstract states that the unified graph spectral framework reveals node connectivity (modulated by window size) as the key factor, but without the explicit graph construction, frequency-response equations, or isolation of this variable in the analysis section, it is unclear whether the simulation is independent of the subsequent SPAM design choices.
  2. [SPAM mixer and SPANetV2 design] The SPAM mixer is described as using multi-scale kernels and spectral re-scaling to refine components, yet the central performance claim for SPANetV2 requires explicit ablation results isolating the contribution of the spectral re-scaling mechanism versus the multi-scale kernels alone.
minor comments (2)
  1. [Experiments] Ensure all experimental tables include standard deviations or multiple runs to support the outperformance claims across ImageNet-1K, COCO, and ADE20K.
  2. [SPAM mixer] Clarify notation for the spectral re-scaling operation and provide a precise mathematical definition rather than descriptive text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. We address each major comment below with clarifications and planned updates to the manuscript.

  1. Referee: [Graph spectral analysis (unified framework)] The abstract states that the unified graph spectral framework reveals node connectivity (modulated by window size) as the key factor, but without the explicit graph construction, frequency-response equations, or isolation of this variable in the analysis section, it is unclear whether the simulation is independent of the subsequent SPAM design choices.

    Authors: Section 3 presents the unified graph spectral framework independently of the SPAM mixer. We explicitly define the graph construction for 2D convolution (via local window-based adjacency) and self-attention (via global or windowed attention matrices) in 3.1, derive the frequency-response functions from the normalized Laplacian in 3.2, and isolate the effect of node connectivity by varying window size while holding other factors fixed in 3.3. These steps precede the SPAM design in Section 4. We will revise the text to cross-reference these subsections more explicitly from the abstract and add a short paragraph reiterating the separation of analysis from architecture. revision: partial

  2. Referee: [SPAM mixer and SPANetV2 design] The SPAM mixer is described as using multi-scale kernels and spectral re-scaling to refine components, yet the central performance claim for SPANetV2 requires explicit ablation results isolating the contribution of the spectral re-scaling mechanism versus the multi-scale kernels alone.

    Authors: We agree that a direct isolation of the spectral re-scaling contribution strengthens the claims. While Table 5 already ablates multi-scale kernels and the overall SPAM mixer, it does not contain a dedicated row comparing multi-scale kernels alone against the full SPAM (kernels + re-scaling). We will add this targeted ablation to the experiments section in the revision. revision: yes
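For readers unfamiliar with the terms used in the first response, the standard graph signal processing definitions are sketched below. The symbols $A$, $D$, $L$, $U$, $\Lambda$, and $h$ are generic textbook notation and are assumptions here, not necessarily the notation or the exact construction of the paper's Section 3. The formulas also assume a symmetric, non-negative adjacency matrix, which an attention matrix does not satisfy in general.

```latex
% A: weighted adjacency matrix of the token graph, D: diagonal degree matrix
L = I - D^{-1/2} A D^{-1/2}, \qquad
L = U \Lambda U^{\top}, \quad
\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_N), \quad 0 \le \lambda_i \le 2

% A spectral filter h re-weights each graph frequency of a signal x
y = U \, h(\Lambda) \, U^{\top} x, \qquad
h(\Lambda) = \mathrm{diag}\big(h(\lambda_1), \dots, h(\lambda_N)\big)
```

Under these definitions, changing the window size changes $A$ and $D$, and therefore the eigenvalues on which $h$ is evaluated, which is the sense in which connectivity can shape a mixer's frequency response.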

Circularity Check

0 steps flagged

No significant circularity; analysis precedes design independently

The provided abstract and context describe a standard sequence: graph spectral analysis is performed first to simulate frequency responses and identify node connectivity (modulated by window size) as a key factor, corroborating prior empirical findings. The SPAM mixer is then introduced as leveraging that independent insight via multi-scale kernels and re-scaling. No equations, self-citations, or fitted parameters are shown that would make the analysis reduce to the mixer design or vice versa by construction. The derivation chain remains self-contained against external benchmarks, with the theoretical simulation serving as genuine prior content rather than a post-hoc fit or renamed result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review prevents exhaustive enumeration. The central claim rests on the validity of applying graph spectral analysis to model convolution and attention frequency responses and on the effectiveness of the resulting SPAM design.

axioms (1)
  • domain assumption: Graph spectral analysis provides a unified framework that accurately simulates and compares the frequency responses of 2D convolution and self-attention
    Invoked to corroborate prior findings and reveal the role of node connectivity
invented entities (1)
  • SPAM mixer (no independent evidence)
    purpose: Processes visual features in a spectral-adaptive manner using multi-scale kernels and spectral re-scaling
    New component introduced based on the graph-spectral insight

pith-pipeline@v0.9.0 · 5747 in / 1280 out tokens · 66160 ms · 2026-05-22T22:30:31.294605+00:00 · methodology

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 6 internal anchors

  1. [1]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. 1, 2, 7, 8

  2. [2]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. 1, 2

  3. [3]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. 1, 2

  4. [4]

    Going deeper with convolutions,

    C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. 1, 2

  5. [5]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021. 1, 2, 4, 5

  6. [6]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning. PMLR, 2021, pp. 10 347–10 357. 1, 7

  7. [7]

    Cvt: Introducing convolutions to vision transformers,

    H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22–31. 1, 2

  8. [8]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022. 1, 2, 5, 7, 8

  9. [9]

    Scaling local self-attention for parameter efficient visual backbones,

    A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, “Scaling local self-attention for parameter efficient visual backbones,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 894–12 904. 1

  10. [10]

    End-to-end object detection with transformers,

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229. 1

  11. [11]

    Deformable detr: Deformable transformers for end-to-end object detection,

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in ICLR,

  12. [12]

    End-to-end object detection with adaptive clustering transformer,

    M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, and H. Dong, “End-to-end object detection with adaptive clustering transformer,” in arXiv preprint arXiv:2011.09315, 2020. 1

  13. [13]

    Max-deeplab: End-to-end panoptic segmentation with mask transformers,

    H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “Max-deeplab: End-to-end panoptic segmentation with mask transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5463–5474. 1

  14. [14]

    End-to-end video instance segmentation with transformers,

    Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, “End-to-end video instance segmentation with transformers,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2021. 1

  15. [15]

    Augmented transformer with adaptive graph for temporal action proposal generation,

    S. Chang, P. Wang, F. Wang, H. Li, and J. Feng, “Augmented transformer with adaptive graph for temporal action proposal generation,” arXiv preprint arXiv:2103.16024, 2021. 1

  16. [16]

    Video transformer network,

    D. Neimark, O. Bar, M. Zohar, and D. Asselmann, “Video transformer network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163–3172. 1

  17. [17]

    Transformer meets tracker: Exploiting temporal context for robust visual tracking,

    N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1571–1580. 1

  18. [18]

    Future transformer for long-term action anticipation,

    D. Gong, J. Lee, M. Kim, S. J. Ha, and M. Cho, “Future transformer for long-term action anticipation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3052–3061. 1

  19. [19]

    Intriguing properties of vision transformers,

    M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. Shahbaz Khan, and M.-H. Yang, “Intriguing properties of vision transformers,” in NeurIPS, 2021. 1

  20. [20]

    Are convolutional neural networks or transformers more like human vision?

    S. Tuli, I. Dasgupta, E. Grant, and T. L. Griffiths, “Are convolutional neural networks or transformers more like human vision?” arXiv preprint arXiv:2105.07197, 2021. 1, 6

  21. [21]

    Rethinking token-mixing mlp for mlp-based vision backbone,

    T. Yu, X. Li, Y. Cai, M. Sun, and P. Li, “Rethinking token-mixing mlp for mlp-based vision backbone,” arXiv preprint arXiv:2106.14882,

  22. [22]

    Towards robust vision transformer,

    X. Mao, G. Qi, Y. Chen, X. Li, R. Duan, S. Ye, Y. He, and H. Xue, “Towards robust vision transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 042–12 051. 1

  23. [23]

    Twins: Revisiting the design of spatial attention in vision transformers,

    X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” in NeurIPS, 2021. [Online]. Available: https://openreview.net/forum?id=5kTlVBkzSRx 1

  24. [24]

    Tokens-to-token vit: Training vision transformers from scratch on imagenet,

    L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 558–567. 1

  25. [25]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,

    W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 568–

  26. [26]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows,

    X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 124–12 134. 1, 7

  27. [27]

    Mlp-mixer: An all-mlp architecture for vision,

    I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy, “Mlp-mixer: An all-mlp architecture for vision,” in NeurIPS, 2021. 1, 2

  28. [28]

    Vision permutator: A permutable mlp-like architecture for visual recognition,

    Q. Hou, Z. Jiang, L. Yuan, M.-M. Cheng, S. Yan, and J. Feng, “Vision permutator: A permutable mlp-like architecture for visual recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE, 2022. 1, 2

  29. [29]

    Resmlp: Feedforward networks for image classification with data-efficient training,

    H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard, A. Joulin, G. Synnaeve, J. Verbeek et al., “Resmlp: Feedforward networks for image classification with data-efficient training,” in IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE, 2022. 1, 2

  30. [30]

    An image patch is a wave: Phase-aware vision mlp,

    Y. Tang, K. Han, J. Guo, C. Xu, Y. Li, C. Xu, and Y. Wang, “An image patch is a wave: Phase-aware vision mlp,” in CVPR, 2022. 1

  31. [31]

    Cyclemlp: a mlp-like architecture for dense visual predictions,

    S. Chen, E. Xie, C. Ge, R. Chen, D. Liang, and P. Luo, “Cyclemlp: a mlp-like architecture for dense visual predictions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 1

  32. [32]

    FNet: Mixing tokens with Fourier transforms,

    J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon, “FNet: Mixing tokens with Fourier transforms,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 4296–4313. [Online]. Avail...

  33. [33]

    Gfnet: Global filter networks for visual recognition,

    Y. Rao, W. Zhao, Z. Zhu, J. Zhou, and J. Lu, “Gfnet: Global filter networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 960–10 973, 2023. 1, 2, 7

  34. [34]

    Fft-based dynamic token mixer for vision,

    Y. Tatsunami and M. Taki, “Fft-based dynamic token mixer for vision,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 14, 2024, pp. 15 328–15 336. 1, 2, 6, 7, 8, 9

  35. [35]

    Vision gnn: An image is worth graph of nodes,

    K. Han, Y. Wang, J. Guo, Y. Tang, and E. Wu, “Vision gnn: An image is worth graph of nodes,” Advances in neural information processing systems, vol. 35, pp. 8291–8303, 2022. 1, 2

  36. [36]

    Image as set of points,

    X. Ma, Y. Zhou, H. Wang, C. Qin, B. Sun, C. Liu, and Y. Fu, “Image as set of points,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=awnvqZja69 1, 2

  37. [37]

    Clusterformer: Clustering as a universal visual learner,

    J. Liang, Y. Cui, Q. Wang, T. Geng, W. Wang, and D. Liu, “Clusterformer: Clustering as a universal visual learner,” Advances in Neural Information Processing Systems, vol. 36, 2023. 1, 2

  38. [38]

    Resnet strikes back: An improved training procedure in timm,

    R. Wightman, H. Touvron, and H. Jégou, “Resnet strikes back: An improved training procedure in timm,” arXiv preprint arXiv:2110.00476,

  39. [39]

    A convnet for the 2020s,

    Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 976–11 986. 1, 7, 8

  40. [40]

    Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,

    X. Ding, X. Zhang, J. Han, and G. Ding, “Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 963–11 975. 1, 2, 5, 7, 10

  41. [41]

    More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity,

    S. Liu, T. Chen, X. Chen, X. Chen, Q. Xiao, B. Wu, T. Kärkkäinen, M. Pechenizkiy, D. Mocanu, and Z. Wang, “More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity,” in ICLR, 2023. 1, 7, 8, 10

  42. [42]

    Conv2former: A simple transformer-style convnet for visual recognition,

    Q. Hou, C.-Z. Lu, M.-M. Cheng, and J. Feng, “Conv2former: A simple transformer-style convnet for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, 2024. 1, 5, 7, 8

  43. [43]

    Hornet: Efficient high-order spatial interactions with recursive gated convolutions,

    Y. Rao, W. Zhao, Y. Tang, J. Zhou, S. N. Lim, and J. Lu, “Hornet: Efficient high-order spatial interactions with recursive gated convolutions,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 353–10 366, 2022. 1

  44. [44]

    Internimage: Exploring large-scale vision foundation models with deformable convolutions,

    W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu, H. Li et al., “Internimage: Exploring large-scale vision foundation models with deformable convolutions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 408–14 419. 1

  45. [45]

    Metaformer baselines for vision,

    W. Yu, C. Si, P. Zhou, M. Luo, Y. Zhou, J. Feng, S. Yan, and X. Wang, “Metaformer baselines for vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 2, 6, 7, 8, 9, 10

  46. [46]

    Uniformer: Unifying convolution and self-attention for visual recognition,

    K. Li, Y. Wang, J. Zhang, P. Gao, G. Song, Y. Liu, H. Li, and Y. Qiao, “Uniformer: Unifying convolution and self-attention for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12 581–12 600, 2023. 1, 7

  47. [47]

    Coatnet: Marrying convolution and attention for all data sizes,

    Z. Dai, H. Liu, Q. V. Le, and M. Tan, “Coatnet: Marrying convolution and attention for all data sizes,” Advances in neural information processing systems, vol. 34, pp. 3965–3977, 2021. 1

  48. [48]

    A battle of network structures: An empirical study of cnn, transformer, and mlp,

    Y. Zhao, G. Wang, C. Tang, C. Luo, W. Zeng, and Z.-J. Zha, “A battle of network structures: An empirical study of cnn, transformer, and mlp,” arXiv preprint arXiv:2108.13002, 2021. 1

  49. [49]

    Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond,

    Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond,” International Journal of Computer Vision, vol. 131, no. 5, pp. 1141–1162, 2023. 1

  50. [50]

    Fast vision transformers with hilo attention,

    Z. Pan, J. Cai, and B. Zhuang, “Fast vision transformers with hilo attention,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 541–14 554, 2022. 1, 2

  51. [51]

    How do vision transformers work?

    N. Park and S. Kim, “How do vision transformers work?” in International Conference on Learning Representations, 2022. 1, 2, 5

  52. [52]

    Improving vision transformers by revisiting high-frequency components,

    J. Bai, L. Yuan, S.-T. Xia, S. Yan, Z. Li, and W. Liu, “Improving vision transformers by revisiting high-frequency components,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18. 1, 2, 5

  53. [53]

    Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice,

    P. Wang, W. Zheng, T. Chen, and Z. Wang, “Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=O476oWmiNNp 1, 2, 5

  54. [54]

    High-frequency component helps explain the generalization of convolutional neural networks,

    H. Wang, X. Wu, Z. Huang, and E. P. Xing, “High-frequency component helps explain the generalization of convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8684–8694. 1, 2, 5, 9

  55. [55]

    Vtc-lfc: Vision transformer compression with low-frequency components,

    Z. Wang, H. Luo, P. Wang, F. Ding, F. Wang, and H. Li, “Vtc-lfc: Vision transformer compression with low-frequency components,” Advances in Neural Information Processing Systems, vol. 35, pp. 13 974–13 988,

  56. [56]

    Revealing the dark secrets of extremely large kernel convnets on robustness,

    H. Chen, Y. Zhang, X. Feng, X. Chu, and K. Huang, “Revealing the dark secrets of extremely large kernel convnets on robustness,” in International Conference on Machine Learning. PMLR, 2024, pp. 7687–7699. 1, 2

  57. [57]

    Spanet: Frequency-balancing token mixer using spectral pooling aggregation modulation,

    G. Yun, J. Yoo, K. Kim, J. Lee, and D. H. Kim, “Spanet: Frequency-balancing token mixer using spectral pooling aggregation modulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6113–6124. 2, 5, 7, 8

  58. [58]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255. 2, 7, 8

  59. [59]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755. 2, 7

  60. [60]

    Scene parsing through ade20k dataset,

    B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–

  61. [61]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, vol. 30, 2017. 2

  62. [62]

    Rethinking and improving relative position encoding for vision transformer,

    K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao, “Rethinking and improving relative position encoding for vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 033–10 041. 2

  63. [63]

    Blending anti-aliasing into vision transformer,

    S. Qian, H. Shao, Y. Zhu, M. Li, and J. Jia, “Blending anti-aliasing into vision transformer,” in NeurIPS, 2021. 2

  64. [64]

    Convit: Improving vision transformers with soft convolutional inductive biases,

    S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” in International Conference on Machine Learning. PMLR, 2021, pp. 2286–2296. 2

  65. [65]

    Cmt: Convolutional neural networks meet vision transformers,

    J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, “Cmt: Convolutional neural networks meet vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12175–12185. 2

  66. [66]

    CycleMLP: A MLP-like architecture for dense prediction,

    S. Chen, E. Xie, C. Ge, R. Chen, D. Liang, and P. Luo, “CycleMLP: A MLP-like architecture for dense prediction,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=NMEceG4v69Y 2

  67. [67]

    Sparse and continuous attention mechanisms,

    A. Martins, A. Farinhas, M. Treviso, V. Niculae, P. Aguiar, and M. Figueiredo, “Sparse and continuous attention mechanisms,” in NeurIPS, 2020. 2

  68. [68]

    ∞-former: Infinite memory transformer,

    P. H. Martins, Z. Marinho, and A. F. Martins, “∞-former: Infinite memory transformer,” in Proc. ACL, 2022. 2

  69. [69]

    When shift operation meets vision transformer: An extremely simple alternative to attention mechanism,

    G. Wang, Y. Zhao, C. Tang, C. Luo, and W. Zeng, “When shift operation meets vision transformer: An extremely simple alternative to attention mechanism,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2423–2430. 2

  70. [70]

    Tsm: Temporal shift module for efficient video understanding,

    J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7083–7093. 2

  71. [71]

    Metaformer is actually what you need for vision,

    W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10819–10829. 2, 6, 8

  72. [72]

    Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution,

    Y. Chen, H. Fan, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng, “Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 3435–3444. 2, 9

  73. [73]

    The fast fourier transform and its applications,

    J. W. Cooley, P. A. Lewis, and P. D. Welch, “The fast fourier transform and its applications,” IEEE Transactions on Education, vol. 12, no. 1, pp. 27–34, 1969. 2, 9

  74. [74]

    An adaptive gaussian filter for noise reduction and edge detection,

    G. Deng and L. Cahill, “An adaptive gaussian filter for noise reduction and edge detection,” in 1993 IEEE conference record nuclear science symposium and medical imaging conference. IEEE, 1993, pp. 1615–

  75. [75]

    Adaptive frequency filters as efficient global token mixers,

    Z. Huang, Z. Zhang, C. Lan, Z.-J. Zha, Y. Lu, and B. Guo, “Adaptive frequency filters as efficient global token mixers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6049–6059. 2

  76. [76]

    The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,

    D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE signal processing magazine, vol. 30, no. 3, pp. 83–98,

  77. [77]

    Analyzing the expressive power of graph neural networks in a spectral perspective,

    M. Balcilar, G. Renton, P. Héroux, B. Gaüzère, S. Adam, and P. Honeine, “Analyzing the expressive power of graph neural networks in a spectral perspective,” in International Conference on Learning Representations, 2021. 2, 3, 5

  78. [78]

    F. R. Chung, Spectral graph theory. American Mathematical Soc., 1997, vol. 92. 3

  79. [79]

    Sparse convolutional neural networks,

    B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 806–814. 3

  80. [80]

    Accelerating sparse convolution with column vector-wise sparsity,

    Y. Tan, K. Han, K. Zhao, X. Yu, Z. Du, Y. Chen, Y. Wang, and J. Yao, “Accelerating sparse convolution with column vector-wise sparsity,” Advances in Neural Information Processing Systems, vol. 35, pp. 30307–30317, 2022. 3

Showing first 80 references.