pith. machine review for the scientific record.

arxiv: 2603.09138 · v2 · submitted 2026-03-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Rotation Equivariant Mamba for Vision Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords rotation equivariance · visual mamba · state space models · equivariant networks · image classification · semantic segmentation · image super-resolution

The pith

EQ-VMamba adds a rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance in visual state-space models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current visual Mamba networks ignore rotational symmetry and therefore remain sensitive to image rotations. It introduces EQ-VMamba, which replaces the standard scan with a rotation-equivariant cross-scan and organizes Mamba blocks into groups that respect the same symmetry. Theoretical analysis of the equivariance error demonstrates that these changes keep the entire network equivariant. Experiments on classification, segmentation, and super-resolution show higher accuracy under rotations together with roughly 50 percent fewer parameters than non-equivariant baselines.

Core claim

By combining a rotation-equivariant cross-scan strategy with group Mamba blocks, EQ-VMamba enforces end-to-end rotation equivariance throughout the network. Rigorous analysis of the intrinsic equivariance error confirms that the architecture preserves this property, while empirical results across high-, mid-, and low-level vision tasks demonstrate both improved rotation robustness and better parameter efficiency.

What carries the argument

The rotation-equivariant cross-scan strategy and the group Mamba blocks, which jointly propagate rotational symmetry through the selective state-space computation.
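For intuition, here is a minimal sketch of what a C4 (multiples of 90 degrees) equivariant cross-scan could look like. The function name, tensor layout, and restriction to four rotations are illustrative assumptions, not the paper's implementation (the real EQ-cross-scan is defined in §3.2 and the released code):

```python
import torch

def c4_cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical C4-equivariant cross-scan sketch.

    x: (B, C, H, W) feature map. For each rotation g in {0, 90, 180, 270}
    degrees, rotate the grid by g and flatten it row-major into a 1D
    sequence. Rotating the input then cyclically permutes these four
    sequences among themselves instead of producing unrelated orderings,
    which is the property an equivariant scan needs to propagate symmetry
    into the state-space recursion.
    """
    seqs = [torch.rot90(x, k, dims=(-2, -1)).flatten(-2, -1)  # (B, C, H*W)
            for k in range(4)]
    return torch.stack(seqs, dim=1)  # (B, |G| = 4, C, L)

x = torch.randn(2, 8, 16, 16)
print(c4_cross_scan(x).shape)  # torch.Size([2, 4, 8, 256])
```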

If this is right

  • Vision Mamba models become robust to arbitrary rotations without relying on data augmentation.
  • Parameter count drops by approximately 50 percent while accuracy on classification, segmentation, and super-resolution either rises or stays competitive.
  • End-to-end equivariance holds across the full depth of the network rather than only in early layers.
  • The same geometric prior improves cross-task generalization on rotated inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scan-and-group pattern could be applied to other discrete symmetries such as reflections.
  • Equivariance may reduce the data volume needed to train Mamba vision models to a given accuracy level.
  • The efficiency gain opens the possibility of running rotation-robust Mamba models on edge hardware.

Load-bearing premise

The rotation-equivariant cross-scan and group blocks can be implemented without violating the selective state-space assumptions that give Mamba its efficiency.

What would settle it

Rotate an input image by 90 degrees; if the network output does not match the correspondingly rotated output of the original image within the theoretically bounded equivariance error, the claim is false.
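A minimal sketch of that test, assuming a model whose output lives on the same grid as its input and a relative-error tolerance eps standing in for the paper's theoretical bound (both assumptions ours):

```python
import torch

@torch.no_grad()
def passes_equivariance_test(model, x: torch.Tensor, eps: float = 1e-5) -> bool:
    """Check || g^{-1}.f(g.x) - f(x) || / || f(x) || <= eps for g = 90 deg.

    For an exactly equivariant network the gap should be floating-point
    noise; the paper instead bounds it by an intrinsic equivariance error.
    """
    y = model(x)                                     # f(x)
    y_rot = model(torch.rot90(x, 1, dims=(-2, -1)))  # f(g.x)
    y_back = torch.rot90(y_rot, -1, dims=(-2, -1))   # g^{-1}.f(g.x)
    gap = (y_back - y).norm() / y.norm()
    return gap.item() <= eps
```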

Figures

Figures reproduced from arXiv: 2603.09138 by Deyu Meng, Keyu Huang, Lei Zhang, Qi Xie, Zhongchen Zhao, Zongben Xu.

Figure 1. (a)-(b) Visualization of output feature maps for an input image and its […]

Figure 2. Illustration of the group dimension (indexed by colors), and three […]

Figure 3. Overall architecture of the proposed end-to-end rotation equivariant visual Mamba (EQ-VMamba). The framework mainly comprises: (a) an EQ-patch […]

Figure 4. Illustration of the rotation equivariant patch embedding. A spatial […]

Figure 5. Comparison between the non-equivariant cross-scan and the proposed equivariant EQ-cross-scan. (a) Under a 90 […]

Figure 6. Upper: Architectural pipeline of the proposed equivariant Visual State-Space (EQ-VSS) block. (1) The block first employs EQ-Linear layers to generate input-dependent Mamba parameters A, B, and C. (2) These parameters, along with the input feature map, are partitioned along the group dimension and flattened into four 1D sequences via EQ-cross-scan. (3) Each feature sequence is processed in parallel by the g…

Figure 7. Robustness comparison of VMamba-T, Spectral VMamba-T, and […]

Figure 9. Robustness comparison of VMamba and EQ-VMamba on the rotated […]

Figure 10. Visual comparison of image super-resolution results between MambaIR and EQ-MambaIR on Urban100 and Manga109 datasets.
Original abstract

Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks -- including high-level image classification, mid-level semantic segmentation, and low-level image super-resolution -- demonstrate that EQ-VMamba consistently improves rotation robustness and achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EQ-VMamba, the first rotation-equivariant visual Mamba architecture. It proposes a rotation-equivariant cross-scan strategy together with group Mamba blocks, supplies a theoretical analysis asserting that the design enforces zero intrinsic equivariance error for end-to-end rotation equivariance, and reports consistent gains in rotation robustness plus competitive or superior accuracy on image classification, semantic segmentation, and super-resolution tasks while using roughly 50% fewer parameters than non-equivariant baselines.

Significance. If the central theoretical claim is correct and the selective SSM parameters remain compatible with the group action, the work would constitute a meaningful step toward embedding geometric priors into efficient state-space models for vision, simultaneously improving rotation robustness and parameter efficiency.

major comments (2)
  1. [§4] Theorem 1 and the surrounding derivation (Theoretical Analysis): the proof of zero intrinsic equivariance error assumes that the input-dependent selection parameters B, C, and Δ commute with the rotation group action induced by the cross-scan reordering. The architecture description in §3.2 does not specify an equivariant mechanism for computing these parameters from the reordered sequence; if selection occurs before the group action is applied to the states, the hidden-state transitions will not in general commute with rotation, and the derived error bound would capture only discretization or ordering artifacts rather than the full mismatch (the required commutation condition is written out after this list).
  2. [§5.2] Equivariance Error Evaluation: the reported near-zero end-to-end equivariance errors are presented without an explicit protocol for how the data-dependent selection is handled during the rotated forward passes. It is therefore impossible to determine whether the measured errors already include the effect of non-equivariant selection or only reflect scan-order and discretization contributions.
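To spell out the commutation requirement behind major comment 1, here is one way to state it in symbols; the notation π_g for the induced group action is ours, not the paper's:

```latex
% Equivariance of the data-dependent selection: for every g in the
% rotation group G, with \pi_g the action induced on feature maps by
% the equivariant cross-scan reordering,
\[
  B(\pi_g x) = \pi_g B(x), \qquad
  C(\pi_g x) = \pi_g C(x), \qquad
  \Delta(\pi_g x) = \pi_g \Delta(x).
\]
% Only under this condition does the discretized state recursion
\[
  h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
\]
% commute with \pi_g end to end; otherwise the derived bound captures
% only scan-order and discretization effects, as the referee notes.
```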
minor comments (2)
  1. [§3.2] Eq. (7) and the subsequent group-action notation: the precise representation of how the hidden state is transformed under the group element is not stated explicitly; adding a short sentence clarifying the action on the SSM state would remove ambiguity.
  2. [Table 3] Caption: the phrase 'parameter-free' is used for the equivariant variant, yet the model still contains learned projection weights; a brief clarification that the phrase refers only to the absence of rotation-specific augmentation parameters would prevent misreading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on the theoretical analysis and evaluation protocol. These points help clarify the assumptions underlying our zero-error claim and the measurement of equivariance. We address each major comment below and will revise the manuscript to improve explicitness without altering the core contributions.

Point-by-point responses
  1. Referee: [§4] Theorem 1 and the surrounding derivation (Theoretical Analysis): the proof of zero intrinsic equivariance error assumes that the input-dependent selection parameters B, C, and Δ commute with the rotation group action induced by the cross-scan reordering. The architecture description in §3.2 does not specify an equivariant mechanism for computing these parameters from the reordered sequence; if selection occurs before the group action is applied to the states, the hidden-state transitions will not in general commute with rotation, and the derived error bound would capture only discretization or ordering artifacts rather than the full mismatch.

    Authors: We acknowledge that §3.2 could more explicitly describe how the data-dependent parameters are obtained. In the EQ-VMamba design, the rotation-equivariant cross-scan reorders the feature map according to the group element, after which B, C, and Δ are computed from the reordered sequence using the same linear projections and activations applied uniformly across group elements. Because the projections share weights and operate on the group-transformed features, the resulting parameters transform consistently under the group action, ensuring they commute with the subsequent state transitions by construction. Nevertheless, to eliminate any ambiguity, we will revise §3.2 to include a formal statement of this equivariant parameter computation and update the proof sketch in Theorem 1 to reference it directly. This is a clarification rather than a change to the architecture or results (a minimal sketch of this shared-projection construction follows the point-by-point list). revision: yes

  2. Referee: [§5.2] Equivariance Error Evaluation: the reported near-zero end-to-end equivariance errors are presented without an explicit protocol for how the data-dependent selection is handled during the rotated forward passes. It is therefore impossible to determine whether the measured errors already include the effect of non-equivariant selection or only reflect scan-order and discretization contributions.

    Authors: We agree that the measurement protocol merits explicit documentation. The reported errors are obtained by (i) rotating the input image by a group element g, (ii) running the complete forward pass of EQ-VMamba on the rotated input (so that selection parameters B, C, Δ are computed from the rotated features via the equivariant cross-scan), (iii) applying the inverse rotation g^{-1} to the output, and (iv) measuring the discrepancy with the unrotated output. Because the entire pipeline, including selection, is constructed to be equivariant, the measured near-zero errors already incorporate the data-dependent selection under rotation. We will add a precise description of this protocol, together with pseudocode, to §5.2 (the second sketch following this list renders the protocol as code). revision: yes
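The construction claimed in response 1, sketched minimally in Python. The class name, tensor shapes, and the softplus on Δ are our assumptions patterned on standard Mamba parameterizations, not the paper's EQ-Linear layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSelectionProj(nn.Module):
    """Hypothetical group-shared projection producing B, C, and Delta.

    A single Linear layer is applied identically to every group copy of
    the already-reordered sequence, so permuting the group axis of the
    input permutes the outputs the same way: the selection parameters
    transform with the group action by construction.
    """

    def __init__(self, dim: int, state_dim: int):
        super().__init__()
        self.state_dim = state_dim
        self.proj = nn.Linear(dim, 2 * state_dim + 1)  # -> [B | C | Delta]

    def forward(self, seqs: torch.Tensor):
        # seqs: (batch, |G|, length, dim) from the equivariant cross-scan
        out = self.proj(seqs)  # same weights applied for every g
        B, C, delta = torch.split(
            out, [self.state_dim, self.state_dim, 1], dim=-1)
        return B, C, F.softplus(delta)  # Delta > 0, as in Mamba
```

And a minimal rendering of the measurement protocol in response 2, steps (i)-(iv), assuming C4 rotations and a relative-error metric (the authors' promised pseudocode may differ):

```python
import torch

@torch.no_grad()
def end_to_end_equivariance_error(model, x: torch.Tensor) -> dict:
    """Rebuttal protocol: (i) rotate the input, (ii) run the full forward
    pass so selection is computed from rotated features, (iii) apply the
    inverse rotation to the output, (iv) measure the discrepancy."""
    y_ref = model(x)  # unrotated reference output
    errors = {}
    for k in (1, 2, 3):                                # g = 90, 180, 270 deg
        x_rot = torch.rot90(x, k, dims=(-2, -1))       # (i)
        y_rot = model(x_rot)                           # (ii)
        y_inv = torch.rot90(y_rot, -k, dims=(-2, -1))  # (iii)
        errors[90 * k] = ((y_inv - y_ref).norm()       # (iv)
                          / y_ref.norm()).item()
    return errors
```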

Circularity Check

0 steps flagged

No circularity: equivariance derived from explicit architectural components

Full rationale

The paper's central claim rests on a theoretical analysis of intrinsic equivariance error that is tied directly to the introduced rotation-equivariant cross-scan strategy and group Mamba blocks. No step reduces a prediction to a fitted parameter, renames a known result, or relies on a load-bearing self-citation whose content is itself unverified. The derivation is presented as following from the design choices (equivariant scanning and group actions on states), which is the standard non-circular route for equivariant architectures. The skeptic concern about input-dependent selection parameters is a potential correctness or completeness issue for the proof, not evidence that the stated analysis collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that rotation symmetry is a universal image prior and on the technical assumption that the proposed cross-scan and group blocks preserve Mamba's selective dynamics while enforcing equivariance. No new free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Rotation symmetry constitutes a universal and fundamental geometric prior in images.
    Explicitly stated in the abstract as the motivation for the architecture.

pith-pipeline@v0.9.0 · 5573 in / 1220 out tokens · 49348 ms · 2026-05-15T13:40:59.415928+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.

supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.

extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.

uses: The paper appears to rely on the theorem as machinery.

contradicts: The paper's claim conflicts with a theorem or certificate in the canon.

unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
