pith. machine review for the scientific record.

arxiv: 2603.09138 · v2 · submitted 2026-03-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Rotation Equivariant Mamba for Vision Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords rotation equivariance · visual mamba · state space models · equivariant networks · image classification · semantic segmentation · image super-resolution

The pith

EQ-VMamba adds a rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance in visual state-space models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current visual Mamba networks ignore rotational symmetry and therefore remain sensitive to image rotations. It introduces EQ-VMamba, which replaces the standard scan with a rotation-equivariant cross-scan and organizes Mamba blocks into groups that respect the same symmetry. Theoretical analysis of the equivariance error demonstrates that these changes keep the entire network equivariant. Experiments on classification, segmentation, and super-resolution show higher accuracy under rotations together with roughly 50 percent fewer parameters than non-equivariant baselines.

Core claim

By combining a rotation-equivariant cross-scan strategy with group Mamba blocks, EQ-VMamba enforces end-to-end rotation equivariance throughout the network. Rigorous analysis of the intrinsic equivariance error confirms that the architecture preserves this property, while empirical results across high-, mid-, and low-level vision tasks demonstrate both improved rotation robustness and better parameter efficiency.

What carries the argument

The rotation-equivariant cross-scan strategy and the group Mamba blocks, which jointly propagate rotational symmetry through the selective state-space computation.
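For intuition, here is a minimal sketch of what a C4 (multiples of 90 degrees) equivariant cross-scan could look like. The function name, tensor layout, and restriction to four rotations are illustrative assumptions, not the paper's implementation (the real EQ-cross-scan is defined in §3.2 and the released code):

```python
import torch

def c4_cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical C4-equivariant cross-scan sketch.

    x: (B, C, H, W) feature map. For each rotation g in {0, 90, 180, 270}
    degrees, rotate the grid by g and flatten it row-major into a 1D
    sequence. Rotating the input then cyclically permutes these four
    sequences among themselves instead of producing unrelated orderings,
    which is the property an equivariant scan needs to propagate symmetry
    into the state-space recursion.
    """
    seqs = [torch.rot90(x, k, dims=(-2, -1)).flatten(-2, -1)  # (B, C, H*W)
            for k in range(4)]
    return torch.stack(seqs, dim=1)  # (B, |G| = 4, C, L)

x = torch.randn(2, 8, 16, 16)
print(c4_cross_scan(x).shape)  # torch.Size([2, 4, 8, 256])
```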

If this is right

  • Vision Mamba models become robust to arbitrary rotations without relying on data augmentation.
  • Parameter count drops by approximately 50 percent while accuracy on classification, segmentation, and super-resolution either rises or stays competitive.
  • End-to-end equivariance holds across the full depth of the network rather than only in early layers.
  • The same geometric prior improves cross-task generalization on rotated inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scan-and-group pattern could be applied to other discrete symmetries such as reflections.
  • Equivariance may reduce the data volume needed to train Mamba vision models to a given accuracy level.
  • The efficiency gain opens the possibility of running rotation-robust Mamba models on edge hardware.

Load-bearing premise

The rotation-equivariant cross-scan and group blocks can be implemented without violating the selective state-space assumptions that give Mamba its efficiency.

What would settle it

Rotate an input image by 90 degrees; if the network output does not match the correspondingly rotated output of the original image within the theoretically bounded equivariance error, the claim is false.
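A minimal sketch of that test, assuming a model whose output lives on the same grid as its input and a relative-error tolerance eps standing in for the paper's theoretical bound (both assumptions ours):

```python
import torch

@torch.no_grad()
def passes_equivariance_test(model, x: torch.Tensor, eps: float = 1e-5) -> bool:
    """Check || g^{-1}.f(g.x) - f(x) || / || f(x) || <= eps for g = 90 deg.

    For an exactly equivariant network the gap should be floating-point
    noise; the paper instead bounds it by an intrinsic equivariance error.
    """
    y = model(x)                                     # f(x)
    y_rot = model(torch.rot90(x, 1, dims=(-2, -1)))  # f(g.x)
    y_back = torch.rot90(y_rot, -1, dims=(-2, -1))   # g^{-1}.f(g.x)
    gap = (y_back - y).norm() / y.norm()
    return gap.item() <= eps
```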

Figures

Figures reproduced from arXiv: 2603.09138 by Deyu Meng, Keyu Huang, Lei Zhang, Qi Xie, Zhongchen Zhao, Zongben Xu.

Figure 1. (a)-(b) Visualization of output feature maps for an input image and its […]

Figure 2. Illustration of the group dimension (indexed by colors), and three […]

Figure 3. Overall architecture of the proposed end-to-end rotation equivariant visual Mamba (EQ-VMamba). The framework mainly comprises: (a) an EQ-patch […]

Figure 4. Illustration of the rotation equivariant patch embedding. A spatial […]

Figure 5. Comparison between the non-equivariant cross-scan and the proposed equivariant EQ-cross-scan. (a) Under a 90 […]

Figure 6. Upper: Architectural pipeline of the proposed equivariant Visual State-Space (EQ-VSS) block. (1) The block first employs EQ-Linear layers to generate input-dependent Mamba parameters A, B, and C. (2) These parameters, along with the input feature map, are partitioned along the group dimension and flattened into four 1D sequences via EQ-cross-scan. (3) Each feature sequence is processed in parallel by the g…

Figure 7. Robustness comparison of VMamba-T, Spectral VMamba-T, and […]

Figure 9. Robustness comparison of VMamba and EQ-VMamba on the rotated […]

Figure 10. Visual comparison of image super-resolution results between MambaIR and EQ-MambaIR on Urban100 and Manga109 datasets.
Original abstract

Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks -- including high-level image classification, mid-level semantic segmentation, and low-level image super-resolution -- demonstrate that EQ-VMamba consistently improves rotation robustness and achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EQ-VMamba, the first rotation-equivariant visual Mamba architecture. It proposes a rotation-equivariant cross-scan strategy together with group Mamba blocks, supplies a theoretical analysis asserting that the design enforces zero intrinsic equivariance error for end-to-end rotation equivariance, and reports consistent gains in rotation robustness plus competitive or superior accuracy on image classification, semantic segmentation, and super-resolution tasks while using roughly 50% fewer parameters than non-equivariant baselines.

Significance. If the central theoretical claim is correct and the selective SSM parameters remain compatible with the group action, the work would constitute a meaningful step toward embedding geometric priors into efficient state-space models for vision, simultaneously improving rotation robustness and parameter efficiency.

major comments (2)
  1. [§4] Theorem 1 and the surrounding derivation (Theoretical Analysis): the proof of zero intrinsic equivariance error assumes that the input-dependent selection parameters B, C, and Δ commute with the rotation group action induced by the cross-scan reordering. The architecture description in §3.2 does not specify an equivariant mechanism for computing these parameters from the reordered sequence; if selection occurs before the group action is applied to the states, the hidden-state transitions will not in general commute with rotation, and the derived error bound would capture only discretization or ordering artifacts rather than the full mismatch (the required commutation condition is written out after this list).
  2. [§5.2] Equivariance Error Evaluation: the reported near-zero end-to-end equivariance errors are presented without an explicit protocol for how the data-dependent selection is handled during the rotated forward passes. It is therefore impossible to determine whether the measured errors already include the effect of non-equivariant selection or only reflect scan-order and discretization contributions.
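To spell out the commutation requirement behind major comment 1, here is one way to state it in symbols; the notation π_g for the induced group action is ours, not the paper's:

```latex
% Equivariance of the data-dependent selection: for every g in the
% rotation group G, with \pi_g the action induced on feature maps by
% the equivariant cross-scan reordering,
\[
  B(\pi_g x) = \pi_g B(x), \qquad
  C(\pi_g x) = \pi_g C(x), \qquad
  \Delta(\pi_g x) = \pi_g \Delta(x).
\]
% Only under this condition does the discretized state recursion
\[
  h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
\]
% commute with \pi_g end to end; otherwise the derived bound captures
% only scan-order and discretization effects, as the referee notes.
```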
minor comments (2)
  1. [§3.2] Eq. (7) and the subsequent group-action notation: the precise representation of how the hidden state is transformed under the group element is not stated explicitly; adding a short sentence clarifying the action on the SSM state would remove ambiguity.
  2. [Table 3] Caption: the phrase 'parameter-free' is used for the equivariant variant, yet the model still contains learned projection weights; a brief clarification that the phrase refers only to the absence of rotation-specific augmentation parameters would prevent misreading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on the theoretical analysis and evaluation protocol. These points help clarify the assumptions underlying our zero-error claim and the measurement of equivariance. We address each major comment below and will revise the manuscript to improve explicitness without altering the core contributions.

Point-by-point responses
  1. Referee: [§4] Theorem 1 and the surrounding derivation (Theoretical Analysis): the proof of zero intrinsic equivariance error assumes that the input-dependent selection parameters B, C, and Δ commute with the rotation group action induced by the cross-scan reordering. The architecture description in §3.2 does not specify an equivariant mechanism for computing these parameters from the reordered sequence; if selection occurs before the group action is applied to the states, the hidden-state transitions will not in general commute with rotation, and the derived error bound would capture only discretization or ordering artifacts rather than the full mismatch.

    Authors: We acknowledge that §3.2 could more explicitly describe how the data-dependent parameters are obtained. In the EQ-VMamba design, the rotation-equivariant cross-scan reorders the feature map according to the group element, after which B, C, and Δ are computed from the reordered sequence using the same linear projections and activations applied uniformly across group elements. Because the projections share weights and operate on the group-transformed features, the resulting parameters transform consistently under the group action, ensuring they commute with the subsequent state transitions by construction. Nevertheless, to eliminate any ambiguity, we will revise §3.2 to include a formal statement of this equivariant parameter computation and update the proof sketch in Theorem 1 to reference it directly. This is a clarification rather than a change to the architecture or results (a minimal sketch of this shared-projection construction follows the point-by-point list). revision: yes

  2. Referee: [§5.2] Equivariance Error Evaluation: the reported near-zero end-to-end equivariance errors are presented without an explicit protocol for how the data-dependent selection is handled during the rotated forward passes. It is therefore impossible to determine whether the measured errors already include the effect of non-equivariant selection or only reflect scan-order and discretization contributions.

    Authors: We agree that the measurement protocol merits explicit documentation. The reported errors are obtained by (i) rotating the input image by a group element g, (ii) running the complete forward pass of EQ-VMamba on the rotated input (so that selection parameters B, C, Δ are computed from the rotated features via the equivariant cross-scan), (iii) applying the inverse rotation g^{-1} to the output, and (iv) measuring the discrepancy with the unrotated output. Because the entire pipeline, including selection, is constructed to be equivariant, the measured near-zero errors already incorporate the data-dependent selection under rotation. We will add a precise description of this protocol, together with pseudocode, to §5.2 (the second sketch following this list renders the protocol as code). revision: yes
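The construction claimed in response 1, sketched minimally in Python. The class name, tensor shapes, and the softplus on Δ are our assumptions patterned on standard Mamba parameterizations, not the paper's EQ-Linear layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSelectionProj(nn.Module):
    """Hypothetical group-shared projection producing B, C, and Delta.

    A single Linear layer is applied identically to every group copy of
    the already-reordered sequence, so permuting the group axis of the
    input permutes the outputs the same way: the selection parameters
    transform with the group action by construction.
    """

    def __init__(self, dim: int, state_dim: int):
        super().__init__()
        self.state_dim = state_dim
        self.proj = nn.Linear(dim, 2 * state_dim + 1)  # -> [B | C | Delta]

    def forward(self, seqs: torch.Tensor):
        # seqs: (batch, |G|, length, dim) from the equivariant cross-scan
        out = self.proj(seqs)  # same weights applied for every g
        B, C, delta = torch.split(
            out, [self.state_dim, self.state_dim, 1], dim=-1)
        return B, C, F.softplus(delta)  # Delta > 0, as in Mamba
```

And a minimal rendering of the measurement protocol in response 2, steps (i)-(iv), assuming C4 rotations and a relative-error metric (the authors' promised pseudocode may differ):

```python
import torch

@torch.no_grad()
def end_to_end_equivariance_error(model, x: torch.Tensor) -> dict:
    """Rebuttal protocol: (i) rotate the input, (ii) run the full forward
    pass so selection is computed from rotated features, (iii) apply the
    inverse rotation to the output, (iv) measure the discrepancy."""
    y_ref = model(x)  # unrotated reference output
    errors = {}
    for k in (1, 2, 3):                                # g = 90, 180, 270 deg
        x_rot = torch.rot90(x, k, dims=(-2, -1))       # (i)
        y_rot = model(x_rot)                           # (ii)
        y_inv = torch.rot90(y_rot, -k, dims=(-2, -1))  # (iii)
        errors[90 * k] = ((y_inv - y_ref).norm()       # (iv)
                          / y_ref.norm()).item()
    return errors
```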

Circularity Check

0 steps flagged

No circularity: equivariance derived from explicit architectural components

Full rationale

The paper's central claim rests on a theoretical analysis of intrinsic equivariance error that is tied directly to the introduced rotation-equivariant cross-scan strategy and group Mamba blocks. No step reduces a prediction to a fitted parameter, renames a known result, or relies on a load-bearing self-citation whose content is itself unverified. The derivation is presented as following from the design choices (equivariant scanning and group actions on states), which is the standard non-circular route for equivariant architectures. The skeptic concern about input-dependent selection parameters is a potential correctness or completeness issue for the proof, not evidence that the stated analysis collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that rotation symmetry is a universal image prior and on the technical assumption that the proposed cross-scan and group blocks preserve Mamba's selective dynamics while enforcing equivariance. No new free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Rotation symmetry constitutes a universal and fundamental geometric prior in images.
    Explicitly stated in the abstract as the motivation for the architecture.

pith-pipeline@v0.9.0 · 5573 in / 1220 out tokens · 49348 ms · 2026-05-15T13:40:59.415928+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.

supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.

extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.

uses: The paper appears to rely on the theorem as machinery.

contradicts: The paper's claim conflicts with a theorem or certificate in the canon.

unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
