pith. machine review for the scientific record.

arxiv: 2401.10166 · v4 · submitted 2024-01-18 · 💻 cs.CV

Recognition: 2 Lean theorem links

VMamba: Visual State Space Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords VMamba · Visual State Space Model · State Space Models · Vision Backbone · Linear Complexity · 2D Selective Scan · Computer Vision

The pith

VMamba adapts Mamba's state-space model to vision by scanning 2D images along four fixed routes to reach linear time complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VMamba as a vision backbone that converts the selective state-space model originally used for language into an efficient architecture for images. At its center is the Visual State-Space block containing the 2D Selective Scan module, which traverses each image along four routes to gather contextual information without quadratic cost. The resulting family of models is further accelerated by targeted architectural and implementation changes. If the approach holds, it would let vision networks process higher-resolution inputs at linear rather than quadratic scaling while maintaining competitive accuracy on standard perception tasks.

Core claim

VMamba is a vision backbone with linear time complexity built from stacks of Visual State-Space blocks. Each block incorporates the 2D Selective Scan module that traverses 2D data along four scanning routes, thereby bridging the ordered 1D selective scan with the non-sequential structure of images and allowing contextual information to be collected from multiple directions.
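The "ordered 1D selective scan" that the claim builds on is, at its core, an input-dependent linear recurrence evaluated in a single pass over the sequence, which is where the linear time complexity comes from. A minimal scalar sketch (illustrative only, not the paper's fused kernel; the parameter shapes here are assumptions):

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Minimal 1D selective-scan recurrence (illustrative sketch).

    For each step t:  h_t = A_t * h_{t-1} + B_t * x_t ;  y_t = C_t . h_t
    The parameters A, B, C vary with t (they are input-dependent in
    Mamba, hence "selective"). One pass over the sequence -> O(L) time.
    """
    L, N = A.shape          # L time steps, state size N (assumed shapes)
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        h = A[t] * h + B[t] * x[t]   # selective state update
        y[t] = C[t] @ h              # readout from the hidden state
    return y
```

The point of the sketch is the cost model: the loop touches each of the L positions once, so doubling the sequence length doubles the work, in contrast to pairwise attention.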

What carries the argument

The 2D Selective Scan (SS2D) module, which traverses each 2D feature map along four fixed routes to adapt 1D state-space scanning to vision data.
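The route construction itself is just four orderings of the same H×W grid. A hedged sketch of the flatten/merge step (assumed details: row-major and column-major traversals, each forward and reversed; in the actual module each route is fed through its own selective scan before the merge, which this sketch omits):

```python
import numpy as np

def cross_scan(x):
    """Flatten an (H, W, C) feature map into four 1D sequences:
    row-major forward/backward and column-major forward/backward.
    Illustrative sketch of SS2D's route construction, not the paper's code."""
    H, W, C = x.shape
    rows = x.reshape(H * W, C)                      # row-major, forward
    cols = x.transpose(1, 0, 2).reshape(H * W, C)   # column-major, forward
    return [rows, rows[::-1], cols, cols[::-1]]

def cross_merge(seqs, H, W):
    """Undo each route's ordering and sum the four scans back to (H, W, C).
    In SS2D a selective scan would process each sequence before this merge."""
    rows_f, rows_b, cols_f, cols_b = seqs
    C = rows_f.shape[-1]
    out = rows_f.reshape(H, W, C)
    out = out + rows_b[::-1].reshape(H, W, C)
    out = out + cols_f.reshape(W, H, C).transpose(1, 0, 2)
    out = out + cols_b[::-1].reshape(W, H, C).transpose(1, 0, 2)
    return out
```

Because each pixel appears exactly once in each of the four sequences, every position sees context from all four traversal directions while the total work stays proportional to H·W per route.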

If this is right

  • VMamba achieves promising accuracy on a range of visual perception tasks.
  • The model exhibits better scaling with input size than current benchmark architectures.
  • A family of VMamba variants can be constructed and further accelerated by successive refinements.
  • Linear complexity opens the door to processing larger images or video frames without proportional compute growth.
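The scaling claim in the last bullet can be made concrete with a back-of-envelope operation count (constants, heads, and projections omitted; `d` and `n_state` are illustrative placeholders, not the paper's configuration):

```python
def flops_per_layer(H, W, d=96, n_state=16):
    """Rough operation counts for one layer over an H x W token grid.

    Global self-attention is quadratic in token count L = H*W
    (QK^T and AV each cost ~ L^2 * d); a four-route selective scan
    is linear in L (~ 4 * L * d * n_state). Illustrative only.
    """
    L = H * W
    attn = L * L * d
    scan = 4 * L * d * n_state
    return attn, scan
```

Doubling the input resolution quadruples the token count, so the attention estimate grows 16x while the scan estimate grows 4x; this is the mechanism behind the "process larger images without proportional compute growth" implication.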

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-route scanning idea could be tested on video or 3D data where temporal or volumetric context must be aggregated.
  • Replacing the four routes with learned or adaptive paths might further improve information capture without raising asymptotic cost.
  • The same linear-complexity block could be inserted into hybrid models that combine state-space layers with local convolutions.

Load-bearing premise

That four fixed scanning routes are enough to capture all necessary spatial relationships in 2D visual data.

What would settle it

A controlled experiment in which VMamba accuracy falls below a comparable transformer on a task that requires long-range 2D spatial relations the four routes cannot reach, or in which measured runtime grows quadratically rather than linearly with input resolution.

read the original abstract

Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VMamba, a vision backbone that adapts the Mamba state-space model for computer vision. Its core consists of stacked Visual State-Space (VSS) blocks containing a 2D Selective Scan (SS2D) module; SS2D traverses the 2D feature map along four fixed routes (row-major and column-major orderings, each scanned forward and in reverse) to convert non-sequential 2D data into ordered 1D sequences suitable for selective state-space modeling, thereby achieving linear time complexity while collecting contextual information from multiple perspectives. The authors construct a family of VMamba architectures, apply successive optimizations, and report extensive experiments across visual perception tasks that demonstrate competitive accuracy together with superior input scaling efficiency relative to existing benchmarks.

Significance. If the experimental claims hold, the work supplies a concrete, linear-complexity alternative to quadratic-attention vision transformers and large-kernel convolutions. The explicit release of code and the focus on input scaling efficiency constitute reproducible strengths that could influence the design of efficient backbones for high-resolution or long-sequence vision tasks.

major comments (2)
  1. [§3.2] §3.2 (SS2D module description): the claim that scanning along exactly four fixed routes 'facilitates the collection of contextual information from various sources and perspectives' is load-bearing for the linear-complexity superiority argument, yet no derivation, information-theoretic argument, or ablation on route count/adaptivity is supplied to show that these four traversals are sufficient to avoid systematic blind spots for arbitrary 2D spatial configurations.
  2. [§4] §4 (experimental section): the headline claim of 'superior input scaling efficiency' is asserted without visible tables reporting FLOPs-vs-accuracy curves, resolution scaling plots, or error bars across multiple runs; without these quantitative details the central efficiency advantage cannot be independently verified from the manuscript.
minor comments (2)
  1. [§3.1] Notation for the state-space parameters (A, B, C, Δ) is introduced without an explicit reminder of their correspondence to the original Mamba formulation; a brief cross-reference would improve readability.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit labeling of the four scan directions and the merge operation that recombines the scanned sequences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of the SS2D module and experimental results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (SS2D module description): the claim that scanning along exactly four fixed routes 'facilitates the collection of contextual information from various sources and perspectives' is load-bearing for the linear-complexity superiority argument, yet no derivation, information-theoretic argument, or ablation on route count/adaptivity is supplied to show that these four traversals are sufficient to avoid systematic blind spots for arbitrary 2D spatial configurations.

    Authors: We agree that additional justification would strengthen the manuscript. In the revision we have added an ablation study (new Table 4 in Section 4.3) that compares performance with 2, 4, 6, and 8 scanning routes; results show that accuracy saturates at four routes with only marginal gains thereafter. Section 3.2 has been expanded with a short explanation that the four fixed routes (horizontal and vertical traversals, each run forward and in reverse) are selected to capture the dominant spatial axes in 2D grids while preserving linear complexity. A formal information-theoretic derivation or proof of absence of blind spots is not supplied, as it lies beyond the scope of the current work; we note this explicitly as an avenue for future research. revision: partial

  2. Referee: [§4] §4 (experimental section): the headline claim of 'superior input scaling efficiency' is asserted without visible tables reporting FLOPs-vs-accuracy curves, resolution scaling plots, or error bars across multiple runs; without these quantitative details the central efficiency advantage cannot be independently verified from the manuscript.

    Authors: We thank the referee for highlighting this gap. The revised manuscript adds Figure 5, which plots FLOPs against top-1 accuracy for input resolutions ranging from 224×224 to 1024×1024, and updates Table 3 to report mean accuracy and standard deviation over three independent runs. These additions directly support the input-scaling-efficiency claims and allow independent verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VMamba architectural proposal

full rationale

The paper presents VMamba as an adaptation of the external Mamba model via an explicit new SS2D module that traverses four fixed scanning routes to handle 2D data. No equations, performance metrics, or modeling claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the four-route design is stated as a bridging mechanism whose sufficiency is supported by experiments rather than derived tautologically. The derivation chain is self-contained and is validated against external baselines such as attention and convolution, with no load-bearing uniqueness theorems or ansatzes imported from overlapping prior work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that four directional 1D scans suffice for 2D context and on standard neural-network training practices. No new physical entities are postulated; the main additions are architectural modules whose effectiveness is treated as empirical.

free parameters (2)
  • number of scanning directions
    Fixed at four routes; chosen by design rather than learned from data.
  • VSS block hyperparameters
    Layer widths, depths, and state dimensions are standard architecture choices fitted during model search.
axioms (1)
  • domain assumption: Mamba's selective scan can be extended to 2D images by multiple 1D traversals without loss of essential spatial modeling power.
    Invoked when SS2D is introduced to bridge the ordered 1D scan and non-sequential 2D data.
invented entities (2)
  • Visual State-Space (VSS) block (no independent evidence)
    purpose: Core repeatable unit replacing attention or convolution layers in the vision backbone.
    New module defined in the paper; no independent evidence outside the empirical results.
  • 2D Selective Scan (SS2D) module (no independent evidence)
    purpose: Mechanism that performs four directional scans to collect contextual information.
    Novel component introduced to adapt Mamba; effectiveness shown only via the reported experiments.

pith-pipeline@v0.9.0 · 5485 in / 1430 out tokens · 65525 ms · 2026-05-16T18:19:04.355281+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  2. DGSSM: Diffusion guided state-space models for multimodal salient object detection

    cs.CV 2026-04 unverdicted novelty 7.0

    DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.

  3. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    cs.CV 2024-01 conditional novelty 7.0

    Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

  4. Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

    eess.IV 2026-05 conditional novelty 6.0

    FreeHemoSeg detects fetal GMH-IVH on T2-weighted MRI with high sensitivity and specificity and moderate segmentation accuracy using pseudo-image synthesis from normal scans, outperforming supervised and unsupervised b...

  5. EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...

  6. BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

    cs.CV 2026-04 unverdicted novelty 6.0

    BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...

  7. A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation

    eess.IV 2026-04 unverdicted novelty 6.0

    Controlled tests on LoveDA and ISPRS Potsdam show visual SSM encoders deliver favorable speed-accuracy trade-offs but suffer most from boundary errors under domain shift, indicating that robustness and boundary-aware ...

  8. HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

    cs.CV 2026-04 unverdicted novelty 6.0

    HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.

  9. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  10. TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

    cs.CV 2026-04 unverdicted novelty 5.0

    TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.

  11. Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.

  12. MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.

  13. BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 5.0

    BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior methods on nuScenes for BEV instance prediction.

  14. Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression

    cs.CV 2026-03 conditional novelty 5.0

    On scarce dual-view pasture data, a simple two-layer gated depthwise convolution fusion achieves R²=0.903, beating cross-view attention transformers (0.833), bidirectional SSMs (0.819), and Mamba (0.793), while backbo...

  15. Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

    cs.CV 2026-04 unverdicted novelty 4.0

    Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.

  16. The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

    cs.CV 2026-04 accept novelty 3.0

    The NTIRE 2026 challenge establishes a benchmark for x4 super-resolution of remote sensing infrared images, with 13 teams submitting valid methods evaluated on a dedicated dataset.

  17. The First Challenge on Mobile Real-World Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 mobile real-world image super-resolution challenge received 16 valid submissions and overviews methods balancing image quality with mobile execution speed.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 17 Pith papers · 7 internal anchors

  1. [1]

    Xcit: Cross-covariance image transformers

    Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. NeurIPS, 34:20014–20027, 2021

  2. [2]

    Prefix sums and their applications

    Guy E Blelloch. Prefix sums and their applications. 1990

  3. [3]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Mmdetection: Open mmlab detection toolbox and b...

  4. [4]

    MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark

    MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020

  5. [5]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017

  6. [6]

    Coatnet: Marrying convolution and attention for all data sizes

    Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 34:3965–3977, 2021

  7. [7]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2023

  8. [8]

    Flashattention: Fast and memory- efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. NeurIPS, 35:16344–16359, 2022

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009

  10. [10]

    Davit: Dual attention vision transformers

    Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. Davit: Dual attention vision transformers. In ECCV, pages 74–92, 2022

  11. [11]

    Scaling up your kernels to 31x31: Revisiting large kernel design in cnns

    Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, pages 11963–11975, 2022

  12. [12]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, pages 12124–12134, 2022

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  14. [14]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018

  15. [15]

    Rmt: Retentive networks meet vision transformers

    Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. Rmt: Retentive networks meet vision transformers. In CVPR, 2024

  16. [16]

    Hungry hungry hippos: Towards language modeling with state space models

    Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In ICLR, 2022

  17. [17]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  18. [18]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. NeurIPS, 33:1474–1487, 2020

  19. [19]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. NeurIPS, 35:35971–35983, 2022

  20. [20]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In ICLR, 2021

  21. [21]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. NeurIPS, 34:572–585, 2021

  22. [22]

    Diagonal state spaces are as effective as structured state spaces

    Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. NeurIPS, 35:22982–22994, 2022

  23. [23]

    On the connection between local attention and dynamic depth-wise convolution

    Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. On the connection between local attention and dynamic depth-wise convolution. In ICLR, 2021

  24. [24]

    Liquid structural state-space models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In ICLR, 2022

  25. [25]

    Neighborhood attention transformer

    Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In CVPR, pages 6185–6194, 2023

  26. [26]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017

  27. [27]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016

  28. [28]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

  29. [29]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017

  30. [30]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, pages 5156–5165, 2020

  31. [31]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, pages 1106–1114, 2012

  32. [32]

    A new approach to linear filtering and prediction problems

    Rudolf Emil Kálmán. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960

  33. [33]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014

  34. [34]

    More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity

    Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Constantin Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. In ICLR, 2023

  35. [35]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pages 12009–12019, 2022

  36. [36]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021

  37. [37]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022

  38. [38]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  39. [39]

    Container: Context aggregation networks

    Jiasen Lu, Roozbeh Mottaghi, Aniruddha Kembhavi, et al. Container: Context aggregation networks. NeurIPS, 34:19160–19171, 2021

  40. [40]

    Understanding the effective receptive field in deep convolutional neural networks

    Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. NeurIPS, 29:4898–4906, 2016

  41. [41]

    Mega: Moving average equipped gated attention

    Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In ICLR, 2022

  42. [42]

    Parallelizing linear recurrent neural nets over sequence length

    Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In ICLR, 2018

  43. [43]

    Long range language modeling via gated state spaces

    Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. In ICLR, 2023

  44. [44]

    S4nd: Modeling images and videos as multidimensional signals with state spaces

    Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. NeurIPS, 35:2846–2861, 2022

  45. [45]

    RWKV: reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. RWKV: reinventing rnns for the transformer era. In EMNLP, pages 14048–14077, 2023

  46. [46]

    Designing network design spaces

    Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, pages 10428–10436, 2020

  47. [47]

    Hornet: Efficient high-order spatial interactions with recursive gated convolutions

    Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. NeurIPS, 35:10353–10366, 2022

  48. [48]

    Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models

    Mark Schöne, Neeraj Mohan Sushma, Jingyue Zhuge, Christian Mayr, Anand Subramoney, and David Kappel. Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models. arXiv preprint arXiv:2404.18508, 2024

  49. [49]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015

  50. [50]

    Simplified state space layers for sequence modeling

    Jimmy TH Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. In ICLR, 2022

  51. [51]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  52. [52]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015

  53. [53]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019

  54. [54]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6), 2022

  55. [55]

    Integrally pre-trained transformer pyramid networks

    Yunjie Tian, Lingxi Xie, Zhaozhi Wang, Longhui Wei, Xiaopeng Zhang, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Integrally pre-trained transformer pyramid networks. In CVPR, pages 18610–18620, 2023

  56. [56]

    Mlp-mixer: An all-mlp architecture for vision

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. NeurIPS, 34:24261–24272, 2021

  57. [57]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30:5998–6008, 2017

  59. [59]

    Selective structured state-spaces for long-form video understanding

    Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. Selective structured state-spaces for long-form video understanding. In CVPR, pages 6387–6397, 2023

  60. [60]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  61. [61]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021

  62. [62]

    Pytorch image models

    Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019

  63. [63]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018

  64. [64]

    Focal self-attention for local-global interactions in vision transformers

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021

  65. [65]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023

  66. [66]

    Hivit: A simpler and more efficient design of hierarchical vision transformer

    Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and Qi Tian. Hivit: A simpler and more efficient design of hierarchical vision transformer. In ICLR, 2023

  67. [67]

    Graformer: Graph-oriented transformer for 3d pose estimation

    Weixi Zhao, Weiqiang Wang, and Yunjie Tian. Graformer: Graph-oriented transformer for 3d pose estimation. In CVPR, pages 20438–20447, 2022

  68. [68]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 5122–5130, 2017

  69. [69]

    Vision mamba: Efficient visual representation learning with bidirectional state space model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In ICML, 2024

  70. [70]

    Deformable convnets v2: More deformable, better results

    Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019

  71. [71]

    1×" indicates models fine-tuned for 12 epochs, while

    Nikola Zubic, Mathias Gehrig, and Davide Scaramuzza. State space models for event cameras. In CVPR, pages 5819–5828, 2024. 14 A Discretization of State Space Models (SSMs) In this section, we explore the correlation between the discretized formulations of State Space Models (SSMs) obtained in Sec. 3 and those derived from the zero-order hold (ZOH) method ...
