pith. machine review for the scientific record.

arxiv: 2401.10166 · v4 · submitted 2024-01-18 · 💻 cs.CV

Recognition: 2 Lean theorem links

VMamba: Visual State Space Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords VMamba · Visual State Space Model · State Space Models · Vision Backbone · Linear Complexity · 2D Selective Scan · Computer Vision

The pith

VMamba adapts Mamba's state-space model to vision by scanning 2D images along four fixed routes to reach linear time complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VMamba as a vision backbone that converts the selective state-space model originally used for language into an efficient architecture for images. At its center is the Visual State-Space block containing the 2D Selective Scan module, which traverses each image along four routes to gather contextual information without quadratic cost. The resulting family of models is further accelerated by targeted architectural and implementation changes. If the approach holds, it would let vision networks process higher-resolution inputs at linear rather than quadratic scaling while maintaining competitive accuracy on standard perception tasks.

Core claim

VMamba is a vision backbone with linear time complexity built from stacks of Visual State-Space blocks. Each block incorporates the 2D Selective Scan module that traverses 2D data along four scanning routes, thereby bridging the ordered 1D selective scan with the non-sequential structure of images and allowing contextual information to be collected from multiple directions.
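The "ordered 1D selective scan" that the claim builds on is, at its core, an input-dependent linear recurrence evaluated in a single pass over the sequence, which is where the linear time complexity comes from. A minimal scalar sketch (illustrative only, not the paper's fused kernel; the parameter shapes here are assumptions):

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Minimal 1D selective-scan recurrence (illustrative sketch).

    For each step t:  h_t = A_t * h_{t-1} + B_t * x_t ;  y_t = C_t . h_t
    The parameters A, B, C vary with t (they are input-dependent in
    Mamba, hence "selective"). One pass over the sequence -> O(L) time.
    """
    L, N = A.shape          # L time steps, state size N (assumed shapes)
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        h = A[t] * h + B[t] * x[t]   # selective state update
        y[t] = C[t] @ h              # readout from the hidden state
    return y
```

The point of the sketch is the cost model: the loop touches each of the L positions once, so doubling the sequence length doubles the work, in contrast to pairwise attention.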

What carries the argument

The 2D Selective Scan (SS2D) module, which traverses each 2D feature map along four fixed routes to adapt 1D state-space scanning to vision data.
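The route construction itself is just four orderings of the same H×W grid. A hedged sketch of the flatten/merge step (assumed details: row-major and column-major traversals, each forward and reversed; in the actual module each route is fed through its own selective scan before the merge, which this sketch omits):

```python
import numpy as np

def cross_scan(x):
    """Flatten an (H, W, C) feature map into four 1D sequences:
    row-major forward/backward and column-major forward/backward.
    Illustrative sketch of SS2D's route construction, not the paper's code."""
    H, W, C = x.shape
    rows = x.reshape(H * W, C)                      # row-major, forward
    cols = x.transpose(1, 0, 2).reshape(H * W, C)   # column-major, forward
    return [rows, rows[::-1], cols, cols[::-1]]

def cross_merge(seqs, H, W):
    """Undo each route's ordering and sum the four scans back to (H, W, C).
    In SS2D a selective scan would process each sequence before this merge."""
    rows_f, rows_b, cols_f, cols_b = seqs
    C = rows_f.shape[-1]
    out = rows_f.reshape(H, W, C)
    out = out + rows_b[::-1].reshape(H, W, C)
    out = out + cols_f.reshape(W, H, C).transpose(1, 0, 2)
    out = out + cols_b[::-1].reshape(W, H, C).transpose(1, 0, 2)
    return out
```

Because each pixel appears exactly once in each of the four sequences, every position sees context from all four traversal directions while the total work stays proportional to H·W per route.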

If this is right

  • VMamba achieves promising accuracy on a range of visual perception tasks.
  • The model exhibits better scaling with input size than current benchmark architectures.
  • A family of VMamba variants can be constructed and further accelerated by successive refinements.
  • Linear complexity opens the door to processing larger images or video frames without proportional compute growth.
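The scaling claim in the last bullet can be made concrete with a back-of-envelope operation count (constants, heads, and projections omitted; `d` and `n_state` are illustrative placeholders, not the paper's configuration):

```python
def flops_per_layer(H, W, d=96, n_state=16):
    """Rough operation counts for one layer over an H x W token grid.

    Global self-attention is quadratic in token count L = H*W
    (QK^T and AV each cost ~ L^2 * d); a four-route selective scan
    is linear in L (~ 4 * L * d * n_state). Illustrative only.
    """
    L = H * W
    attn = L * L * d
    scan = 4 * L * d * n_state
    return attn, scan
```

Doubling the input resolution quadruples the token count, so the attention estimate grows 16x while the scan estimate grows 4x; this is the mechanism behind the "process larger images without proportional compute growth" implication.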

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-route scanning idea could be tested on video or 3D data where temporal or volumetric context must be aggregated.
  • Replacing the four routes with learned or adaptive paths might further improve information capture without raising asymptotic cost.
  • The same linear-complexity block could be inserted into hybrid models that combine state-space layers with local convolutions.

Load-bearing premise

That four fixed scanning routes are enough to capture all necessary spatial relationships in 2D visual data.

What would settle it

A controlled experiment in which VMamba accuracy falls below a comparable transformer on a task that requires long-range 2D spatial relations the four routes cannot reach, or in which measured runtime grows quadratically rather than linearly with input resolution.

read the original abstract

Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VMamba, a vision backbone that adapts the Mamba state-space model for computer vision. Its core consists of stacked Visual State-Space (VSS) blocks containing a 2D Selective Scan (SS2D) module; SS2D traverses the 2D feature map along four fixed routes (row-major and column-major orderings, each scanned forward and in reverse) to convert non-sequential 2D data into ordered 1D sequences suitable for selective state-space modeling, thereby achieving linear time complexity while collecting contextual information from multiple perspectives. The authors construct a family of VMamba architectures, apply successive optimizations, and report extensive experiments across visual perception tasks that demonstrate competitive accuracy together with superior input scaling efficiency relative to existing benchmarks.

Significance. If the experimental claims hold, the work supplies a concrete, linear-complexity alternative to quadratic-attention vision transformers and large-kernel convolutions. The explicit release of code and the focus on input scaling efficiency constitute reproducible strengths that could influence the design of efficient backbones for high-resolution or long-sequence vision tasks.

major comments (2)
  1. [§3.2] §3.2 (SS2D module description): the claim that scanning along exactly four fixed routes 'facilitates the collection of contextual information from various sources and perspectives' is load-bearing for the linear-complexity superiority argument, yet no derivation, information-theoretic argument, or ablation on route count/adaptivity is supplied to show that these four traversals are sufficient to avoid systematic blind spots for arbitrary 2D spatial configurations.
  2. [§4] §4 (experimental section): the headline claim of 'superior input scaling efficiency' is asserted without visible tables reporting FLOPs-vs-accuracy curves, resolution scaling plots, or error bars across multiple runs; without these quantitative details the central efficiency advantage cannot be independently verified from the manuscript.
minor comments (2)
  1. [§3.1] Notation for the state-space parameters (A, B, C, Δ) is introduced without an explicit reminder of their correspondence to the original Mamba formulation; a brief cross-reference would improve readability.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit labeling of the four scan directions and the merge operation that recombines the scanned sequences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of the SS2D module and experimental results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (SS2D module description): the claim that scanning along exactly four fixed routes 'facilitates the collection of contextual information from various sources and perspectives' is load-bearing for the linear-complexity superiority argument, yet no derivation, information-theoretic argument, or ablation on route count/adaptivity is supplied to show that these four traversals are sufficient to avoid systematic blind spots for arbitrary 2D spatial configurations.

    Authors: We agree that additional justification would strengthen the manuscript. In the revision we have added an ablation study (new Table 4 in Section 4.3) that compares performance with 2, 4, 6, and 8 scanning routes; results show that accuracy saturates at four routes with only marginal gains thereafter. Section 3.2 has been expanded with a short explanation that the four fixed routes (horizontal and vertical traversals, each run forward and in reverse) are selected to capture the dominant spatial axes in 2D grids while preserving linear complexity. A formal information-theoretic derivation or proof of absence of blind spots is not supplied, as it lies beyond the scope of the current work; we note this explicitly as an avenue for future research. revision: partial

  2. Referee: [§4] §4 (experimental section): the headline claim of 'superior input scaling efficiency' is asserted without visible tables reporting FLOPs-vs-accuracy curves, resolution scaling plots, or error bars across multiple runs; without these quantitative details the central efficiency advantage cannot be independently verified from the manuscript.

    Authors: We thank the referee for highlighting this gap. The revised manuscript adds Figure 5, which plots FLOPs against top-1 accuracy for input resolutions ranging from 224×224 to 1024×1024, and updates Table 3 to report mean accuracy and standard deviation over three independent runs. These additions directly support the input-scaling-efficiency claims and allow independent verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VMamba architectural proposal

full rationale

The paper presents VMamba as an adaptation of the external Mamba model via an explicit new SS2D module that traverses four fixed scanning routes to handle 2D data. No equations, performance metrics, or modeling claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the four-route design is stated as a bridging mechanism whose sufficiency is supported by experiments rather than derived tautologically. The derivation chain is self-contained and is validated against external baselines such as attention and convolution, with no load-bearing uniqueness theorems or ansatzes imported from overlapping prior work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that four directional 1D scans suffice for 2D context and on standard neural-network training practices. No new physical entities are postulated; the main additions are architectural modules whose effectiveness is treated as empirical.

free parameters (2)
  • number of scanning directions
    Fixed at four routes; chosen by design rather than learned from data.
  • VSS block hyperparameters
    Layer widths, depths, and state dimensions are standard architecture choices fitted during model search.
axioms (1)
  • domain assumption: Mamba's selective scan can be extended to 2D images by multiple 1D traversals without loss of essential spatial modeling power.
    Invoked when SS2D is introduced to bridge the ordered 1D scan and non-sequential 2D data.
invented entities (2)
  • Visual State-Space (VSS) block (no independent evidence)
    purpose: Core repeatable unit replacing attention or convolution layers in the vision backbone.
    New module defined in the paper; no independent evidence outside the empirical results.
  • 2D Selective Scan (SS2D) module (no independent evidence)
    purpose: Mechanism that performs four directional scans to collect contextual information.
    Novel component introduced to adapt Mamba; effectiveness shown only via the reported experiments.

pith-pipeline@v0.9.0 · 5485 in / 1430 out tokens · 65525 ms · 2026-05-16T18:19:04.355281+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  2. DGSSM: Diffusion guided state-space models for multimodal salient object detection

    cs.CV 2026-04 unverdicted novelty 7.0

    DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.

  3. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    cs.CV 2024-01 conditional novelty 7.0

    Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

  4. Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

    eess.IV 2026-05 conditional novelty 6.0

    FreeHemoSeg detects fetal GMH-IVH on T2-weighted MRI with high sensitivity and specificity and moderate segmentation accuracy using pseudo-image synthesis from normal scans, outperforming supervised and unsupervised b...

  5. EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...

  6. BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

    cs.CV 2026-04 unverdicted novelty 6.0

    BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...

  7. A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation

    eess.IV 2026-04 unverdicted novelty 6.0

    Controlled tests on LoveDA and ISPRS Potsdam show visual SSM encoders deliver favorable speed-accuracy trade-offs but suffer most from boundary errors under domain shift, indicating that robustness and boundary-aware ...

  8. HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

    cs.CV 2026-04 unverdicted novelty 6.0

    HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.

  9. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  10. TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

    cs.CV 2026-04 unverdicted novelty 5.0

    TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.

  11. Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.

  12. MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.

  13. BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 5.0

    BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior methods on nuScenes for BEV instance prediction.

  14. Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression

    cs.CV 2026-03 conditional novelty 5.0

    On scarce dual-view pasture data, a simple two-layer gated depthwise convolution fusion achieves R²=0.903, beating cross-view attention transformers (0.833), bidirectional SSMs (0.819), and Mamba (0.793), while backbo...

  15. Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

    cs.CV 2026-04 unverdicted novelty 4.0

    Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.

  16. The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

    cs.CV 2026-04 accept novelty 3.0

    The NTIRE 2026 challenge establishes a benchmark for x4 super-resolution of remote sensing infrared images, with 13 teams submitting valid methods evaluated on a dedicated dataset.

  17. The First Challenge on Mobile Real-World Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 mobile real-world image super-resolution challenge received 16 valid submissions and overviews methods balancing image quality with mobile execution speed.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 17 Pith papers · 7 internal anchors

  1. [1]

    Xcit: Cross-covariance image transformers

    Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. NeurIPS, 34:20014–20027, 2021

  2. [2]

    Prefix sums and their applications

    Guy E Blelloch. Prefix sums and their applications. 1990

  3. [3]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Mmdetection: Open mmlab detection toolbox and b...

  4. [4]

    MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark

    MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020

  5. [5]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017

  6. [6]

    Coatnet: Marrying convolution and attention for all data sizes

    Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 34:3965–3977, 2021

  7. [7]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2023

  8. [8]

    Flashattention: Fast and memory- efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. NeurIPS, 35:16344–16359, 2022

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009

  10. [10]

    Davit: Dual attention vision transformers

    Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. Davit: Dual attention vision transformers. In ECCV, pages 74–92, 2022

  11. [11]

    Scaling up your kernels to 31x31: Revisiting large kernel design in cnns

    Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, pages 11963–11975, 2022

  12. [12]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, pages 12124–12134, 2022

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  14. [14]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018

  15. [15]

    Rmt: Retentive networks meet vision transformers

    Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. Rmt: Retentive networks meet vision transformers. In CVPR, 2024

  16. [16]

    Hungry hungry hippos: Towards language modeling with state space models

    Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In ICLR, 2022

  17. [17]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  18. [18]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. NeurIPS, 33:1474–1487, 2020

  19. [19]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. NeurIPS, 35:35971–35983, 2022

  20. [20]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In ICLR, 2021

  21. [21]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

    Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. NeurIPS, 34:572–585, 2021

  22. [22]

    Diagonal state spaces are as effective as structured state spaces

    Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. NeurIPS, 35:22982–22994, 2022

  23. [23]

    On the connection between local attention and dynamic depth-wise convolution

    Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. On the connection between local attention and dynamic depth-wise convolution. In ICLR, 2021

  24. [24]

    Liquid structural state-space models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In ICLR, 2022

  25. [25]

    Neighborhood attention transformer

    Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In CVPR, pages 6185–6194, 2023

  26. [26]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017

  27. [27]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016

  28. [28]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

  29. [29]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017

  30. [30]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, pages 5156–5165, 2020

  31. [31]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, pages 1106–1114, 2012

  32. [32]

    A new approach to linear filtering and prediction problems

    Rudolf Emil Kálmán. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960

  33. [33]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014

  34. [34]

    More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity

    Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Constantin Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. In ICLR, 2023

  35. [35]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pages 12009–12019, 2022

  36. [36]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021

  37. [37]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022

  38. [38]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  39. [39]

    Container: Context aggregation networks

    Jiasen Lu, Roozbeh Mottaghi, Aniruddha Kembhavi, et al. Container: Context aggregation networks. NeurIPS, 34:19160–19171, 2021

  40. [40]

    Understanding the effective receptive field in deep convolutional neural networks

    Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. NeurIPS, 29:4898–4906, 2016

  41. [41]

    Mega: Moving average equipped gated attention

    Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In ICLR, 2022

  42. [42]

    Parallelizing linear recurrent neural nets over sequence length

    Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In ICLR, 2018

  43. [43]

    Long range language modeling via gated state spaces

    Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. In ICLR, 2023

  44. [44]

    S4nd: Modeling images and videos as multidimensional signals with state spaces

    Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. NeurIPS, 35:2846–2861, 2022

  45. [45]

    RWKV: reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. RWKV: reinventing rnns for the transformer era. In EMNLP, pages 14048–14077, 2023

  46. [46]

    Designing network design spaces

    Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, pages 10428–10436, 2020

  47. [47]

    Hornet: Efficient high-order spatial interactions with recursive gated convolutions

    Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. NeurIPS, 35:10353–10366, 2022

  48. [48]

    Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models

    Mark Schöne, Neeraj Mohan Sushma, Jingyue Zhuge, Christian Mayr, Anand Subramoney, and David Kappel. Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models. arXiv preprint arXiv:2404.18508, 2024

  49. [49]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015

  50. [50]

    Simplified state space layers for sequence modeling

    Jimmy TH Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. In ICLR, 2022

  51. [51]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

  52. [52]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015

  53. [53]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019

  54. [54]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6), 2022

  55. [55]

    Integrally pre-trained transformer pyramid networks

    Yunjie Tian, Lingxi Xie, Zhaozhi Wang, Longhui Wei, Xiaopeng Zhang, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Integrally pre-trained transformer pyramid networks. In CVPR, pages 18610–18620, 2023

  56. [56]

    Mlp-mixer: An all-mlp architecture for vision

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. NeurIPS, 34:24261–24272, 2021

  57. [57]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30:5998–6008, 2017

  59. [59]

    Selective structured state-spaces for long-form video understanding

    Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. Selective structured state-spaces for long-form video understanding. In CVPR, pages 6387–6397, 2023

  60. [60]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  61. [61]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021

  62. [62]

    Pytorch image models

    Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019

  63. [63]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018

  64. [64]

    Focal self-attention for local-global interactions in vision transformers

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021

  65. [65]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023

  66. [66]

    Hivit: A simpler and more efficient design of hierarchical vision transformer

    Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and Qi Tian. Hivit: A simpler and more efficient design of hierarchical vision transformer. In ICLR, 2023

  67. [67]

    Graformer: Graph-oriented transformer for 3d pose estimation

    Weixi Zhao, Weiqiang Wang, and Yunjie Tian. Graformer: Graph-oriented transformer for 3d pose estimation. In CVPR, pages 20438–20447, 2022

  68. [68]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 5122–5130, 2017

  69. [69]

    Vision mamba: Efficient visual representation learning with bidirectional state space model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In ICML, 2024

  70. [70]

    Deformable convnets v2: More deformable, better results

    Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019

  71. [71]

    1×" indicates models fine-tuned for 12 epochs, while

    Nikola Zubic, Mathias Gehrig, and Davide Scaramuzza. State space models for event cameras. In CVPR, pages 5819–5828, 2024. 14 A Discretization of State Space Models (SSMs) In this section, we explore the correlation between the discretized formulations of State Space Models (SSMs) obtained in Sec. 3 and those derived from the zero-order hold (ZOH) method ...
