Recognition: 2 Lean theorem links
VMamba: Visual State Space Model
Pith reviewed 2026-05-16 18:19 UTC · model grok-4.3
The pith
VMamba adapts Mamba's state-space model to vision by scanning 2D images along four fixed routes to reach linear time complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VMamba is a vision backbone with linear time complexity built from stacks of Visual State-Space blocks. Each block incorporates the 2D Selective Scan module that traverses 2D data along four scanning routes, thereby bridging the ordered 1D selective scan with the non-sequential structure of images and allowing contextual information to be collected from multiple directions.
What carries the argument
The 2D Selective Scan (SS2D) module, which traverses each 2D feature map along four fixed routes to adapt 1D state-space scanning to vision data.
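To make the route construction concrete, here is a minimal sketch, not the released implementation: it assumes the four routes are row-major and column-major traversals of the patch grid, each run forward and backward, and it substitutes a cumulative sum for Mamba's selective scan (`selective_scan_1d` is a placeholder) so the example stays self-contained.

```python
# Sketch of an SS2D-style cross-scan / cross-merge (illustrative; not the reference code).
import torch

def selective_scan_1d(x):
    # Placeholder for Mamba's selective scan: any causal, linear-time sequence operator.
    return torch.cumsum(x, dim=-1)

def ss2d(x):
    """x: (B, C, H, W) feature map -> (B, C, H, W) after mixing along four scan routes."""
    B, C, H, W = x.shape
    rows = x.flatten(2)                               # route 1: row-major order, (B, C, H*W)
    cols = x.transpose(2, 3).flatten(2)               # route 2: column-major order
    routes = torch.stack([rows, cols, rows.flip(-1), cols.flip(-1)], dim=1)  # (B, 4, C, L)

    scanned = selective_scan_1d(routes)               # each route is O(L) with L = H * W

    # Cross-merge: undo the reversals, map column-major routes back to row-major, then sum.
    fwd = scanned[:, 0]
    col = scanned[:, 1].reshape(B, C, W, H).transpose(2, 3).flatten(2)
    bwd = scanned[:, 2].flip(-1)
    col_bwd = scanned[:, 3].flip(-1).reshape(B, C, W, H).transpose(2, 3).flatten(2)
    return (fwd + col + bwd + col_bwd).reshape(B, C, H, W)

y = ss2d(torch.randn(2, 8, 14, 14))
print(y.shape)  # torch.Size([2, 8, 14, 14])
```

Each route costs O(HW), so the four-route block as a whole stays linear in the number of patches.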
If this is right
- VMamba achieves promising accuracy on a range of visual perception tasks.
- The model exhibits better scaling with input size than current benchmark architectures.
- A family of VMamba variants can be constructed and further accelerated by successive refinements.
- Linear complexity opens the door to processing larger images or video frames without the quadratic compute growth of full self-attention; a rough cost comparison is sketched below.
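As a rough illustration of that last point (illustrative arithmetic, not measurements from the paper), the sketch below compares how the token-mixing cost of a linear-time scan and of full self-attention would grow with resolution at a fixed 16×16 patch size:

```python
# Illustrative cost growth of the token-mixing step only (no measured numbers).
for side in (224, 448, 896):
    n = (side // 16) ** 2          # token count at a fixed 16x16 patch size
    print(f"{side:>4}px  N={n:>5}  linear O(N)={n:>10,}  attention O(N^2)={n * n:>14,}")
# Doubling the side length multiplies N by 4: the linear term grows 4x, the quadratic term 16x.
```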
Where Pith is reading between the lines
- The fixed-route scanning idea could be tested on video or 3D data where temporal or volumetric context must be aggregated.
- Replacing the four routes with learned or adaptive paths might further improve information capture without raising asymptotic cost.
- The same linear-complexity block could be inserted into hybrid models that combine state-space layers with local convolutions.
Load-bearing premise
That four fixed scanning routes are enough to capture all necessary spatial relationships in 2D visual data.
What would settle it
A controlled experiment in which VMamba accuracy falls below a comparable transformer on a task that requires long-range 2D spatial relations the four routes cannot reach, or in which measured runtime grows quadratically rather than linearly with input resolution.
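The runtime half of that test is straightforward to operationalize: time the forward pass at several resolutions and fit the slope of log runtime against log token count; a slope near 1 is consistent with linear scaling, a slope near 2 with quadratic. The sketch below is an outline under stated assumptions, not the authors' benchmark script; `model` is any backbone callable and a 16×16 patch embedding is assumed.

```python
# Sketch of a resolution-scaling check (hypothetical harness; not from the released code).
import time
import numpy as np
import torch

def fitted_scaling_exponent(model, sides=(224, 320, 448, 640), repeats=5):
    tokens, runtimes = [], []
    for side in sides:
        x = torch.randn(1, 3, side, side)
        with torch.no_grad():
            model(x)                              # warm-up
            t0 = time.perf_counter()
            for _ in range(repeats):
                model(x)
            runtimes.append((time.perf_counter() - t0) / repeats)
        tokens.append((side // 16) ** 2)          # assumes a 16x16 patch embedding
    # Slope of log runtime vs. log token count: ~1 suggests linear, ~2 quadratic scaling.
    return np.polyfit(np.log(tokens), np.log(runtimes), 1)[0]

# Stand-in module for demonstration; swap in a VMamba backbone to run the actual check.
exponent = fitted_scaling_exponent(torch.nn.Conv2d(3, 8, 3, padding=1))
print(f"fitted scaling exponent ~ {exponent:.2f}")
```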
Original abstract
Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VMamba, a vision backbone that adapts the Mamba state-space model for computer vision. Its core consists of stacked Visual State-Space (VSS) blocks containing a 2D Selective Scan (SS2D) module; SS2D traverses the 2D feature map along four fixed routes (row-major and column-major traversals, each in forward and reverse order) to convert non-sequential 2D data into ordered 1D sequences suitable for selective state-space modeling, thereby achieving linear time complexity while collecting contextual information from multiple perspectives. The authors construct a family of VMamba architectures, apply successive optimizations, and report extensive experiments across visual perception tasks that demonstrate competitive accuracy together with superior input scaling efficiency relative to existing benchmarks.
Significance. If the experimental claims hold, the work supplies a concrete, linear-complexity alternative to quadratic-attention vision transformers and large-kernel convolutions. The explicit release of code and the focus on input scaling efficiency constitute reproducible strengths that could influence the design of efficient backbones for high-resolution or long-sequence vision tasks.
major comments (2)
- [§3.2] (SS2D module description): the claim that scanning along exactly four fixed routes 'facilitates the collection of contextual information from various sources and perspectives' is load-bearing for the linear-complexity superiority argument, yet no derivation, information-theoretic argument, or ablation on route count/adaptivity is supplied to show that these four traversals are sufficient to avoid systematic blind spots for arbitrary 2D spatial configurations.
- [§4] (experimental section): the headline claim of 'superior input scaling efficiency' is asserted without visible tables reporting FLOPs-vs-accuracy curves, resolution scaling plots, or error bars across multiple runs; without these quantitative details the central efficiency advantage cannot be independently verified from the manuscript.
minor comments (2)
- [§3.1] Notation for the state-space parameters (A, B, C, Δ) is introduced without an explicit reminder of their correspondence to the original Mamba formulation; a brief cross-reference would improve readability (the standard formulation is restated after this list for reference).
- [Figure 2] Figure 2 (architecture diagram) would benefit from explicit labeling of the four scan directions and the merge operation that recombines the scanned sequences.
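For readers who want that reminder, the standard Mamba background can be stated as follows (recalled here from the literature, not quoted from the manuscript): a continuous state-space model with parameters A, B, C and a step size Δ, discretized by zero-order hold, yields the linear-time recurrence that SS2D runs along each scan route; in the selective variant, B, C, and Δ are input-dependent.

```latex
% Continuous dynamics, ZOH discretization, and the resulting recurrence (standard Mamba notation).
\begin{aligned}
  h'(t) &= A\,h(t) + B\,x(t), & y(t) &= C\,h(t) \\
  \bar{A} &= \exp(\Delta A), & \bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B \\
  h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t
\end{aligned}
```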
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of the SS2D module and experimental results.
Point-by-point responses
-
Referee: [§3.2] (SS2D module description): the claim that scanning along exactly four fixed routes 'facilitates the collection of contextual information from various sources and perspectives' is load-bearing for the linear-complexity superiority argument, yet no derivation, information-theoretic argument, or ablation on route count/adaptivity is supplied to show that these four traversals are sufficient to avoid systematic blind spots for arbitrary 2D spatial configurations.
Authors: We agree that additional justification would strengthen the manuscript. In the revision we have added an ablation study (new Table 4 in Section 4.3) that compares performance with 2, 4, 6, and 8 scanning routes; results show that accuracy saturates at four routes with only marginal gains thereafter. Section 3.2 has been expanded with a short explanation that the four fixed routes (horizontal and vertical, each traversed forward and backward) are selected to capture the dominant spatial axes in 2D grids while preserving linear complexity. A formal information-theoretic derivation or proof of absence of blind spots is not supplied, as it lies beyond the scope of the current work; we note this explicitly as an avenue for future research. revision: partial
-
Referee: [§4] (experimental section): the headline claim of 'superior input scaling efficiency' is asserted without visible tables reporting FLOPs-vs-accuracy curves, resolution scaling plots, or error bars across multiple runs; without these quantitative details the central efficiency advantage cannot be independently verified from the manuscript.
Authors: We thank the referee for highlighting this gap. The revised manuscript adds Figure 5, which plots FLOPs against top-1 accuracy for input resolutions ranging from 224×224 to 1024×1024, and updates Table 3 to report mean accuracy and standard deviation over three independent runs. These additions directly support the input-scaling-efficiency claims and allow independent verification. revision: yes
Circularity Check
No significant circularity in VMamba architectural proposal
Full rationale
The paper presents VMamba as an adaptation of the external Mamba model via an explicit new SS2D module that traverses four fixed scanning routes to handle 2D data. No equations, performance metrics, or modeling claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the four-route design is stated as a bridging mechanism whose sufficiency is asserted via experiments rather than derived tautologically. The derivation chain remains self-contained against external benchmarks such as attention or convolution, with no load-bearing uniqueness theorems or ansatzes imported from overlapping prior work.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of scanning directions
- VSS block hyperparameters
axioms (1)
- Domain assumption: Mamba's selective scan can be extended to 2D images by multiple 1D traversals without loss of essential spatial modeling power.
invented entities (2)
- Visual State-Space (VSS) block (no independent evidence)
- 2D Selective Scan (SS2D) module (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.eight_tick_forces_D3 (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives."
- IndisputableMonolith.Cost.FunctionalEquation.Jcost_cosh_identity (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
-
DGSSM: Diffusion guided state-space models for multimodal salient object detection
DGSSM formulates multimodal salient object detection as a progressive denoising process using diffusion-guided Mamba models, achieving better boundary accuracy and outperforming prior methods on 13 benchmarks.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI
FreeHemoSeg detects fetal GMH-IVH on T2-weighted MRI with high sensitivity and specificity and moderate segmentation accuracy using pseudo-image synthesis from normal scans, outperforming supervised and unsupervised b...
-
EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...
-
BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments
BVI-Mamba enhances low-light and underwater videos by combining feature alignment with a UNet architecture built from Visual State Space blocks, claiming better quality and efficiency than prior Transformer or convolu...
-
A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation
Controlled tests on LoveDA and ISPRS Potsdam show visual SSM encoders deliver favorable speed-accuracy trade-offs but suffer most from boundary errors under domain shift, indicating that robustness and boundary-aware ...
-
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media
TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.
-
Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation
DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.
-
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.
-
BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving
BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior methods on nuScenes for BEV instance prediction.
-
Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression
On scarce dual-view pasture data, a simple two-layer gated depthwise convolution fusion achieves R²=0.903, beating cross-view attention transformers (0.833), bidirectional SSMs (0.819), and Mamba (0.793), while backbo...
-
Beyond ZOH: Advanced Discretization Strategies for Vision Mamba
Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.
-
The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 challenge establishes a benchmark for x4 super-resolution of remote sensing infrared images, with 13 teams submitting valid methods evaluated on a dedicated dataset.
-
The First Challenge on Mobile Real-World Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
The NTIRE 2026 mobile real-world image super-resolution challenge received 16 valid submissions and overviews methods balancing image quality with mobile execution speed.
Reference graph
Works this paper leans on
-
[1]
Xcit: Cross-covariance image transformers
Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. NeurIPS, 34:20014–20027, 2021
work page 2021
-
[2]
Prefix sums and their applications
Guy E Blelloch. Prefix sums and their applications. 1990
work page 1990
-
[3]
MMDetection: Open MMLab Detection Toolbox and Benchmark
Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Mmdetection: Open mmlab detection toolbox and b...
work page 2019
-
[4]
MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark
MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020
work page 2020
-
[5]
Deformable convolutional networks
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017
work page 2017
-
[6]
Coatnet: Marrying convolution and attention for all data sizes
Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 34:3965–3977, 2021
work page 2021
-
[7]
Flashattention-2: Faster attention with better parallelism and work partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2023
work page 2023
-
[8]
Flashattention: Fast and memory- efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. NeurIPS, 35:16344–16359, 2022
work page 2022
-
[9]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009
work page 2009
-
[10]
Davit: Dual attention vision transformers
Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. Davit: Dual attention vision transformers. In ECCV, pages 74–92, 2022
work page 2022
-
[11]
Scaling up your kernels to 31x31: Revisiting large kernel design in cnns
Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In CVPR, pages 11963–11975, 2022
work page 2022
-
[12]
Cswin transformer: A general vision transformer backbone with cross-shaped windows
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, pages 12124–12134, 2022
work page 2022
-
[13]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021
work page 2021
-
[14]
Sigmoid-weighted linear units for neural network function approximation in reinforcement learning
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018
work page 2018
-
[15]
Rmt: Retentive networks meet vision transformers
Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. Rmt: Retentive networks meet vision transformers. In CVPR, 2024
work page 2024
-
[16]
Hungry hungry hippos: Towards language modeling with state space models
Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In ICLR, 2022
work page 2022
-
[17]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page 2023
-
[18]
Hippo: Recurrent memory with optimal polynomial projections
Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. NeurIPS, 33:1474–1487, 2020
work page 2020
-
[19]
On the parameterization and initialization of diagonal state space models
Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. NeurIPS, 35:35971–35983, 2022
work page 2022
-
[20]
Efficiently modeling long sequences with structured state spaces
Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In ICLR, 2021
work page 2021
-
[21]
Combining recurrent, convolutional, and continuous-time models with linear state space layers
Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. NeurIPS, 34:572–585, 2021
work page 2021
-
[22]
Diagonal state spaces are as effective as structured state spaces
Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. NeurIPS, 35:22982–22994, 2022
work page 2022
-
[23]
On the connection between local attention and dynamic depth-wise convolution
Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. On the connection between local attention and dynamic depth-wise convolution. In ICLR, 2021
work page 2021
-
[24]
Liquid structural state-space models
Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In ICLR, 2022
work page 2022
-
[25]
Neighborhood attention transformer
Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In CVPR, pages 6185–6194, 2023
work page 2023
-
[26]
Mask R-CNN
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017
work page 2017
-
[27]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016
work page 2016
-
[28]
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017
work page 2017
-
[29]
Densely connected convolutional networks
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017
work page 2017
-
[30]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, pages 5156–5165, 2020
work page 2020
-
[31]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, pages 1106–1114, 2012
work page 2012
-
[32]
A new approach to linear filtering and prediction problems
Rudolf Emil Kálmán. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960
work page 1960
-
[33]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014
work page 2014
-
[34]
More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity
Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Constantin Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. In ICLR, 2023
work page 2023
-
[35]
Swin transformer v2: Scaling up capacity and resolution
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pages 12009–12019, 2022
work page 2022
-
[36]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021
work page 2021
-
[37]
A convnet for the 2020s
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022
work page 2022
-
[38]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
work page 2017
-
[39]
Container: Context aggregation networks
Jiasen Lu, Roozbeh Mottaghi, Aniruddha Kembhavi, et al. Container: Context aggregation networks. NeurIPS, 34:19160–19171, 2021
work page 2021
-
[40]
Understanding the effective receptive field in deep convolutional neural networks
Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. NeurIPS, 29:4898–4906, 2016
work page 2016
-
[41]
Mega: Moving average equipped gated attention
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In ICLR, 2022
work page 2022
-
[42]
Parallelizing linear recurrent neural nets over sequence length
Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In ICLR, 2018
work page 2018
-
[43]
Long range language modeling via gated state spaces
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. In ICLR, 2023
work page 2023
-
[44]
S4nd: Modeling images and videos as multidimensional signals with state spaces
Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. NeurIPS, 35:2846–2861, 2022
work page 2022
-
[45]
RWKV: reinventing rnns for the transformer era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. RWKV: reinventing rnns for the transformer era. In EMNLP, pages 14048–14077, 2023
work page 2023
-
[46]
Designing network design spaces
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, pages 10428–10436, 2020
work page 2020
-
[47]
Hornet: Efficient high-order spatial interactions with recursive gated convolutions
Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. NeurIPS, 35:10353–10366, 2022
work page 2022
-
[48]
Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models
Mark Schöne, Neeraj Mohan Sushma, Jingyue Zhuge, Christian Mayr, Anand Subramoney, and David Kappel. Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models. arXiv preprint arXiv:2404.18508, 2024
-
[49]
Very deep convolutional networks for large-scale image recognition
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015
work page 2015
-
[50]
Simplified state space layers for sequence modeling
Jimmy TH Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. In ICLR, 2022
work page 2022
-
[51]
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023
work page 2023
-
[52]
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015
work page 2015
-
[53]
Efficientnet: Rethinking model scaling for convolutional neural networks
Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019
work page 2019
-
[54]
Efficient transformers: A survey
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6), 2022
work page 2022
-
[55]
Integrally pre-trained transformer pyramid networks
Yunjie Tian, Lingxi Xie, Zhaozhi Wang, Longhui Wei, Xiaopeng Zhang, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Integrally pre-trained transformer pyramid networks. In CVPR, pages 18610–18620, 2023
work page 2023
-
[56]
Mlp-mixer: An all-mlp architecture for vision
Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. NeurIPS, 34:24261–24272, 2021
work page 2021
-
[57]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021
work page 2021
-
[58]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30:5998–6008, 2017
work page 2017
-
[59]
Selective structured state-spaces for long-form video understanding
Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. Selective structured state-spaces for long-form video understanding. In CVPR, pages 6387–6397, 2023
work page 2023
-
[60]
Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020
work page 2020
-
[61]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021
work page 2021
-
[62]
Pytorch image models
Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019
work page 2019
-
[63]
Unified perceptual parsing for scene understanding
Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018
work page 2018
-
[64]
Focal self-attention for local-global interactions in vision transformers
Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021
-
[65]
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023
work page 2023
-
[66]
Hivit: A simpler and more efficient design of hierarchical vision transformer
Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and Qi Tian. Hivit: A simpler and more efficient design of hierarchical vision transformer. In ICLR, 2023
work page 2023
-
[67]
Graformer: Graph-oriented transformer for 3d pose estimation
Weixi Zhao, Weiqiang Wang, and Yunjie Tian. Graformer: Graph-oriented transformer for 3d pose estimation. In CVPR, pages 20438–20447, 2022
work page 2022
-
[68]
Scene parsing through ade20k dataset
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 5122–5130, 2017
work page 2017
-
[69]
Vision mamba: Efficient visual representation learning with bidirectional state space model
Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In ICML, 2024
work page 2024
-
[70]
Deformable convnets v2: More deformable, better results
Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, pages 9308–9316, 2019
work page 2019
-
[71]
1×" indicates models fine-tuned for 12 epochs, while
Nikola Zubic, Mathias Gehrig, and Davide Scaramuzza. State space models for event cameras. In CVPR, pages 5819–5828, 2024. 14 A Discretization of State Space Models (SSMs) In this section, we explore the correlation between the discretized formulations of State Space Models (SSMs) obtained in Sec. 3 and those derived from the zero-order hold (ZOH) method ...
work page 2024