{"paper":{"title":"VMamba: Visual State Space Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"VMamba adapts Mamba's state-space model to vision by scanning 2D images along four fixed routes to reach linear time complexity.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Hongtian Yu, Jianbin Jiao, Lingxi Xie, Qixiang Ye, Yaowei Wang, Yue Liu, Yunfan Liu, Yunjie Tian, Yuzhong Zhao","submitted_at":"2024-01-18T17:55:39Z","abstract_excerpt":"Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. 
B"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That scanning along exactly four fixed routes in the SS2D module collects sufficient contextual information from 2D data to match or exceed the modeling power of full 2D attention or convolution without missing important spatial relationships.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"VMamba adapts Mamba's state-space model to vision by scanning 2D images along four fixed routes to reach linear time complexity.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1f2cde7f03a8238c2138a575b5b389b8fb7773ffd4ad26363031821683db39c9"},"source":{"id":"2401.10166","kind":"arxiv","version":4},"verdict":{"id":"7c51fdb6-4d7c-4f45-b11d-b52b5208c139","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T18:19:04.355281Z","strongest_claim":"Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models.","one_line_summary":"VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance 
on image tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That scanning along exactly four fixed routes in the SS2D module collects sufficient contextual information from 2D data to match or exceed the modeling power of full 2D attention or convolution without missing important spatial relationships.","pith_extraction_headline":"VMamba adapts Mamba's state-space model to vision by scanning 2D images along four fixed routes to reach linear time complexity."},"references":{"count":86,"sample":[{"doi":"","year":2021,"title":"Xcit: Cross-covariance image transformers","work_id":"5b4b3b64-9af2-4c7a-b3a5-905d48034645","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1990,"title":"Prefix sums and their applications","work_id":"0a1dea29-0937-468d-b8d6-457aa1163820","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"MMDetection: Open mmlab detection toolbox and benchmark","work_id":"88b51c19-cd39-43c5-89fe-5c199a74250d","ref_index":3,"cited_arxiv_id":"1906.07155","is_internal_anchor":true},{"doi":"","year":2020,"title":"MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark","work_id":"3c5cee10-0a23-4c51-8b97-76ddc3cff1bc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Deformable convolutional networks","work_id":"9784de48-0b3a-4e45-b041-e0e7dc5ed61a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":86,"snapshot_sha256":"7f84ebd1d8398e505ef24b30fb842083183d7710e2073b5478a24b82348e452f","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a2fb05a5c93cd8a0c420301b3c87c5a1df928f1ee14f21b81f5c3a90280d0de6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}