VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
Focal self-attention for local-global interactions in vision transformers
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 5roles
background 2polarities
background 2representative citing papers
GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.
citing papers explorer
-
VMamba: Visual State Space Model
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
-
Can Graphs Help Vision SSMs See Better?
GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
-
Florence: A New Foundation Model for Computer Vision
Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.