Focal self-attention for local-global interactions in vision transformers

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao · 2021 · arXiv 2107.00641

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

VMamba: Visual State Space Model

cs.CV · 2024-01-18 · conditional · novelty 8.0

VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.

Can Graphs Help Vision SSMs See Better?

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

cs.CV · 2024-01-17 · conditional · novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

cs.CV · 2026-03-27 · unverdicted · novelty 6.0

MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.

Florence: A New Foundation Model for Computer Vision

cs.CV · 2021-11-22 · unverdicted · novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

citing papers explorer

Showing 5 of 5 citing papers.

VMamba: Visual State Space Model cs.CV · 2024-01-18 · conditional · none · ref 64
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
Can Graphs Help Vision SSMs See Better? cs.CV · 2026-05-11 · unverdicted · none · ref 73
GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model cs.CV · 2024-01-17 · conditional · none · ref 78
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model cs.CV · 2026-03-27 · unverdicted · none · ref 73
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
Florence: A New Foundation Model for Computer Vision cs.CV · 2021-11-22 · unverdicted · none · ref 23
Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

Focal self-attention for local-global interactions in vision transformers

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer