Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
3 Pith papers cite this work.
Fields: cs.CV (3)
Years: 2021 (3)
Representative citing papers
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
- BEiT: BERT Pre-Training of Image Transformers
  BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.
- TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
  TransUNet is a hybrid CNN-Transformer architecture that outperforms prior U-Net and Transformer baselines on multi-organ and cardiac medical image segmentation tasks.