Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
Imagenet: A large-scale hierarchical image database
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.
LPT reduces overfitting during prompt tuning of VLMs by CLIP-based foreground filtering, a structural preservation constraint aligning features to frozen CLIP, and a hierarchical logit constraint at the output, improving generalization on base-to-novel, cross-dataset, and domain-generalization tasks
citing papers explorer
-
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.