pith. machine review for the scientific record.

arxiv: 2103.14030 · v2 · submitted 2021-03-25 · 💻 cs.CV · cs.LG

Recognition: 1 theorem link · Lean Theorem

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:24 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Swin Transformer · vision transformer · shifted windows · hierarchical architecture · image classification · object detection · semantic segmentation

The pith

Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Swin Transformer as a general-purpose backbone for computer vision that adapts the Transformer architecture to handle large variations in visual scale and the high resolution of images. It builds a hierarchical representation by computing self-attention only inside non-overlapping local windows and then shifting the windows in successive layers to enable cross-window information flow. This design keeps computational cost linear in image size while allowing modeling at multiple scales. The resulting model serves effectively for image classification on ImageNet as well as dense prediction tasks such as object detection on COCO and semantic segmentation on ADE20K, where it exceeds prior state-of-the-art numbers by clear margins. The same hierarchical shifted-window approach also improves performance when applied to all-MLP vision architectures.

Core claim

The Swin Transformer is a hierarchical vision Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size, making it compatible with a broad range of vision tasks.
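
The linear-complexity claim can be made concrete with the cost comparison the paper gives in its §3.2 (reproduced here from the published equations): for an h×w patch map with channel dimension C and window size M,

```latex
\Omega(\mathrm{MSA})   = 4\,hwC^{2} + 2\,(hw)^{2}C
\Omega(\text{W-MSA})   = 4\,hwC^{2} + 2\,M^{2}\,hwC
```

The first term in each line covers the linear query/key/value and output projections; the second is the attention itself, quadratic in the token count hw for global MSA but linear once attention is confined to M×M windows with M held fixed (M = 7 throughout the paper).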

What carries the argument

The shifted window self-attention mechanism, which partitions the feature map into non-overlapping local windows for attention computation and shifts the windows across layers to connect information between adjacent windows.
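
As a concrete illustration, here is a minimal PyTorch sketch of the two operations named above: regular window partitioning, and the cyclic shift that lets the next layer's windows straddle the previous boundaries. `window_partition` mirrors a helper in the public Swin repository, but this snippet is illustrative rather than the authors' implementation, and it assumes H and W are divisible by the window size M.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.

    Returns a (B * H/M * W/M, M, M, C) stack of windows, on each of which
    self-attention is computed independently.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

# W-MSA layer: attention inside regular windows.
x = torch.randn(1, 56, 56, 96)            # Swin-T stage-1 shape (illustrative)
regular = window_partition(x, M=7)        # (64, 7, 7, 96)

# SW-MSA layer: cyclically shift by floor(M/2) before partitioning, so the
# new windows cross the old boundaries and information flows between windows.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
cross = window_partition(shifted, M=7)    # same shapes, different grouping
```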

If this is right

  • The model achieves 87.3 top-1 accuracy on ImageNet-1K classification.
  • It reaches 58.7 box AP and 51.1 mask AP on COCO test-dev for object detection and instance segmentation.
  • It obtains 53.5 mIoU on ADE20K validation for semantic segmentation, exceeding the previous best by 3.2 points.
  • The hierarchical shifted-window design also improves accuracy when used inside all-MLP vision models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The linear scaling property could support processing of images at resolutions far higher than those tested on COCO or ADE20K.
  • Because window size remains fixed, the architecture may require adjustment when applied to domains with scale distributions very different from natural images.
  • The same local-plus-shift pattern could be tested on video or volumetric data where temporal or depth dimensions introduce additional scale variation.

Load-bearing premise

The fixed window size and shift pattern chosen during ImageNet training transfer effectively to detection and segmentation heads on COCO and ADE20K without major retuning.

What would settle it

An ablation that swaps the Swin backbone into an existing detection framework while freezing all other components and retraining only the backbone would reveal whether the reported gains come primarily from the Transformer architecture itself.
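
A toy sketch of the freezing protocol such an ablation implies. The `detector` here is a hypothetical two-part stand-in, not a real detection framework; in practice one would apply the same parameter filter inside MMDetection or a similar codebase.

```python
import torch
from torch import nn

# Hypothetical stand-in: 'backbone' is the swappable component under test,
# 'head' represents every frozen downstream component of the framework.
detector = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 96, kernel_size=4, stride=4),
    "head": nn.Conv2d(96, 80, kernel_size=1),
})

# Freeze every component except the backbone, then retrain only the backbone.
for name, param in detector.named_parameters():
    param.requires_grad = name.startswith("backbone")

trainable = [p for p in detector.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)
```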

read the original abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes Swin Transformer, a hierarchical vision Transformer that uses shifted windows to compute self-attention within non-overlapping local windows while enabling cross-window connections. This yields linear complexity with image size and multi-scale feature modeling. The architecture is evaluated as a backbone on ImageNet-1K classification (87.3% top-1), COCO object detection (58.7 box AP, 51.1 mask AP), and ADE20K semantic segmentation (53.5 mIoU), surpassing prior state-of-the-art by +2.7 box AP, +2.6 mask AP, and +3.2 mIoU respectively.

Significance. If the results hold, the work establishes hierarchical shifted-window Transformers as competitive general-purpose vision backbones, with publicly released code supporting reproducibility. The design addresses key vision-specific challenges (scale variation, high resolution) more efficiently than prior global-attention Transformers, and the consistent gains across three tasks with standard heads (Mask R-CNN, UperNet) indicate broad applicability.

major comments (1)
  1. [§5.2–5.3 (COCO and ADE20K experiments)] The central performance claims on COCO and ADE20K rest on the assumption that the ImageNet-tuned window size M=7 and shift pattern transfer without retuning; the manuscript reports strong numbers but provides no ablation isolating backbone gains from head-specific tuning or window-size sensitivity on these tasks.
minor comments (3)
  1. [§3.2] The complexity analysis claims O(HW) scaling, but the overhead of the cyclic shift operation and window merging is not separately timed or bounded in the reported FLOPs; a micro-benchmark of the kind sketched after this list would bound the shift cost.
  2. [Tables 2–3] Baseline citations for some competing methods (e.g., recent CNN and Transformer variants) are incomplete; adding the original references would improve traceability.
  3. [Figure 2] The shifted-window diagram would benefit from explicit arrows or annotations showing the shift direction and the resulting cross-window attention links.
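
To make the first minor point actionable, a micro-benchmark along these lines would bound the cyclic-shift overhead in isolation. This is a hypothetical sketch, not a measurement from the paper; the tensor shape matches Swin-T's stage-1 resolution and the shift of 3 is floor(M/2) for M = 7.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 56, 56, 96, device=device)  # batch of stage-1 feature maps

def per_call_seconds(fn, iters=100):
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

# Cost of one cyclic shift, to be compared against a full SW-MSA block.
print(per_call_seconds(lambda: torch.roll(x, shifts=(-3, -3), dims=(1, 2))))
```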

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thorough review and the recommendation for minor revision. We are pleased that the significance of the work is recognized. Below we provide a point-by-point response to the major comment.

read point-by-point responses
  1. Referee: [§5.2–5.3 (COCO and ADE20K experiments)] The central performance claims on COCO and ADE20K rest on the assumption that the ImageNet-tuned window size M=7 and shift pattern transfer without retuning; the manuscript reports strong numbers but provides no ablation isolating backbone gains from head-specific tuning or window-size sensitivity on these tasks.

    Authors: We thank the referee for this insightful comment. The window size of M=7 was tuned on the ImageNet-1K classification task, and the same configuration, including the shift pattern, is directly transferred to the COCO and ADE20K experiments. This choice was made deliberately to demonstrate that Swin Transformer serves as a general-purpose backbone that does not require task-specific retuning of its core components. The object detection and semantic segmentation heads are standard implementations (Mask R-CNN and UperNet, respectively) without additional hyperparameter optimization beyond what is typical in the literature. Consequently, the reported gains (+2.7 box AP, +2.6 mask AP on COCO; +3.2 mIoU on ADE20K) can be attributed to the hierarchical shifted-window design of the backbone. Nevertheless, we acknowledge the value of further ablations and will add a sensitivity analysis for the window size M on the COCO detection task in the revised manuscript to isolate these effects more clearly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in architecture proposal or empirical claims

full rationale

The paper defines a new hierarchical Transformer architecture with shifted-window attention and patch merging as explicit design choices motivated by efficiency and multi-scale needs. These are not derived from or equivalent to any fitted parameters or prior results inside the paper; the attention formulation (W-MSA/SW-MSA) and complexity analysis (O(HW)) follow directly from the stated window partitioning rules without reduction to inputs. All reported performance numbers (ImageNet top-1, COCO AP, ADE20K mIoU) are obtained via standard training on held-out validation/test sets after ImageNet pre-training, with no internal equations or predictions that collapse to the architecture definition itself. Ablations isolate the shift operation and hierarchy contributions independently. The public code link is a non-load-bearing reference and does not support any central claim. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The architecture rests on the standard transformer self-attention equations plus the new window-partitioning operator; no new physical entities or ad-hoc constants are introduced beyond ordinary hyper-parameters such as window size.

free parameters (1)
  • window size M
    Fixed at 7 for most experiments; chosen by hand rather than learned from data.
axioms (2)
  • standard math: Self-attention inside each window follows the original Transformer formulation, augmented with a learned relative position bias (formula reproduced below).
    Invoked in Section 3.2 without re-derivation.
  • domain assumption: Cyclic shift of windows preserves the linear complexity property.
    Stated in Section 3.3; follows directly from the non-overlapping window definition.
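
For reference, the per-window attention the first axiom invokes is the standard scaled dot-product form; the paper adds a learned relative position bias B to the attention logits:

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{SoftMax}\!\left( Q K^{\top} / \sqrt{d} + B \right) V,
\qquad Q, K, V \in \mathbb{R}^{M^{2} \times d},\;
       B \in \mathbb{R}^{M^{2} \times M^{2}}.
```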

pith-pipeline@v0.9.0 · 5604 in / 1429 out tokens · 20031 ms · 2026-05-15T19:24:34.375534+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced: unclear

    Relation between the paper passage and the cited Recognition theorem.

    we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Architecture-Aware Explanation Auditing for Industrial Visual Inspection

    cs.LG 2026-05 conditional novelty 7.0

    Explanation faithfulness for deep classifiers on wafer maps is highest when the explainer matches the model's native readout structure, with ViT-Tiny plus Attention Rollout achieving lower Deletion AUC than mismatched...

  2. Hierarchical Transformer Preconditioning for Interactive Physics Simulation

    cs.GR 2026-05 unverdicted novelty 7.0

    A hierarchical transformer preconditioner with H-matrix structure and cosine-Hutchinson training delivers up to 2.7x speedup over prior neural methods on stiff multiphase Poisson systems up to N=16384.

  3. Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting

    cs.CV 2026-03 accept novelty 7.0

    BCAF fuses native-grid high-res RGB and low-res HSI via bidirectional cross-attention in adapted Swin Transformers to reach state-of-the-art mIoU on SpectralWaste and a new industrial dataset while running at real-tim...

  4. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    cs.CV 2023-12 conditional novelty 7.0

    Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.

  5. iBOT: Image BERT Pre-Training with Online Tokenizer

    cs.CV 2021-11 unverdicted novelty 7.0

    iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.

  6. BEiT: BERT Pre-Training of Image Transformers

    cs.CV 2021-06 conditional novelty 7.0

    BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.

  7. Hierarchical Transformer Preconditioning for Interactive Physics Simulation

    cs.GR 2026-05 unverdicted novelty 6.0

    The Hierarchical Transformer Preconditioner uses a weak-admissibility H-matrix prior and cosine-Hutchinson objective to precondition large Poisson systems, delivering interactive frame rates with up to 28x speedup ove...

  8. Spectral Vision Transformer for Efficient Tokenization with Limited Data

    cs.CV 2026-05 unverdicted novelty 6.0

    A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.

  9. A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series

    cs.CV 2026-05 unverdicted novelty 6.0

    A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.

  10. Semantic Segmentation for Histopathology using Learned Regularization based on Global Proportions

    eess.IV 2026-04 unverdicted novelty 6.0

    VSLP infers dense segmentations from global label proportions via a pre-trained transformer for initial confidence maps followed by variational optimization using Wasserstein fidelity and a learned regularizer, outper...

  11. DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...

  12. LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...

  13. InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

    cs.RO 2026-02 unverdicted novelty 6.0

    InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

  14. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    cs.CV 2023-03 accept novelty 6.0

    Grounding DINO fuses language and vision via feature enhancer, language-guided query selection, and cross-modality decoder in a DINO backbone, achieving 52.5 AP zero-shot on COCO and a new record of 26.1 AP mean on ODinW.

  15. YOLOX: Exceeding YOLO Series in 2021

    cs.CV 2021-07 accept novelty 6.0

    YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.

  16. Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

    cs.CV 2026-05 unverdicted novelty 5.0

    TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.

  17. Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

    cs.CV 2026-05 unverdicted novelty 5.0

    DL models predict 12 AD risk factors from colored fundus photos, with saliency maps highlighting optic nerve and vessels that also differ in preclinical AD cases.

  18. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  19. KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

    cs.LG 2026-04 unverdicted novelty 4.0

    KAYRA packages a cascade of EfficientNet-B5 + U-Net, Mask R-CNN, and ResNet-18 models into a microservice architecture that supports both cloud and on-premise deployment and reaches 98.91% segmentation accuracy in a p...

  20. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 19 Pith papers · 4 internal anchors
