pith. machine review for the scientific record.

arxiv: 2103.14030 · v2 · submitted 2021-03-25 · 💻 cs.CV · cs.LG

Recognition: 1 theorem link · Lean Theorem

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:24 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Swin Transformer · vision transformer · shifted windows · hierarchical architecture · image classification · object detection · semantic segmentation

The pith

Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Swin Transformer as a general-purpose backbone for computer vision that adapts the Transformer architecture to handle large variations in visual scale and the high resolution of images. It builds a hierarchical representation by computing self-attention only inside non-overlapping local windows and then shifting the windows in successive layers to enable cross-window information flow. This design keeps computational cost linear in image size while allowing modeling at multiple scales. The resulting model serves effectively for image classification on ImageNet as well as dense prediction tasks such as object detection on COCO and semantic segmentation on ADE20K, where it exceeds prior state-of-the-art numbers by clear margins. The same hierarchical shifted-window approach also improves performance when applied to all-MLP vision architectures.

Core claim

The Swin Transformer is a hierarchical vision Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size, making it compatible with a broad range of vision tasks.
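
The linear-complexity claim can be made concrete with the cost comparison the paper gives in its §3.2 (reproduced here from the published equations): for an h×w patch map with channel dimension C and window size M,

```latex
\Omega(\mathrm{MSA})   = 4\,hwC^{2} + 2\,(hw)^{2}C
\Omega(\text{W-MSA})   = 4\,hwC^{2} + 2\,M^{2}\,hwC
```

The first term in each line covers the linear query/key/value and output projections; the second is the attention itself, quadratic in the token count hw for global MSA but linear once attention is confined to M×M windows with M held fixed (M = 7 throughout the paper).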

What carries the argument

The shifted window self-attention mechanism, which partitions the feature map into non-overlapping local windows for attention computation and shifts the windows across layers to connect information between adjacent windows.
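
As a concrete illustration, here is a minimal PyTorch sketch of the two operations named above: regular window partitioning, and the cyclic shift that lets the next layer's windows straddle the previous boundaries. `window_partition` mirrors a helper in the public Swin repository, but this snippet is illustrative rather than the authors' implementation, and it assumes H and W are divisible by the window size M.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.

    Returns a (B * H/M * W/M, M, M, C) stack of windows, on each of which
    self-attention is computed independently.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

# W-MSA layer: attention inside regular windows.
x = torch.randn(1, 56, 56, 96)            # Swin-T stage-1 shape (illustrative)
regular = window_partition(x, M=7)        # (64, 7, 7, 96)

# SW-MSA layer: cyclically shift by floor(M/2) before partitioning, so the
# new windows cross the old boundaries and information flows between windows.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
cross = window_partition(shifted, M=7)    # same shapes, different grouping
```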

If this is right

  • The model achieves 87.3 top-1 accuracy on ImageNet-1K classification.
  • It reaches 58.7 box AP and 51.1 mask AP on COCO test-dev for object detection and instance segmentation.
  • It obtains 53.5 mIoU on ADE20K validation for semantic segmentation, exceeding the previous best by 3.2 points.
  • The hierarchical shifted-window design also improves accuracy when used inside all-MLP vision models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The linear scaling property could support processing of images at resolutions far higher than those tested on COCO or ADE20K.
  • Because window size remains fixed, the architecture may require adjustment when applied to domains with scale distributions very different from natural images.
  • The same local-plus-shift pattern could be tested on video or volumetric data where temporal or depth dimensions introduce additional scale variation.

Load-bearing premise

The fixed window size and shift pattern chosen during ImageNet training transfer effectively to detection and segmentation heads on COCO and ADE20K without major retuning.

What would settle it

An ablation that swaps the Swin backbone into an existing detection framework while freezing all other components and retraining only the backbone would reveal whether the reported gains come primarily from the Transformer architecture itself.
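
A toy sketch of the freezing protocol such an ablation implies. The `detector` here is a hypothetical two-part stand-in, not a real detection framework; in practice one would apply the same parameter filter inside MMDetection or a similar codebase.

```python
import torch
from torch import nn

# Hypothetical stand-in: 'backbone' is the swappable component under test,
# 'head' represents every frozen downstream component of the framework.
detector = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 96, kernel_size=4, stride=4),
    "head": nn.Conv2d(96, 80, kernel_size=1),
})

# Freeze every component except the backbone, then retrain only the backbone.
for name, param in detector.named_parameters():
    param.requires_grad = name.startswith("backbone")

trainable = [p for p in detector.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)
```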

read the original abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes Swin Transformer, a hierarchical vision Transformer that uses shifted windows to compute self-attention within non-overlapping local windows while enabling cross-window connections. This yields linear complexity with image size and multi-scale feature modeling. The architecture is evaluated as a backbone on ImageNet-1K classification (87.3% top-1), COCO object detection (58.7 box AP, 51.1 mask AP), and ADE20K semantic segmentation (53.5 mIoU), surpassing prior state-of-the-art by +2.7 box AP, +2.6 mask AP, and +3.2 mIoU respectively.

Significance. If the results hold, the work establishes hierarchical shifted-window Transformers as competitive general-purpose vision backbones, with publicly released code supporting reproducibility. The design addresses key vision-specific challenges (scale variation, high resolution) more efficiently than prior global-attention Transformers, and the consistent gains across three tasks with standard heads (Mask R-CNN, UperNet) indicate broad applicability.

major comments (1)
  1. [§5.2–5.3 (COCO and ADE20K experiments)] The central performance claims on COCO and ADE20K rest on the assumption that the ImageNet-tuned window size M=7 and shift pattern transfer without retuning; the manuscript reports strong numbers but provides no ablation isolating backbone gains from head-specific tuning or window-size sensitivity on these tasks.
minor comments (3)
  1. [§3.2] The complexity analysis claims O(HW) scaling, but the overhead of the cyclic shift operation and window merging is not separately timed or bounded in the reported FLOPs; a micro-benchmark of the kind sketched after this list would bound the shift cost.
  2. [Tables 2–3] Baseline citations for some competing methods (e.g., recent CNN and Transformer variants) are incomplete; adding the original references would improve traceability.
  3. [Figure 2] The shifted-window diagram would benefit from explicit arrows or annotations showing the shift direction and the resulting cross-window attention links.
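
To make the first minor point actionable, a micro-benchmark along these lines would bound the cyclic-shift overhead in isolation. This is a hypothetical sketch, not a measurement from the paper; the tensor shape matches Swin-T's stage-1 resolution and the shift of 3 is floor(M/2) for M = 7.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 56, 56, 96, device=device)  # batch of stage-1 feature maps

def per_call_seconds(fn, iters=100):
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

# Cost of one cyclic shift, to be compared against a full SW-MSA block.
print(per_call_seconds(lambda: torch.roll(x, shifts=(-3, -3), dims=(1, 2))))
```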

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thorough review and the recommendation for minor revision. We are pleased that the significance of the work is recognized. Below we provide a point-by-point response to the major comment.

read point-by-point responses
  1. Referee: [§5.2–5.3 (COCO and ADE20K experiments)] The central performance claims on COCO and ADE20K rest on the assumption that the ImageNet-tuned window size M=7 and shift pattern transfer without retuning; the manuscript reports strong numbers but provides no ablation isolating backbone gains from head-specific tuning or window-size sensitivity on these tasks.

    Authors: We thank the referee for this insightful comment. The window size of M=7 was tuned on the ImageNet-1K classification task, and the same configuration, including the shift pattern, is directly transferred to the COCO and ADE20K experiments. This choice was made deliberately to demonstrate that Swin Transformer serves as a general-purpose backbone that does not require task-specific retuning of its core components. The object detection and semantic segmentation heads are standard implementations (Mask R-CNN and UperNet, respectively) without additional hyperparameter optimization beyond what is typical in the literature. Consequently, the reported gains (+2.7 box AP, +2.6 mask AP on COCO; +3.2 mIoU on ADE20K) can be attributed to the hierarchical shifted-window design of the backbone. Nevertheless, we acknowledge the value of further ablations and will add a sensitivity analysis for the window size M on the COCO detection task in the revised manuscript to isolate these effects more clearly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in architecture proposal or empirical claims

full rationale

The paper defines a new hierarchical Transformer architecture with shifted-window attention and patch merging as explicit design choices motivated by efficiency and multi-scale needs. These are not derived from or equivalent to any fitted parameters or prior results inside the paper; the attention formulation (W-MSA/SW-MSA) and complexity analysis (O(HW)) follow directly from the stated window partitioning rules without reduction to inputs. All reported performance numbers (ImageNet top-1, COCO AP, ADE20K mIoU) are obtained via standard training on held-out validation/test sets after ImageNet pre-training, with no internal equations or predictions that collapse to the architecture definition itself. Ablations isolate the shift operation and hierarchy contributions independently. The public code link is a non-load-bearing reference and does not support any central claim. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The architecture rests on the standard transformer self-attention equations plus the new window-partitioning operator; no new physical entities or ad-hoc constants are introduced beyond ordinary hyper-parameters such as window size.

free parameters (1)
  • window size M
    Fixed at 7 for most experiments; chosen by hand rather than learned from data.
axioms (2)
  • standard math: Self-attention inside each window follows the original Transformer formulation, augmented with a learned relative position bias (formula reproduced below).
    Invoked in Section 3.2 without re-derivation.
  • domain assumption: Cyclic shift of windows preserves the linear complexity property.
    Stated in Section 3.3; follows directly from the non-overlapping window definition.
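
For reference, the per-window attention the first axiom invokes is the standard scaled dot-product form; the paper adds a learned relative position bias B to the attention logits:

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{SoftMax}\!\left( Q K^{\top} / \sqrt{d} + B \right) V,
\qquad Q, K, V \in \mathbb{R}^{M^{2} \times d},\;
       B \in \mathbb{R}^{M^{2} \times M^{2}}.
```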

pith-pipeline@v0.9.0 · 5604 in / 1429 out tokens · 20031 ms · 2026-05-15T19:24:34.375534+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced: unclear

    Relation between the paper passage and the cited Recognition theorem.

    we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Architecture-Aware Explanation Auditing for Industrial Visual Inspection

    cs.LG 2026-05 conditional novelty 7.0

    Explanation faithfulness for deep classifiers on wafer maps is highest when the explainer matches the model's native readout structure, with ViT-Tiny plus Attention Rollout achieving lower Deletion AUC than mismatched...

  2. Hierarchical Transformer Preconditioning for Interactive Physics Simulation

    cs.GR 2026-05 unverdicted novelty 7.0

    A hierarchical transformer preconditioner with H-matrix structure and cosine-Hutchinson training delivers up to 2.7x speedup over prior neural methods on stiff multiphase Poisson systems up to N=16384.

  3. Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting

    cs.CV 2026-03 accept novelty 7.0

    BCAF fuses native-grid high-res RGB and low-res HSI via bidirectional cross-attention in adapted Swin Transformers to reach state-of-the-art mIoU on SpectralWaste and a new industrial dataset while running at real-tim...

  4. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    cs.CV 2023-12 conditional novelty 7.0

    Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.

  5. iBOT: Image BERT Pre-Training with Online Tokenizer

    cs.CV 2021-11 unverdicted novelty 7.0

    iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.

  6. BEiT: BERT Pre-Training of Image Transformers

    cs.CV 2021-06 conditional novelty 7.0

    BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.

  7. Hierarchical Transformer Preconditioning for Interactive Physics Simulation

    cs.GR 2026-05 unverdicted novelty 6.0

    The Hierarchical Transformer Preconditioner uses a weak-admissibility H-matrix prior and cosine-Hutchinson objective to precondition large Poisson systems, delivering interactive frame rates with up to 28x speedup ove...

  8. Spectral Vision Transformer for Efficient Tokenization with Limited Data

    cs.CV 2026-05 unverdicted novelty 6.0

    A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.

  9. A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series

    cs.CV 2026-05 unverdicted novelty 6.0

    A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.

  10. Semantic Segmentation for Histopathology using Learned Regularization based on Global Proportions

    eess.IV 2026-04 unverdicted novelty 6.0

    VSLP infers dense segmentations from global label proportions via a pre-trained transformer for initial confidence maps followed by variational optimization using Wasserstein fidelity and a learned regularizer, outper...

  11. DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...

  12. LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...

  13. InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

    cs.RO 2026-02 unverdicted novelty 6.0

    InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

  14. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    cs.CV 2023-03 accept novelty 6.0

    Grounding DINO fuses language and vision via feature enhancer, language-guided query selection, and cross-modality decoder in a DINO backbone, achieving 52.5 AP zero-shot on COCO and a new record of 26.1 AP mean on ODinW.

  15. YOLOX: Exceeding YOLO Series in 2021

    cs.CV 2021-07 accept novelty 6.0

    YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.

  16. Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

    cs.CV 2026-05 unverdicted novelty 5.0

    TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.

  17. Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

    cs.CV 2026-05 unverdicted novelty 5.0

    DL models predict 12 AD risk factors from colored fundus photos, with saliency maps highlighting optic nerve and vessels that also differ in preclinical AD cases.

  18. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  19. KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

    cs.LG 2026-04 unverdicted novelty 4.0

    KAYRA packages a cascade of EfficientNet-B5 + U-Net, Mask R-CNN, and ResNet-18 models into a microservice architecture that supports both cloud and on-premise deployment and reaches 98.91% segmentation accuracy in a p...

  20. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 19 Pith papers · 4 internal anchors
