Recognition: 1 Lean theorem link
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Pith reviewed 2026-05-15 19:24 UTC · model grok-4.3
The pith
Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient general-purpose backbones, with computational complexity linear in image size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Swin Transformer is a hierarchical vision Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size, making it compatible with a broad range of vision tasks.
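To make the complexity contrast concrete, §3.2 of the paper gives the per-layer attention cost for a feature map of h x w patches with channel dimension C and window size M:

```latex
\Omega(\mathrm{MSA}) = 4\,hwC^{2} + 2\,(hw)^{2}C,
\qquad
\Omega(\text{W-MSA}) = 4\,hwC^{2} + 2\,M^{2}\,hwC
```

The global term is quadratic in the token count hw, while the windowed term is linear once M is fixed (M = 7 by default), which is the basis of the linear-complexity claim.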
What carries the argument
The shifted-window self-attention mechanism: attention is computed within non-overlapping local windows of the feature map, and the window partition is shifted between consecutive layers so that adjacent windows exchange information.
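As a concrete illustration, here is a minimal sketch of the two-step pattern in PyTorch. The helper, tensor names, and toy sizes are illustrative rather than the released Swin-Transformer code, and the feature-map side is assumed divisible by the window size M.

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows*B, M*M, C): non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

B, H, W, C, M = 2, 8, 8, 32, 4
x = torch.randn(B, H, W, C)

# Layer l: self-attention runs inside regular MxM windows, so the cost per
# token depends on M*M rather than on H*W.
windows = window_partition(x, M)                          # (8, 16, 32)

# Layer l+1: cyclically shift the map by floor(M/2) in both spatial dims
# before partitioning, so tokens near former window borders now attend
# within a shared window.
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
shifted_windows = window_partition(shifted, M)            # same shape, new groupings
print(windows.shape, shifted_windows.shape)
```

Attention itself then runs as an ordinary batched operation over the window axis, and the shift is undone with a positive torch.roll after attention.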
If this is right
- The model achieves 87.3% top-1 accuracy on ImageNet-1K classification.
- It reaches 58.7 box AP and 51.1 mask AP on COCO test-dev for object detection and instance segmentation.
- It obtains 53.5 mIoU on ADE20K validation for semantic segmentation, exceeding the previous best by 3.2 points.
- The hierarchical shifted-window design also improves accuracy when used inside all-MLP vision models.
Where Pith is reading between the lines
- The linear scaling property could support processing of images at resolutions far higher than those tested on COCO or ADE20K.
- Because window size remains fixed, the architecture may require adjustment when applied to domains with scale distributions very different from natural images.
- The same local-plus-shift pattern could be tested on video or volumetric data where temporal or depth dimensions introduce additional scale variation.
Load-bearing premise
The fixed window size and shift pattern chosen during ImageNet training transfer effectively to detection and segmentation heads on COCO and ADE20K without major retuning.
What would settle it
An ablation that swaps the Swin backbone into an existing detection framework while freezing all other components and retraining only the backbone would reveal whether the reported gains come primarily from the Transformer architecture itself.
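A runnable skeleton of the freezing step, using torchvision's Mask R-CNN as a stand-in; actually swapping in a Swin backbone (e.g., from timm) is elided here, so this shows only the freeze-everything-but-the-backbone mechanics such an ablation would need.

```python
import torchvision

# Stand-in detector; the proposed ablation would replace its backbone with Swin.
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)

# Freeze the RPN and ROI heads; leave only backbone parameters trainable so
# any change in AP is attributable to the backbone.
for name, param in detector.named_parameters():
    param.requires_grad = name.startswith("backbone.")

trainable = sum(p.numel() for p in detector.parameters() if p.requires_grad)
print(f"trainable backbone-only parameters: {trainable / 1e6:.1f}M")
```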
Original abstract
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Swin Transformer, a hierarchical vision Transformer that uses shifted windows to compute self-attention within non-overlapping local windows while enabling cross-window connections. This yields linear complexity with image size and multi-scale feature modeling. The architecture is evaluated as a backbone on ImageNet-1K classification (87.3% top-1), COCO object detection (58.7 box AP, 51.1 mask AP), and ADE20K semantic segmentation (53.5 mIoU), surpassing prior state-of-the-art by +2.7 box AP, +2.6 mask AP, and +3.2 mIoU respectively.
Significance. If the results hold, the work establishes hierarchical shifted-window Transformers as competitive general-purpose vision backbones, with publicly released code supporting reproducibility. The design addresses key vision-specific challenges (scale variation, high resolution) more efficiently than prior global-attention Transformers, and the consistent gains across three tasks with standard heads (Mask R-CNN, UperNet) indicate broad applicability.
major comments (1)
- [§5.2–5.3 (COCO and ADE20K experiments)] The central performance claims on COCO and ADE20K rest on the assumption that the ImageNet-tuned window size M=7 and shift pattern transfer without retuning; the manuscript reports strong numbers but provides no ablation isolating backbone gains from head-specific tuning or window-size sensitivity on these tasks.
minor comments (3)
- [§3.2] The complexity analysis claims O(HW) scaling, but the overhead of the cyclic shift and window-merging operations is not separately timed or bounded in the reported FLOPs (a micro-benchmark sketch follows this list).
- [Tables 2–3] Baseline citations for some competing methods (e.g., recent CNN and Transformer variants) are incomplete; adding the original references would improve traceability.
- [Figure 2] The shifted-window diagram would benefit from explicit arrows or annotations showing the shift direction and the resulting cross-window attention links.
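On the first minor comment, the shift overhead is cheap to bound empirically. A rough micro-benchmark sketch, with arbitrary stage-1-like sizes standing in for the paper's actual FLOP accounting:

```python
import time
import torch

M = 7
x = torch.randn(1, 56, 56, 96)   # stage-1-sized feature map: H = W = 56, C = 96
q = torch.randn(64, 49, 96)      # 64 windows of 7*7 = 49 tokens each
k = torch.randn(64, 96, 49)

def avg_seconds(fn, iters=200):
    fn()                          # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

shift_t = avg_seconds(lambda: torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2)))
qkt_t = avg_seconds(lambda: q @ k)  # one QK^T of windowed attention, for scale
print(f"cyclic shift: {shift_t * 1e6:.1f} us vs windowed QK^T: {qkt_t * 1e6:.1f} us")
```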
Simulated Author's Rebuttal
We thank the referee for the thorough review and the recommendation for minor revision. We are pleased that the significance of the work is recognized. Below we provide a point-by-point response to the major comment.
Point-by-point responses
Referee: [§5.2–5.3 (COCO and ADE20K experiments)] The central performance claims on COCO and ADE20K rest on the assumption that the ImageNet-tuned window size M=7 and shift pattern transfer without retuning; the manuscript reports strong numbers but provides no ablation isolating backbone gains from head-specific tuning or window-size sensitivity on these tasks.
Authors: We thank the referee for this insightful comment. The window size of M=7 was tuned on the ImageNet-1K classification task, and the same configuration, including the shift pattern, is directly transferred to the COCO and ADE20K experiments. This choice was made deliberately to demonstrate that Swin Transformer serves as a general-purpose backbone that does not require task-specific retuning of its core components. The object detection and semantic segmentation heads are standard implementations (Mask R-CNN and UperNet, respectively) without additional hyperparameter optimization beyond what is typical in the literature. Consequently, the reported gains (+2.7 box AP, +2.6 mask AP on COCO; +3.2 mIoU on ADE20K) can be attributed to the hierarchical shifted-window design of the backbone. Nevertheless, we acknowledge the value of further ablations and will add a sensitivity analysis for the window size M on the COCO detection task in the revised manuscript to isolate these effects more clearly.
revision: yes
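A sketch of what the promised sensitivity analysis could look like; build_swin_detector and evaluate_coco are hypothetical placeholders for whatever training and evaluation harness is used, and the M values are illustrative.

```python
# Hypothetical window-size sweep on COCO detection; both helpers are
# placeholders, not a real API.
for M in (5, 7, 9, 12):
    detector = build_swin_detector(window_size=M)  # placeholder constructor
    box_ap = evaluate_coco(detector)               # placeholder train + eval
    print(f"window size M={M}: box AP = {box_ap:.1f}")
```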
Circularity Check
No significant circularity detected in architecture proposal or empirical claims
Full rationale
The paper defines a new hierarchical Transformer architecture with shifted-window attention and patch merging as explicit design choices motivated by efficiency and multi-scale needs. These are not derived from or equivalent to any fitted parameters or prior results inside the paper; the attention formulation (W-MSA/SW-MSA) and complexity analysis (O(HW)) follow directly from the stated window partitioning rules without reduction to inputs. All reported performance numbers (ImageNet top-1, COCO AP, ADE20K mIoU) are obtained via standard training on held-out validation/test sets after ImageNet pre-training, with no internal equations or predictions that collapse to the architecture definition itself. Ablations isolate the shift operation and hierarchy contributions independently. The public code link is a non-load-bearing reference and does not support any central claim. The derivation chain is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- window size M
axioms (2)
- standard math: Self-attention inside each window is computed exactly as in the original Transformer paper (written out in the formula after this list).
- domain assumption: Cyclic shift of the windows preserves the linear-complexity property.
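Written out, the first axiom is per-window scaled dot-product attention; in the paper's formulation a learned relative position bias B is added inside the softmax:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,
\qquad Q, K, V \in \mathbb{R}^{M^{2} \times d}
```

Dropping B recovers the original Transformer attention, which is what the axiom asserts.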
Lean theorems connected to this paper
- Foundation.DimensionForcing.dimension_forced (tagged unclear): the relation between the paper passage and the cited Recognition theorem is ambiguous.
Linked passage: "we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Architecture-Aware Explanation Auditing for Industrial Visual Inspection
  Explanation faithfulness for deep classifiers on wafer maps is highest when the explainer matches the model's native readout structure, with ViT-Tiny plus Attention Rollout achieving lower Deletion AUC than mismatched...
- Hierarchical Transformer Preconditioning for Interactive Physics Simulation
  A hierarchical transformer preconditioner with H-matrix structure and cosine-Hutchinson training delivers up to 2.7x speedup over prior neural methods on stiff multiphase Poisson systems up to N=16384.
- Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting
  BCAF fuses native-grid high-res RGB and low-res HSI via bidirectional cross-attention in adapted Swin Transformers to reach state-of-the-art mIoU on SpectralWaste and a new industrial dataset while running at real-tim...
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
  Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
- iBOT: Image BERT Pre-Training with Online Tokenizer
  iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
- BEiT: BERT Pre-Training of Image Transformers
  BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.
- Hierarchical Transformer Preconditioning for Interactive Physics Simulation
  The Hierarchical Transformer Preconditioner uses a weak-admissibility H-matrix prior and cosine-Hutchinson objective to precondition large Poisson systems, delivering interactive frame rates with up to 28x speedup ove...
- Spectral Vision Transformer for Efficient Tokenization with Limited Data
  A spectral vision transformer achieves comparable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
- A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series
  A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
- Semantic Segmentation for Histopathology using Learned Regularization based on Global Proportions
  VSLP infers dense segmentations from global label proportions via a pre-trained transformer for initial confidence maps followed by variational optimization using Wasserstein fidelity and a learned regularizer, outper...
- DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
  DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...
- LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
  LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...
- InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
  InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
  Grounding DINO fuses language and vision via a feature enhancer, language-guided query selection, and a cross-modality decoder in a DINO backbone, achieving 52.5 AP zero-shot on COCO and a new record of 26.1 mean AP on ODinW.
- YOLOX: Exceeding YOLO Series in 2021
  YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
- Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy
  TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.
- Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank
  DL models predict 12 AD risk factors from colored fundus photos, with saliency maps highlighting the optic nerve and vessels that also differ in preclinical AD cases.
- UniMesh: Unifying 3D Mesh Understanding and Generation
  UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
- KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment
  KAYRA packages a cascade of EfficientNet-B5 + U-Net, Mask R-CNN, and ResNet-18 models into a microservice architecture that supports both cloud and on-premise deployment and reaches 98.91% segmentation accuracy in a p...
- The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge
  A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.
Reference graph
Works this paper leans on
- [1] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. UniLMv2: Pseudo-masked language models for unified language model pre-training. In International Conference on Machine Learning, pages 642–652. PMLR, 2020.
- [2] Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, and Dmitry Kislyuk. Toward transformer-based object detection. arXiv preprint arXiv:2012.09958, 2020.
- [3] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks, 2020.
- [4] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- [5] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [6] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
- [7] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
- [8] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
- [9] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
- [10] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- [11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
- [12] Yihong Chen, Zheng Zhang, Yue Cao, Liwei Wang, Stephen Lin, and Han Hu. RepPoints v2: Verification meets regression for object detection. In NeurIPS, 2020.
- [13] Cheng Chi, Fangyun Wei, and Han Hu. RelationNet++: Bridging visual representations for object detection via transformer decoder. In NeurIPS, 2020.
- [14] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations.
- [15] Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, and Huaxia Xia. Do we really need explicit position encodings for vision transformers? arXiv preprint arXiv:2102.10882.
- [16] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
- [17] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
- [18] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
- [19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- [21] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11592–11601, 2020.
- [22] Hao-Shu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yong-Lu Li, and Cewu Lu. InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 682–691, 2019.
- [23] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
- [24] Jun Fu, Jing Liu, Yuhang Wang, Yong Li, Yongjun Bao, Jinhui Tang, and Hanqing Lu. Adaptive context network for scene parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6748–6757.
- [25] Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20(3):121–136, 1975.
- [26] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. arXiv preprint arXiv:2012.07177, 2020.
- [27] Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, and Jifeng Dai. Learning region features for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- [28] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.
- [29] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
- [30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [31] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Improving generalization through instance repetition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8129–8138, 2020.
- [32] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
- [33] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3464–3473, October 2019.
- [34] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
- [35] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
- [36] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154.
- [37] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [38] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. arXiv preprint arXiv:1912.11370, 6(2):8, 2019.
- [39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [40] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [41] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–.
- [42] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July.
- [43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [44] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [45] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- [46] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv preprint arXiv:2006.02334.
- [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
- [48] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
- [49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [50] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- [51] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, May 2015.
- [53] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection – SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
- [54] Bharat Singh, Mahyar Najibi, and Larry S Davis. SNIPER: Efficient multi-scale training. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- [55] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.
- [56] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, et al. Sparse R-CNN: End-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450, 2020.
- [57] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- [58] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR.
- [59] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
- [60] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021.
- [61] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision, 2021.
- [62] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. ResMLP: Feedforward networks for image classification with data-efficient training, 2021.
- [63] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
- [64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [65] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [66] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
- [67] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018.
- [68] Ross Wightman. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models, 2019.
- [69] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 418–434, 2018.
- [70] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
- [71] Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, and Han Hu. Disentangled non-local neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
- [72] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.
- [73] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In 16th European Conference on Computer Vision (ECCV 2020), August 2020.
- [74] Yuhui Yuan and Jingdong Wang. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916.
- [75] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
- [76] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
- [77] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- [78] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R Manmatha, et al. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.
- [79] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9759–9768, 2020.
- [80] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10076–10085, 2020.