Self-Supervised Learning for Real-World Object Detection: a Survey
Pith reviewed 2026-05-23 19:00 UTC · model grok-4.3
The pith
Instance discrimination SSL methods pair best with CNN encoders while masked image modeling suits ViT architectures for object detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. Choosing an appropriate SSL pre-training strategy along with a suitable encoder significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.
What carries the argument
Head-to-head comparison of instance discrimination versus masked image modeling SSL pre-training, each paired with either CNN or ViT encoders, measured on COCO and domain-specific infrared imagery for small-object detection accuracy.
If this is right
- CNN-based detectors should default to instance discrimination pre-training to improve small-object recall.
- ViT-based detectors and custom-domain pre-training should use masked image modeling instead.
- The architecture-strategy matching yields measurable gains on real-world tasks such as infrared vehicle detection.
- Practitioners can consult the survey table to pick the combination that matches their backbone and data constraints.
Where Pith is reading between the lines
- The reported pairings may generalize to other dense-prediction tasks that also struggle with small objects.
- Repeating the benchmarks on additional small-object domains such as medical or aerial imagery would test the stability of the CNN-versus-ViT rule.
- Future SSL designs could combine elements of both instance discrimination and masked image modeling to reduce architecture dependence.
Load-bearing premise
The reported performance differences between SSL strategies and encoders arise purely from those choices rather than from uncontrolled differences in training hyperparameters, data curation, or object-size distributions across runs.
What would settle it
A re-run of the COCO and infrared benchmarks that fixes every training detail except the SSL method and encoder type, then shows no consistent accuracy gap between the claimed best pairings.
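One way to picture such a controlled re-run is sketched below in Python. It is a minimal sketch under stated assumptions, not the paper's protocol: the configuration values, the key names, and the `build_runs`, `varied_keys`, and `pairing_holds` helpers are all hypothetical. The idea is to generate one run per (SSL method, encoder) pair from a single shared configuration, verify that nothing else varies, and then test whether the claimed pairings win on small-object AP.

```python
from itertools import product

# Hypothetical shared settings: every run uses exactly these values.
FIXED = {"epochs": 100, "batch_size": 64, "optimizer": "adamw", "lr": 1e-4, "seed": 0}
SSL_METHODS = ("instance_discrimination", "mim")
ENCODERS = ("cnn", "vit")


def build_runs(fixed=FIXED):
    """Return one run config per (SSL method, encoder) pair; all else identical."""
    return [dict(fixed, ssl=s, encoder=e) for s, e in product(SSL_METHODS, ENCODERS)]


def varied_keys(runs):
    """Return the config keys whose values differ across runs."""
    keys = set().union(*runs)
    return {k for k in keys if len({r.get(k) for r in runs}) > 1}


def pairing_holds(ap_small):
    """Check the claimed pairing on measured small-object AP.

    ap_small maps (ssl, encoder) -> AP on small objects from a controlled run.
    True only if instance discrimination wins with the CNN and MIM wins with the ViT.
    """
    cnn_ok = ap_small[("instance_discrimination", "cnn")] > ap_small[("mim", "cnn")]
    vit_ok = ap_small[("mim", "vit")] > ap_small[("instance_discrimination", "vit")]
    return cnn_ok and vit_ok
```

If `varied_keys(build_runs())` returns anything other than the SSL method and the encoder, the comparison is confounded. If `pairing_holds` comes back false across repeated seeds and both datasets, the central claim would be in doubt.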
Original abstract
Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN and ViT-based architectures. Specifically, our benchmark is performed on the widely-used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited for handling uncurated data. Our findings highlight that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys self-supervised learning (SSL) methods for real-world object detection with emphasis on small objects in complex environments. It contrasts instance discrimination and masked image modeling (MIM) approaches, benchmarks them on COCO and an infrared remote-sensing vehicle detection dataset using both CNN and ViT encoders, evaluates the effect of custom domain-specific pre-training, and concludes that instance discrimination performs well with CNN encoders while MIM is better suited to ViT architectures and custom pre-training, yielding gains especially for small-object detection in frugal settings.
Significance. If the reported benchmarks fairly isolate SSL strategy and encoder effects, the survey supplies a practical selection guide for SSL pre-training in object detection that accounts for backbone type, object scale, and domain-specific data. The inclusion of an infrared remote-sensing benchmark and explicit attention to small-object and frugal regimes adds applied relevance beyond generic classification-focused SSL surveys.
major comments (1)
- [Abstract and benchmark description] The central claim that instance discrimination suits CNN encoders while MIM suits ViTs (and custom pre-training) rests on COCO and infrared dataset comparisons. The provided text supplies no indication that training schedules, augmentations, optimizer settings, or object-size stratified splits were held fixed across SSL variants and backbones; without such controls the observed differences could arise from confounding factors rather than the claimed strategy–architecture interaction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We address the single major comment below and commit to revisions that clarify the experimental protocol without altering the reported findings.
- Referee: [Abstract and benchmark description] The central claim that instance discrimination suits CNN encoders while MIM suits ViTs (and custom pre-training) rests on COCO and infrared dataset comparisons. The provided text supplies no indication that training schedules, augmentations, optimizer settings, or object-size stratified splits were held fixed across SSL variants and backbones; without such controls the observed differences could arise from confounding factors rather than the claimed strategy–architecture interaction.
  Authors: We agree that the abstract and high-level benchmark description do not explicitly enumerate the controls. The full manuscript's experimental section standardizes training epochs, batch size, optimizer, and learning-rate schedule across all SSL variants for a given backbone, re-uses the same augmentation pipeline from the original SSL papers where feasible, and evaluates on the official COCO small/medium/large object-size splits. Nevertheless, to eliminate any ambiguity we will (i) expand the abstract with a sentence on controlled variables and (ii) insert a dedicated paragraph in the benchmark description that lists the fixed hyperparameters and confirms object-size stratification. These additions will make the isolation of SSL-strategy and encoder effects explicit.
  Revision: yes
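The object-size stratification mentioned in the response can be illustrated with the standard COCO area thresholds: small objects fall below 32×32 pixels and large objects start at 96×96 pixels. The Python sketch below is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, and the half-open boundaries are a simplifying assumption about areas that sit exactly on a threshold.

```python
from collections import Counter

# COCO size thresholds, expressed as object area in square pixels.
SMALL_MAX = 32 ** 2   # below this area: "small"
MEDIUM_MAX = 96 ** 2  # below this area (and not small): "medium"


def size_bucket(area):
    """Map an object area in square pixels to a COCO-style size bucket."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


def bucket_counts(areas):
    """Count how many objects fall in each size bucket."""
    return Counter(size_bucket(a) for a in areas)
```

Comparing `bucket_counts` across the training and evaluation sets of each run is one simple way to check that object-size distributions are not themselves a confounder.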
Circularity Check
No circularity: survey with external benchmarks
This is a literature survey plus new benchmark results on COCO and an infrared remote-sensing dataset. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text. Claims rest on reported experimental outcomes and cited prior work rather than reducing to self-definition or fitted inputs by construction. The paper is self-contained against external benchmarks and therefore receives the default non-circularity outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: self-supervised pre-training on unlabeled data yields representations transferable to downstream object detection.