arxiv: 2410.07442 · v2 · submitted 2024-10-09 · 💻 cs.CV

Self-Supervised Learning for Real-World Object Detection: a Survey

Pith reviewed 2026-05-23 19:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · object detection · instance discrimination · masked image modeling · CNN · ViT · small object detection · remote sensing

The pith

Instance discrimination SSL methods pair best with CNN encoders while masked image modeling suits ViT architectures for object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews self-supervised learning methods tailored for real-world object detection, with special attention to small objects in complex scenes. It contrasts instance discrimination approaches against masked image modeling and evaluates them across CNN and vision transformer backbones. Experiments on the COCO dataset and an infrared remote-sensing vehicle dataset indicate that instance discrimination benefits CNN encoders, whereas masked image modeling works better with ViTs and when pre-training occurs on custom uncurated data. The results supply a decision guide for choosing the SSL strategy and encoder combination to lift detection accuracy, especially under limited data or compute constraints.

Core claim

Instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. Choosing an appropriate SSL pre-training strategy along with a suitable encoder significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.

What carries the argument

Head-to-head comparison of instance discrimination versus masked image modeling SSL pre-training, each paired with either CNN or ViT encoders, measured on COCO and domain-specific infrared imagery for small-object detection accuracy.

If this is right

  • CNN-based detectors should default to instance discrimination pre-training to improve small-object recall.
  • ViT-based detectors and custom-domain pre-training should use masked image modeling instead.
  • The architecture-strategy matching yields measurable gains on real-world tasks such as infrared vehicle detection.
  • Practitioners can consult the survey table to pick the combination that matches their backbone and data constraints.
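
The pairing rule in the bullets above can be sketched as a small lookup. This is an illustrative simplification, not the survey's decision table: the function name and the `"cnn"` / `"vit"` labels are hypothetical, and letting custom-data pre-training take precedence for a CNN encoder is an assumption, since the summary does not say how the two conditions combine.

```python
def suggest_ssl_strategy(encoder_family, custom_pretraining=False):
    """Suggest an SSL pre-training family from the pairings summarized above.

    encoder_family: "cnn" or "vit" (hypothetical labels).
    custom_pretraining: True when pre-training on custom, uncurated data.
    """
    family = encoder_family.lower()
    if family not in {"cnn", "vit"}:
        raise ValueError(f"unknown encoder family: {encoder_family!r}")
    # Assumption: custom-data pre-training takes precedence over the encoder rule.
    if custom_pretraining or family == "vit":
        return "masked image modeling"
    return "instance discrimination"


print(suggest_ssl_strategy("cnn"))
print(suggest_ssl_strategy("vit"))
print(suggest_ssl_strategy("cnn", custom_pretraining=True))
```

A real choice would still depend on the per-dataset numbers in the survey; the sketch only encodes the headline rule.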

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported pairings may generalize to other dense-prediction tasks that also struggle with small objects.
  • Repeating the benchmarks on additional small-object domains such as medical or aerial imagery would test the stability of the CNN-versus-ViT rule.
  • Future SSL designs could combine elements of both instance discrimination and masked image modeling to reduce architecture dependence.

Load-bearing premise

The reported performance differences between SSL strategies and encoders arise purely from those choices rather than from uncontrolled differences in training hyperparameters, data curation, or object-size distributions across runs.

What would settle it

A re-run of the COCO and infrared benchmarks that fixes every training detail except the SSL method and encoder type. If that re-run shows no consistent accuracy gap between the claimed best pairings and the alternatives, the core claim does not hold.
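
One way to picture such a controlled re-run is a full grid over the two factors with every other setting pinned. The sketch below is illustrative only: the method labels, encoder labels, and hyperparameter values are placeholder assumptions, not the survey's protocol.

```python
from itertools import product

# Hypothetical factor levels; the survey's actual method and backbone lists may differ.
SSL_METHODS = ["instance_discrimination", "masked_image_modeling"]
ENCODERS = ["cnn", "vit"]

# Placeholder values, held identical across every run so that only the two
# factors above vary. None of these numbers come from the paper.
FIXED = {"epochs": 100, "batch_size": 256, "optimizer": "adamw", "seed": 0}


def build_runs(methods=SSL_METHODS, encoders=ENCODERS, fixed=FIXED):
    """Return one run config per (method, encoder) pair with shared settings."""
    return [
        {"ssl_method": m, "encoder": e, **fixed}
        for m, e in product(methods, encoders)
    ]


def varying_keys(runs):
    """Return the set of config keys whose value differs between runs."""
    return {k for k in runs[0] if len({r[k] for r in runs}) > 1}


runs = build_runs()
print(len(runs), sorted(varying_keys(runs)))
```

Checking `varying_keys` before launching anything is a cheap guard against a hyperparameter drifting between runs unnoticed.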

Figures

Figures reproduced from arXiv: 2410.07442 by Alina Ciocarlan, Arnaud Woiselle, Sidonie Lefebvre, Sylvie Le Hégarat-Mascle.

Figure 1: Example of images dealing with object detection. The first row …
Figure 2: Example of object-level instance discrimination pipeline. Here, we represented the ReSim framework, …
Figure 3: Dense instance discrimination loss. … proposed: a) Geometric alignment: VaDeR [32], PixContrast [13], PixPro [13], DUPR [33], InsCon [31], Leopart [16], LC-Loss [34] and CLOVE [35] assume that the geometric transforms between the positive images are known (thanks to the knowledge of the data-augmentation process), and use them to perform spatial alignment. Leopart [16] additionally relies on the attention m…
Figure 4: Common masking strategies for masked image modelling.
Original abstract

Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN and ViT-based architectures. Specifically, our benchmark is performed on the widely-used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited for handling uncurated data. Our findings highlight that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper surveys self-supervised learning (SSL) methods for real-world object detection with emphasis on small objects in complex environments. It contrasts instance discrimination and masked image modeling (MIM) approaches, benchmarks them on COCO and an infrared remote-sensing vehicle detection dataset using both CNN and ViT encoders, evaluates the effect of custom domain-specific pre-training, and concludes that instance discrimination performs well with CNN encoders while MIM is better suited to ViT architectures and custom pre-training, yielding gains especially for small-object detection in frugal settings.

Significance. If the reported benchmarks fairly isolate SSL strategy and encoder effects, the survey supplies a practical selection guide for SSL pre-training in object detection that accounts for backbone type, object scale, and domain-specific data. The inclusion of an infrared remote-sensing benchmark and explicit attention to small-object and frugal regimes adds applied relevance beyond generic classification-focused SSL surveys.

major comments (1)
  1. [Abstract and benchmark description] The central claim that instance discrimination suits CNN encoders while MIM suits ViTs (and custom pre-training) rests on COCO and infrared dataset comparisons. The provided text supplies no indication that training schedules, augmentations, optimizer settings, or object-size stratified splits were held fixed across SSL variants and backbones; without such controls the observed differences could arise from confounding factors rather than the claimed strategy–architecture interaction.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the single major comment below and commit to revisions that clarify the experimental protocol without altering the reported findings.

point-by-point responses
  1. Referee: [Abstract and benchmark description] The central claim that instance discrimination suits CNN encoders while MIM suits ViTs (and custom pre-training) rests on COCO and infrared dataset comparisons. The provided text supplies no indication that training schedules, augmentations, optimizer settings, or object-size stratified splits were held fixed across SSL variants and backbones; without such controls the observed differences could arise from confounding factors rather than the claimed strategy–architecture interaction.

    Authors: We agree that the abstract and high-level benchmark description do not explicitly enumerate the controls. The full manuscript's experimental section standardizes training epochs, batch size, optimizer, and learning-rate schedule across all SSL variants for a given backbone, re-uses the same augmentation pipeline from the original SSL papers where feasible, and evaluates on the official COCO small/medium/large object-size splits. Nevertheless, to eliminate any ambiguity we will (i) expand the abstract with a sentence on controlled variables and (ii) insert a dedicated paragraph in the benchmark description that lists the fixed hyperparameters and confirms object-size stratification. These additions will make the isolation of SSL-strategy and encoder effects explicit.

    Revision: yes
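
The object-size stratification mentioned in the rebuttal can be sketched as follows, using the commonly cited COCO area thresholds of 32² and 96² pixels. The function names are hypothetical, the toy areas are invented, and the handling of objects exactly on a boundary is an assumption of this sketch rather than a statement about the paper's evaluation code.

```python
from collections import Counter

SMALL_MAX = 32 ** 2   # area threshold (pixels^2) below which an object is "small"
MEDIUM_MAX = 96 ** 2  # area threshold below which an object is "medium"


def size_bucket(area):
    """Map an object's pixel area to 'small', 'medium', or 'large'."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


def bucket_counts(areas):
    """Count how many objects fall into each size bucket."""
    return Counter(size_bucket(a) for a in areas)


# Toy areas (hypothetical, not taken from any dataset).
print(bucket_counts([100, 900, 2000, 9000, 10000]))
```

Comparing bucket counts between datasets is one quick way to see whether an accuracy gap could be driven by different object-size distributions rather than by the SSL method.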

Circularity Check

0 steps flagged

No circularity: survey with external benchmarks

This is a literature survey plus new benchmark results on COCO and an infrared remote-sensing dataset. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text. Claims rest on reported experimental outcomes and cited prior work rather than reducing to self-definition or fitted inputs by construction. The paper is self-contained against external benchmarks and therefore receives the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey with added benchmarks, the work rests on standard domain assumptions of self-supervised learning and computer-vision evaluation rather than new postulates. No free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Self-supervised pre-training on unlabeled data yields representations transferable to downstream object detection.
    Invoked throughout the abstract as the premise enabling the surveyed methods.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 3 internal anchors

  1. [1]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755

  2. [2]

    Vehicle detection in aerial imagery: A small target detection benchmark,

    S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” Journal of Visual Communication and Image Representation, vol. 34, pp. 187–203, 2016

  3. [3]

    Momentum contrast for unsupervised visual representation learning,

    K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

  4. [4]

    Bootstrap your own latent - a new approach to self-supervised learning,

    J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent - a new approach to self-supervised learning,” Advances in neural information processing systems, vol. 33, pp. 21 271–21 284, 2020

  5. [5]

    Emerging properties in self-supervised vision transformers,

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660

  6. [6]

    A Survey on Contrastive Self-supervised Learning,

    A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A Survey on Contrastive Self-supervised Learning,” arXiv:2011.00362, Feb. 2021. [Online]. Available: http://arxiv.org/abs/2011.00362

  7. [7]

    Know your self-supervised learning: A survey on image-based generative and discriminative training,

    U. Ozbulak, H. J. Lee, B. Boga, E. T. Anzaku, H. Park, A. Van Messem, W. De Neve, and J. Vankerschaver, “Know your self-supervised learning: A survey on image-based generative and discriminative training,” arXiv preprint arXiv:2305.13689, 2023

  8. [8]

    A survey on self-supervised learning: Algorithms, applications, and future trends,

    J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: Algorithms, applications, and future trends,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  9. [9]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

  10. [10]

    Semantic understanding of scenes through the ade20k dataset,

    B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, pp. 302–321, 2019

  11. [11]

    Region similarity representation learning,

    T. Xiao, C. J. Reed, X. Wang, K. Keutzer, and T. Darrell, “Region similarity representation learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 539–10 548

  12. [12]

    Dense contrastive learning for self-supervised visual pre-training,

    X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, “Dense contrastive learning for self-supervised visual pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3024–3033

  13. [13]

    Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,

    Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 684–16 693

  14. [14]

    Object discovery and representation networks,

    O. J. Hénaff, S. Koppula, E. Shelhamer, D. Zoran, A. Jaegle, A. Zisserman, J. Carreira, and R. Arandjelović, “Object discovery and representation networks,” in European Conference on Computer Vision. Springer, 2022, pp. 123–143

  15. [15]

    Exploring set similarity for dense self-supervised representation learning,

    Z. Wang, Q. Li, G. Zhang, P. Wan, W. Zheng, N. Wang, M. Gong, and T. Liu, “Exploring set similarity for dense self-supervised representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 590–16 599

  16. [16]

    Self-supervised learning of object parts for semantic segmentation,

    A. Ziegler and Y. M. Asano, “Self-supervised learning of object parts for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 502–14 511

  17. [17]

    Benchmarking detection transfer learning with vision transformers,

    Y. Li, S. Xie, X. Chen, P. Dollar, K. He, and R. Girshick, “Benchmarking detection transfer learning with vision transformers,” arXiv preprint arXiv:2111.11429, 2021

  18. [18]

    What do self-supervised vision transformers learn?

    N. Park, W. Kim, B. Heo, T. Kim, and S. Yun, “What do self-supervised vision transformers learn?” in The Eleventh International Conference on Learning Representations, 2022

  19. [19]

    Revealing the dark secrets of masked image modeling,

    Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing the dark secrets of masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 475–14 485

  20. [20]

    Observation, analysis, and solution: Exploring strong lightweight vision transformers via masked image modeling pre-training,

    J. Gao, S. Lin, S. Wang, Y. Kou, Z. Li, L. Li, C. Zhang, X. Zhang, Y. Wang, and W. Hu, “Observation, analysis, and solution: Exploring strong lightweight vision transformers via masked image modeling pre-training,” arXiv preprint arXiv:2404.12210, 2024

  21. [21]

    A survey of self-supervised and few-shot object detection,

    G. Huang, I. Laradji, D. Vazquez, S. Lacoste-Julien, and P. Rodriguez, “A survey of self-supervised and few-shot object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4071–4089, 2022

  22. [22]

    Spatially consistent representation learning,

    B. Roh, W. Shin, I. Kim, and S. Kim, “Spatially consistent representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1144–1153

  23. [23]

    Self-supervised visual representations learning by contrastive mask prediction,

    Y. Zhao, G. Wang, C. Luo, W. Zeng, and Z.-J. Zha, “Self-supervised visual representations learning by contrastive mask prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 160–10 169

  24. [24]

    Aligning pretraining for detection via object-level contrastive learning,

    F. Wei, Y. Gao, Z. Wu, H. Hu, and S. Lin, “Aligning pretraining for detection via object-level contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 22 682–22 694, 2021

  25. [25]

    Casting your model: Learning to localize improves self-supervised representations,

    R. R. Selvaraju, K. Desai, J. Johnson, and N. Naik, “Casting your model: Learning to localize improves self-supervised representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 058–11 067

  26. [26]

    Crafting better contrastive views for siamese representation learning,

    X. Peng, K. Wang, Z. Zhu, M. Wang, and Y. You, “Crafting better contrastive views for siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 031–16 040

  27. [27]

    Instance localization for self-supervised detection pretraining,

    C. Yang, Z. Wu, B. Zhou, and S. Lin, “Instance localization for self-supervised detection pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3987–3996

  28. [28]

    CP²: Copy-paste contrastive pretraining for semantic segmentation,

    F. Wang, H. Wang, C. Wei, A. Yuille, and W. Shen, “CP²: Copy-paste contrastive pretraining for semantic segmentation,” in European Conference on Computer Vision. Springer, 2022, pp. 499–515

  29. [29]

    Unsupervised object-level representation learning from scene images,

    J. Xie, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy, “Unsupervised object-level representation learning from scene images,” Advances in Neural Information Processing Systems, vol. 34, pp. 28 864–28 876, 2021

  30. [30]

    Unsupervised learning of visual features by contrasting cluster assignments,

    M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in neural information processing systems, vol. 33, pp. 9912–9924, 2020

  31. [31]

    Inscon: Instance consistency feature representation via self-supervised learning,

    J. Yang, K. Zhang, Z. Cui, J. Su, J. Luo, and X. Wei, “Inscon: Instance consistency feature representation via self-supervised learning,” arXiv preprint arXiv:2203.07688, 2022

  32. [32]

    Unsupervised learning of dense visual representations,

    P. O. O. Pinheiro, A. Almahairi, R. Benmalek, F. Golemo, and A. C. Courville, “Unsupervised learning of dense visual representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 4489–4500, 2020

  33. [33]

    Deeply unsupervised patch re-identification for pre-training object detectors,

    J. Ding, E. Xie, H. Xu, C. Jiang, Z. Li, P. Luo, and G.-S. Xia, “Deeply unsupervised patch re-identification for pre-training object detectors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  34. [34]

    Self-supervised learning with local contrastive loss for detection and semantic segmentation,

    A. Islam, B. Lundell, H. Sawhney, S. N. Sinha, P. Morales, and R. J. Radke, “Self-supervised learning with local contrastive loss for detection and semantic segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5624–5633

  35. [35]

    Self-supervised learning of contextualized local visual embeddings,

    T. Silva, H. Pedrini, and A. Ramírez, “Self-supervised learning of contextualized local visual embeddings,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 177–186

  36. [36]

    Self-emd: Self-supervised object detection without imagenet,

    S. Liu, Z. Li, and J. Sun, “Self-emd: Self-supervised object detection without imagenet,” arXiv preprint arXiv:2011.13677, 2020

  37. [37]

    Vicregl: Self-supervised learning of local visual features,

    A. Bardes, J. Ponce, and Y. LeCun, “Vicregl: Self-supervised learning of local visual features,” Advances in Neural Information Processing Systems, vol. 35, pp. 8799–8810, 2022

  38. [38]

    Efficient visual pretraining with contrastive detection,

    O. J. Hénaff, S. Koppula, J.-B. Alayrac, A. Van den Oord, O. Vinyals, and J. Carreira, “Efficient visual pretraining with contrastive detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 086–10 096

  39. [39]

    Efficient graph-based image segmentation,

    P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International journal of computer vision, vol. 59, pp. 167–181, 2004

  40. [40]

    Beit: Bert pre-training of image transformers,

    H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” in International Conference on Learning Representations, 2021

  41. [41]

    Image bert pre-training with online tokenizer,

    J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “Image bert pre-training with online tokenizer,” in International Conference on Learning Representations, 2021

  42. [42]

    Self-supervised learning from images with a joint-embedding predictive architecture,

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 619–15 629

  43. [43]

    Simmim: A simple framework for masked image modeling,

    Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9653–9663

  44. [44]

    Context encoders: Feature learning by inpainting,

    D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544

  45. [45]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255

  46. [46]

    You only train once: Learning a general anomaly enhancement network with random masks for hyperspectral anomaly detection,

    Z. Li, Y. Wang, C. Xiao, Q. Ling, Z. Lin, and W. An, “You only train once: Learning a general anomaly enhancement network with random masks for hyperspectral anomaly detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–18, 2023

  47. [47]

    The inaturalist species classification and detection dataset,

    G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The inaturalist species classification and detection dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8769–8778

  48. [48]

    Corrupted image modeling for self-supervised visual pre-training,

    Y. Fang, L. Dong, H. Bao, X. Wang, and F. Wei, “Corrupted image modeling for self-supervised visual pre-training,” in The Eleventh International Conference on Learning Representations, 2022

  49. [49]

    Pixmim: Rethinking pixel reconstruction in masked image modeling,

    Y. Liu, S. Zhang, J. Chen, K. Chen, and D. Lin, “Pixmim: Rethinking pixel reconstruction in masked image modeling,” Transactions on Machine Learning Research, 2024

  50. [50]

    Mst: Masked self-supervised transformer for visual representation,

    Z. Li, Z. Chen, F. Yang, W. Li, Y. Zhu, C. Zhao, R. Deng, L. Wu, R. Zhao, M. Tang et al., “Mst: Masked self-supervised transformer for visual representation,” Advances in Neural Information Processing Systems, vol. 34, pp. 13 165–13 176, 2021

  51. [51]

    What to hide from your students: Attention-guided masked image modeling,

    I. Kakogeorgiou, S. Gidaris, B. Psomas, Y. Avrithis, A. Bursuc, K. Karantzalos, and N. Komodakis, “What to hide from your students: Attention-guided masked image modeling,” in European Conference on Computer Vision. Springer, 2022, pp. 300–318

  52. [52]

    Good helper is around you: Attention-driven masked image modeling,

    Z. Liu, J. Gui, and H. Luo, “Good helper is around you: Attention-driven masked image modeling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1799–1807

  53. [53]

    Milan: Masked image pretraining on language assisted representation,

    Z. Hou, F. Sun, Y.-K. Chen, Y. Xie, and S.-Y. Kung, “Milan: Masked image pretraining on language assisted representation,” arXiv preprint arXiv:2208.06049, 2022

  54. [54]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  55. [55]

    Semmae: Semantic-guided masking for learning masked autoencoders,

    G. Li, H. Zheng, D. Liu, C. Wang, B. Su, and C. Zheng, “Semmae: Semantic-guided masking for learning masked autoencoders,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 290–14 302, 2022

  56. [56]

    Dppmask: Masked image modeling with determinantal point processes,

    J. Xu, Z. Lin, D. Zhou, Y. Yang, X. Liao, Q. Wang, B. Wu, G. Chen, and P.-A. Heng, “Dppmask: Masked image modeling with determinantal point processes,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2266–2276

  57. [57]

    Extracting and composing robust features with denoising autoencoders,

    P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103

  58. [58]

    The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training,

    H. Liu, X. Jiang, X. Li, A. Guo, Y. Hu, D. Jiang, and B. Ren, “The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1649–1656

  59. [59]

    Architecture-agnostic masked image modeling from vit back to cnn,

    S. Li, D. Wu, F. Wu, Z. Zang, and S. Z. Li, “Architecture-agnostic masked image modeling from vit back to cnn,” in International Conference on Machine Learning. PMLR, 2023, pp. 20 149–20 167

  60. [60]

    Masked feature prediction for self-supervised visual pre-training,

    C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichtenhofer, “Masked feature prediction for self-supervised visual pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 668–14 678

  61. [61]

    Self-supervised masking for unsupervised anomaly detection and localization,

    C. Huang, Q. Xu, Y. Wang, Y. Wang, and Y. Zhang, “Self-supervised masking for unsupervised anomaly detection and localization,” IEEE Transactions on Multimedia, 2022

  62. [62]

    A unified view of masked image modeling,

    Z. Peng, L. Dong, H. Bao, F. Wei, and Q. Ye, “A unified view of masked image modeling,” Transactions on Machine Learning Research, 2022

  63. [63]

    Stare at what you see: Masked image modeling without reconstruction,

    H. Xue, P. Gao, H. Li, Y. Qiao, H. Sun, H. Li, and J. Luo, “Stare at what you see: Masked image modeling without reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 732–22 741

  64. [64]

    Are large-scale datasets necessary for self-supervised pre-training?

    A. El-Nouby, G. Izacard, H. Touvron, I. Laptev, H. Jegou, and E. Grave, “Are large-scale datasets necessary for self-supervised pre-training?” arXiv preprint arXiv:2112.10740, 2021

  65. [65]

    Exploring target representations for masked autoencoders,

    X. Liu, J. Zhou, T. Kong, X. Lin, and R. Ji, “Exploring target representations for masked autoencoders,” arXiv preprint arXiv:2209.03917, 2022

  66. [66]

    Designing bert for convolutional networks: Sparse and hierarchical masked modeling,

    K. Tian, Y. Jiang, C. Lin, L. Wang, Z. Yuan et al., “Designing bert for convolutional networks: Sparse and hierarchical masked modeling,” in The Eleventh International Conference on Learning Representations, 2022

  67. [67]

    Convmae: Masked convolution meets masked autoencoders,

    P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “Convmae: Masked convolution meets masked autoencoders,” arXiv preprint arXiv:2205.03892, 2022

  68. [68]

    Mixmae: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers,

    J. Liu, X. Huang, J. Zheng, Y. Liu, and H. Li, “Mixmae: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6252–6261

  69. [69]

    A convnet for the 2020s,

    Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 976–11 986

  70. [70]

    Mask R-CNN

    K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” arXiv:1703.06870, Jan. 2018. [Online]. Available: http://arxiv.org/abs/1703.06870

  71. [71]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  72. [72]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017

  73. [73]

    Unleashing vanilla vision transformer with masked image modeling for object detection,

    Y. Fang, S. Yang, S. Wang, Y. Ge, Y. Shan, and X. Wang, “Unleashing vanilla vision transformer with masked image modeling for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6244–6253

  74. [74]

    Lsotb-tir: A large-scale high-diversity thermal infrared object tracking benchmark,

    Q. Liu, X. Li, Z. He, C. Li, J. Li, Z. Zhou, D. Yuan, J. Li, K. Yang, N. Fan et al., “Lsotb-tir: A large-scale high-diversity thermal infrared object tracking benchmark,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 3847–3856

  75. [75]

    Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset irdst,

    H. Sun, J. Bai, F. Yang, and X. Bai, “Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset irdst,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023

  76. [76]

    Flir data set dataset,

    T. Imaging, “Flir data set dataset,” Mar. 2024, visited on 2024-07-16. [Online]. Available: https://universe.roboflow.com/thermal-imaging-0hwfw/flir-data-set

  77. [77]

    Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,

    H. Wang, L. Zhou, and L. Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8509–8518

  78. [78]

    People detection and tracking from aerial thermal views,

    J. Portmann, S. Lynen, M. Chli, and R. Siegwart, “People detection and tracking from aerial thermal views,” in 2014 IEEE international conference on robotics and automation (ICRA). IEEE, 2014, pp. 1794–1800

  79. [79]

    Hit-uav: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection,

    J. Suo, T. Wang, X. Zhang, H. Chen, W. Zhou, and W. Shi, “Hit-uav: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection,” Scientific Data, vol. 10, no. 1, p. 227, 2023

  80. [80]

    Isnet: Shape matters for infrared small target detection,

    M. Zhang, R. Zhang, Y. Yang, H. Bai, J. Zhang, and J. Guo, “Isnet: Shape matters for infrared small target detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 877–886

Showing first 80 references.